jiseokson/PageBrain
Light-weight LLM Serving with PagedAttention
This is a tool for developers building or running applications that use large language models (LLMs). It manages the GPU memory an LLM uses while generating text (the KV cache) with PagedAttention-style paging, making it more efficient to handle multiple user requests simultaneously on a single GPU. It takes a standard HuggingFace model and serves it in a more memory-efficient way that can handle many concurrent requests.
Use this if you are a developer looking for an educational, hackable reference implementation of modern LLM serving techniques for research or integration into your Python application.
Not ideal if you are an end-user simply looking to interact with an existing LLM or a developer who needs a production-ready, highly optimized LLM serving solution out-of-the-box without wanting to dive into its internals.
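The core idea behind PagedAttention is sketched below in plain Python, assuming hypothetical names (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) that are not PageBrain's actual API: each sequence's KV cache is stored in fixed-size blocks drawn from a shared pool, so memory grows in block-sized steps as tokens are generated instead of being pre-reserved for the maximum sequence length.

```python
# Minimal sketch of PagedAttention-style KV-cache paging.
# Hypothetical names; illustrative only, not PageBrain's real implementation.

BLOCK_SIZE = 16  # tokens stored per physical cache block


class BlockAllocator:
    """Shared pool of fixed-size physical KV-cache blocks."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # indices of free physical blocks

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("KV-cache pool exhausted")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


class Sequence:
    """Maps a sequence's logical token positions to physical blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the current one fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def free(self) -> None:
        # Return all blocks to the pool when the request finishes.
        self.allocator.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):  # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token()
print(len(seq.block_table))  # 3
seq.free()
```

Because blocks are allocated on demand and returned when a request completes, many sequences can share one GPU's cache pool, which is what makes serving concurrent requests memory-efficient.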
Stars
15
Forks
—
Language
Python
License
—
Category
—
Last pushed
Nov 27, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/jiseokson/PageBrain"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
tabularis-ai/be_great
A novel approach for synthesizing tabular data using pretrained large language models
EleutherAI/gpt-neox
An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron...
shibing624/textgen
TextGen: Implementation of Text Generation models, include LLaMA, BLOOM, GPT2, BART, T5, SongNet...
ai-forever/ru-gpts
Russian GPT3 models.
AdityaNG/kan-gpt
The PyTorch implementation of Generative Pre-trained Transformers (GPTs) using Kolmogorov-Arnold...