itsnamgyu/block-transformer
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
This project implements the Block Transformer, an architecture for large language models (LLMs) that separates coarse, block-level context modeling from fine-grained, within-block token decoding to speed up text generation. It takes text data for training and produces a language model that generates text significantly faster than standard models of similar quality. It is aimed at AI researchers and machine learning engineers who develop and deploy LLMs.
163 stars. No commits in the last 6 months.
Use this if you are developing or deploying large language models and need to achieve 10-20x faster text generation throughput without sacrificing output quality.
Not ideal if you are looking for a pre-trained model for direct use without custom training or fine-tuning, or if you don't work with large-scale language model inference.
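The "global-to-local" idea in the title can be illustrated with a toy sketch. This is not the repository's actual code: mean-pooling as the block embedder and the specific shapes below are assumptions chosen purely to show why operating on blocks reduces the cost of global attention.

```python
# Loose, illustrative sketch of global-to-local modeling (NOT the paper's
# implementation): tokens are pooled into coarse block embeddings, a
# "global" model would attend over blocks, and a "local" model would
# decode tokens within each block. Mean-pooling is an assumption here.
import numpy as np

def to_blocks(token_embs: np.ndarray, block_len: int) -> np.ndarray:
    """Aggregate every `block_len` consecutive token embeddings into one block embedding."""
    n_tokens, dim = token_embs.shape
    assert n_tokens % block_len == 0, "sequence length must be divisible by block length"
    return token_embs.reshape(n_tokens // block_len, block_len, dim).mean(axis=1)

# 512 tokens with hidden size 64, grouped into blocks of 4 tokens.
tokens = np.random.randn(512, 64)
blocks = to_blocks(tokens, block_len=4)

# Global self-attention now runs over 128 block embeddings instead of
# 512 tokens, so its quadratic cost shrinks by a factor of 4^2 = 16,
# while local decoding only attends within a 4-token block.
print(blocks.shape)  # (128, 64)
```

The speedup claim above comes from exactly this trade: the expensive global pass touches far fewer positions, and the cheap local pass has a short, fixed attention span.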
Stars: 163
Forks: 9
Language: Python
License: MIT
Category:
Last pushed: Apr 13, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/itsnamgyu/block-transformer"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
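The same endpoint can be queried from Python using only the standard library. A minimal sketch: it assumes the endpoint returns JSON, and the response field names are not verified here.

```python
# Python equivalent of the curl command above, standard library only.
# Assumes the endpoint returns a JSON object; field names not verified.
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(repo: str) -> str:
    """Build the API URL for an "owner/name" repo slug."""
    return f"{BASE}/{repo}"

def fetch_quality(repo: str) -> dict:
    """Fetch and decode the quality record for a repo (100 requests/day without a key)."""
    with urllib.request.urlopen(quality_url(repo)) as resp:
        return json.load(resp)

# Example (performs a live network request):
# data = fetch_quality("itsnamgyu/block-transformer")
# print(data)

print(quality_url("itsnamgyu/block-transformer"))
```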
Higher-rated alternatives
LMCache/LMCache
Supercharge Your LLM with the Fastest KV Cache Layer
Zefan-Cai/KVCache-Factory
Unified KV Cache Compression Methods for Auto-Regressive Models
dataflowr/llm_efficiency
KV Cache & LoRA for minGPT
OnlyTerp/kvtc
First open-source KVTC implementation (NVIDIA, ICLR 2026) -- 8-32x KV cache compression via PCA...
OnlyTerp/turboquant
First open-source implementation of Google TurboQuant (ICLR 2026) -- near-optimal KV cache...