IST-DASLab/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Marlin helps engineers and researchers working with large language models (LLMs) accelerate how quickly their models generate responses. It runs your existing FP16xINT4 quantized LLM weights through a highly optimized mixed-precision matrix-multiply kernel (FP16 activations against INT4 weights), delivering significantly faster inference, especially when handling multiple user requests at once. This tool is for those who deploy and manage LLMs and need to serve many users efficiently.
1,039 stars. No commits in the last 6 months.
Use this if you need to dramatically speed up the inference performance of your large language models, particularly when serving a medium number of simultaneous user requests or running advanced decoding strategies.
Not ideal if you are working with older GPU hardware (Marlin targets NVIDIA Ampere-class GPUs and newer) or if your workload is not LLM inference.
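To make the "FP16xINT4" idea concrete, here is a minimal NumPy sketch of symmetric per-group INT4 weight quantization and dequantization. This is an illustration of the numerical scheme only, not Marlin's actual kernel; the group size and the symmetric [-8, 7] range are assumptions for the example.

```python
# Illustrative sketch only (NOT Marlin's kernel): INT4 weights are stored
# with a per-group FP16 scale and dequantized back to FP16 at compute time.
import numpy as np

def quantize_int4(w: np.ndarray, group_size: int = 128):
    # Symmetric per-group quantization into the signed 4-bit range [-8, 7].
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Recover approximate FP16 weights: q * scale, flattened back out.
    return (q.astype(np.float16) * scale).reshape(-1)

np.random.seed(0)
w = np.random.randn(256).astype(np.float16)
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
# Reconstruction error stays within about half a quantization step.
```

The point of the per-group scale is that quantization error is bounded relative to each group's own magnitude, which is what lets 4-bit weights stay accurate enough for inference while cutting memory traffic roughly 4x versus FP16.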
Stars
1,039
Forks
86
Language
Python
License
Apache-2.0
Category
Last pushed
Sep 04, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/IST-DASLab/marlin"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
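The same endpoint can be called from Python. This is a small sketch around the URL shown above; the shape of the JSON response is not documented here, so the code only fetches and parses it without assuming specific fields.

```python
# Sketch of calling the pt-edge quality API from Python.
# Only the URL pattern comes from the page; response fields are unknown.
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, repo: str) -> str:
    # Build the endpoint shown in the curl example.
    return f"{API_BASE}/{category}/{repo}"

def fetch_quality(category: str, repo: str) -> dict:
    # Anonymous access is rate-limited to 100 requests/day per the page.
    with urllib.request.urlopen(quality_url(category, repo)) as resp:
        return json.load(resp)

url = quality_url("transformers", "IST-DASLab/marlin")
```

Calling `fetch_quality("transformers", "IST-DASLab/marlin")` would return the same JSON the curl command prints.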
Higher-rated alternatives
quic/efficient-transformers
This library empowers users to seamlessly port pretrained models and checkpoints on the...
ManuelSLemos/RabbitLLM
Run 70B+ LLMs on a single 4GB GPU — no quantization required.
alpa-projects/alpa
Training and serving large-scale neural networks with auto parallelization.
arm-education/Advanced-AI-Hardware-Software-Co-Design
Hands-on course materials for ML engineers to master extreme model quantization and on-device...
deepreinforce-ai/CUDA-L2
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning