IST-DASLab/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Marlin helps engineers and researchers working with large language models (LLMs) accelerate how quickly their models generate responses. It runs your existing FP16xINT4 quantized LLM weights through a highly optimized mixed-precision matrix-multiply kernel (FP16 activations against INT4 weights), delivering significantly faster inference, especially when handling multiple user requests at once. This tool is for those who deploy and manage LLMs and need to serve many users efficiently.
1,039 stars. No commits in the last 6 months.
Use this if you need to dramatically speed up the inference performance of your large language models, particularly when serving a medium number of simultaneous user requests or running advanced decoding strategies.
Not ideal if you are working with older GPU hardware (Marlin targets NVIDIA Ampere-class GPUs and newer) or if your workload is not LLM inference.
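To make the "FP16xINT4" idea concrete, here is a minimal NumPy sketch of symmetric per-group INT4 weight quantization and dequantization. This is an illustration of the numerical scheme only, not Marlin's actual kernel; the group size and the symmetric [-8, 7] range are assumptions for the example.

```python
# Illustrative sketch only (NOT Marlin's kernel): INT4 weights are stored
# with a per-group FP16 scale and dequantized back to FP16 at compute time.
import numpy as np

def quantize_int4(w: np.ndarray, group_size: int = 128):
    # Symmetric per-group quantization into the signed 4-bit range [-8, 7].
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Recover approximate FP16 weights: q * scale, flattened back out.
    return (q.astype(np.float16) * scale).reshape(-1)

np.random.seed(0)
w = np.random.randn(256).astype(np.float16)
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
# Reconstruction error stays within about half a quantization step.
```

The point of the per-group scale is that quantization error is bounded relative to each group's own magnitude, which is what lets 4-bit weights stay accurate enough for inference while cutting memory traffic roughly 4x versus FP16.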
Stars
1,039
Forks
86
Language
Python
License
Apache-2.0
Category
Last pushed
Sep 04, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/IST-DASLab/marlin"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
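The same endpoint can be called from Python. This is a small sketch around the URL shown above; the shape of the JSON response is not documented here, so the code only fetches and parses it without assuming specific fields.

```python
# Sketch of calling the pt-edge quality API from Python.
# Only the URL pattern comes from the page; response fields are unknown.
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, repo: str) -> str:
    # Build the endpoint shown in the curl example.
    return f"{API_BASE}/{category}/{repo}"

def fetch_quality(category: str, repo: str) -> dict:
    # Anonymous access is rate-limited to 100 requests/day per the page.
    with urllib.request.urlopen(quality_url(category, repo)) as resp:
        return json.load(resp)

url = quality_url("transformers", "IST-DASLab/marlin")
```

Calling `fetch_quality("transformers", "IST-DASLab/marlin")` would return the same JSON the curl command prints.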
Higher-rated alternatives
quic/efficient-transformers
This library empowers users to seamlessly port pretrained models and checkpoints on the...
ManuelSLemos/RabbitLLM
Run 70B+ LLMs on a single 4GB GPU — no quantization required.
alpa-projects/alpa
Training and serving large-scale neural networks with auto parallelization.
arm-education/Advanced-AI-Hardware-Software-Co-Design
Hands-on course materials for ML engineers to master extreme model quantization and on-device...
deepreinforce-ai/CUDA-L2
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning