IST-DASLab/marlin

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Quality score: 43 / 100 (Emerging)

Marlin helps engineers and researchers working with large language models (LLMs) accelerate how quickly their models generate responses. It runs your existing INT4-quantized weights against FP16 activations, producing significantly faster inference, especially when handling multiple user requests at once. This tool is for those who deploy and manage LLMs and need to serve many users efficiently.
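To make the workflow concrete, here is a minimal sketch of swapping an FP16 linear layer for a Marlin INT4 layer. The marlin.Layer and pack names follow the project README, but treat the exact signatures and the placeholder scales as assumptions and check the repository before use.

import torch
import marlin  # built from the IST-DASLab/marlin repository

# A stand-in FP16 linear layer of the kind found in a transformer block.
linear = torch.nn.Linear(4096, 4096, bias=False).half().cuda()

# Real scales come from a GPTQ-style quantizer; per-output-channel ones
# are used here purely for illustration (an assumption).
scales = torch.ones((1, 4096), dtype=torch.half, device="cuda")

# Pack the FP16 weights into Marlin's INT4 layout (API names per the
# README; signatures may differ across versions).
qlayer = marlin.Layer(4096, 4096, groupsize=-1)
qlayer.pack(linear, scales)
qlayer = qlayer.cuda()

# Inference: FP16 activations times INT4 weights via the Marlin kernel.
x = torch.randn((16, 4096), dtype=torch.half, device="cuda")
y = qlayer(x)  # shape: (16, 4096)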

1,039 stars. No commits in the last 6 months.

Use this if you need to speed up LLM inference substantially, particularly when serving a moderate number of simultaneous requests (batch sizes of roughly 16-32 tokens) or running advanced decoding strategies.

Not ideal if you are working with older GPU hardware (Marlin targets NVIDIA Ampere-class GPUs and newer) or if accelerating LLM inference is not your primary goal.

Tags: LLM deployment, AI inference, model serving, computational efficiency, deep learning optimization
Badges: Stale (6m), No Package, No Dependents
Score breakdown:
Maintenance: 0 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 17 / 25


Stars: 1,039
Forks: 86
Language: Python
License: Apache-2.0
Last pushed: Sep 04, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/IST-DASLab/marlin"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
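If you prefer to consume the endpoint from code, here is a minimal Python sketch using only the standard library; the response is printed raw because the JSON schema isn't documented on this page.

import json
import urllib.request

URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/IST-DASLab/marlin"

# Fetch the quality report; no API key is needed at the free tier.
with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

# Print the payload as-is; field names are not assumed here.
print(json.dumps(data, indent=2))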