ManuelSLemos/RabbitLLM

Run 70B+ LLMs on a single 4GB GPU — no quantization required.

Quality score: 52 / 100 (Established)

This tool helps developers run very large language models (LLMs), such as Qwen2/3 models with tens of billions of parameters, on ordinary consumer graphics cards with as little as 4 GB of video memory. It loads a standard HuggingFace model and lets you generate text responses without specialized hardware. It is aimed at software engineers, ML engineers, and researchers building AI applications or prototypes who want to deploy large LLMs without expensive, high-VRAM GPUs.

Available on PyPI.

Use this if you need to perform inference with large language models (70B+ parameters) on a single GPU with limited VRAM (e.g., 4GB) without sacrificing model quality through quantization.
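To see why this use case is hard, some back-of-envelope VRAM math helps: a 70B-parameter model at fp16 needs roughly 130 GiB just for its weights, far beyond a 4 GB card, while streaming one layer at a time into the GPU keeps the resident footprint small. The sketch below illustrates the arithmetic only; the layer count of 80 is an assumption typical of 70B-class decoders, not a detail confirmed by RabbitLLM's docs, and activations/KV cache are ignored.

```python
# Back-of-envelope VRAM math for unquantized (fp16) inference.
# Assumption: ~80 decoder layers, typical for 70B-class models.
BYTES_PER_PARAM = 2  # fp16

def full_model_gib(n_params: float) -> float:
    """VRAM needed to hold every weight resident at once."""
    return n_params * BYTES_PER_PARAM / 2**30

def per_layer_gib(n_params: float, n_layers: int) -> float:
    """Approximate VRAM for one layer when weights are streamed from
    CPU RAM/disk layer by layer (ignores activations and KV cache)."""
    return full_model_gib(n_params) / n_layers

if __name__ == "__main__":
    params = 70e9  # 70B parameters
    layers = 80    # assumed decoder depth
    print(f"all weights resident: {full_model_gib(params):.1f} GiB")
    print(f"one layer at a time:  {per_layer_gib(params, layers):.2f} GiB")
```

Even with generous headroom for activations, the per-layer figure stays well under 4 GiB, which is what makes streaming-style approaches viable on small cards at the cost of slower generation.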

Not ideal if you need compatibility with LLM architectures other than Qwen2/3, or if you are working on macOS/Apple Silicon.

Tags: LLM deployment, AI application development, resource-constrained AI, machine learning engineering, natural language processing
Maintenance 10 / 25
Adoption 7 / 25
Maturity 20 / 25
Community 15 / 25


Stars: 38
Forks: 7
Language: Python
License: Apache-2.0
Last pushed: Feb 28, 2026
Commits (30d): 0
Dependencies: 12

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ManuelSLemos/RabbitLLM"
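The same endpoint can be called from Python with only the standard library. This is a minimal sketch: the URL path segments (`transformers`, `ManuelSLemos/RabbitLLM`) come from the curl example above, but the response schema and any API-key mechanism are not documented here, so the fetch helper simply returns parsed JSON as-is.

```python
# Minimal stdlib client for the quality endpoint shown above.
# The response JSON structure is not specified here, so we return it raw.
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(ecosystem: str, repo: str) -> str:
    """Build the endpoint URL for a given ecosystem and owner/repo."""
    return f"{BASE}/{ecosystem}/{repo}"

def fetch_quality(ecosystem: str, repo: str) -> dict:
    """GET the quality record and parse it as JSON."""
    with urllib.request.urlopen(quality_url(ecosystem, repo), timeout=10) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    print(quality_url("transformers", "ManuelSLemos/RabbitLLM"))
```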

Open to everyone: 100 requests/day with no key needed; a free key raises the limit to 1,000/day.