ManuelSLemos/RabbitLLM
Run 70B+ LLMs on a single 4GB GPU — no quantization required.
RabbitLLM lets developers run very large language models (LLMs) such as Qwen2/3, with tens of billions of parameters, on ordinary consumer graphics cards with as little as 4GB of VRAM. It takes a standard Hugging Face model and generates text responses without specialized hardware. It is aimed at software engineers, ML engineers, and researchers building AI applications or prototypes who want to run large LLMs without expensive, high-VRAM GPUs.
Available on PyPI.
Use this if you need to perform inference with large language models (70B+ parameters) on a single GPU with limited VRAM (e.g., 4GB) without sacrificing model quality through quantization.
Not ideal if you need compatibility with LLM architectures other than Qwen2/3, or if you are working on macOS/Apple Silicon.
Stars: 38
Forks: 7
Language: Python
License: Apache-2.0
Category: (not listed)
Last pushed: Feb 28, 2026
Commits (30d): 0
Dependencies: 12
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ManuelSLemos/RabbitLLM"
Open to everyone: 100 requests/day with no API key. A free key raises the limit to 1,000/day.
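The same endpoint can also be called from Python. Below is a minimal sketch using only the standard library; the URL path shape (`quality/<ecosystem>/<owner>/<repo>`) is inferred from the curl example above, and since the JSON response fields are not documented here, the sketch does not assume any of them:

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(ecosystem: str, owner: str, repo: str) -> str:
    """Build the quality-endpoint URL for a repository.

    The path layout follows the curl example above; support for
    ecosystems other than "transformers" is an assumption.
    """
    return f"{API_BASE}/{ecosystem}/{owner}/{repo}"


def fetch_quality(ecosystem: str, owner: str, repo: str) -> dict:
    """Fetch the quality record for a repo.

    Assumes the endpoint returns JSON; field names are not
    documented here, so none are hard-coded.
    """
    url = quality_url(ecosystem, owner, repo)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Prints the same URL as the curl example above.
    print(quality_url("transformers", "ManuelSLemos", "RabbitLLM"))
```

With an API key, you would presumably pass it in a header or query parameter; check the service's key documentation for the exact mechanism.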
Related repositories
quic/efficient-transformers
This library empowers users to seamlessly port pretrained models and checkpoints on the...
alpa-projects/alpa
Training and serving large-scale neural networks with auto parallelization.
arm-education/Advanced-AI-Hardware-Software-Co-Design
Hands-on course materials for ML engineers to master extreme model quantization and on-device...
IST-DASLab/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes...
deepreinforce-ai/CUDA-L2
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning