ManuelSLemos/RabbitLLM
Run 70B+ LLMs on a single 4GB GPU — no quantization required.
RabbitLLM lets developers run very large language models (LLMs) such as Qwen2/3, with tens of billions of parameters, on ordinary consumer graphics cards with as little as 4GB of VRAM. It takes a standard Hugging Face model and generates text responses without specialized hardware. It is aimed at software engineers, ML engineers, and researchers building AI applications or prototypes who want to run large LLMs without expensive, high-VRAM GPUs.
Available on PyPI.
Use this if you need to perform inference with large language models (70B+ parameters) on a single GPU with limited VRAM (e.g., 4GB) without sacrificing model quality through quantization.
Not ideal if you need compatibility with LLM architectures other than Qwen2/3, or if you are working on macOS/Apple Silicon.
Stars: 38
Forks: 7
Language: Python
License: Apache-2.0
Category: (not listed)
Last pushed: Feb 28, 2026
Commits (30d): 0
Dependencies: 12
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ManuelSLemos/RabbitLLM"
Open to everyone: 100 requests/day with no API key. A free key raises the limit to 1,000/day.
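The same endpoint can also be called from Python. Below is a minimal sketch using only the standard library; the URL path shape (`quality/<ecosystem>/<owner>/<repo>`) is inferred from the curl example above, and since the JSON response fields are not documented here, the sketch does not assume any of them:

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(ecosystem: str, owner: str, repo: str) -> str:
    """Build the quality-endpoint URL for a repository.

    The path layout follows the curl example above; support for
    ecosystems other than "transformers" is an assumption.
    """
    return f"{API_BASE}/{ecosystem}/{owner}/{repo}"


def fetch_quality(ecosystem: str, owner: str, repo: str) -> dict:
    """Fetch the quality record for a repo.

    Assumes the endpoint returns JSON; field names are not
    documented here, so none are hard-coded.
    """
    url = quality_url(ecosystem, owner, repo)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Prints the same URL as the curl example above.
    print(quality_url("transformers", "ManuelSLemos", "RabbitLLM"))
```

With an API key, you would presumably pass it in a header or query parameter; check the service's key documentation for the exact mechanism.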
Related repositories
quic/efficient-transformers
This library empowers users to seamlessly port pretrained models and checkpoints on the...
alpa-projects/alpa
Training and serving large-scale neural networks with auto parallelization.
arm-education/Advanced-AI-Hardware-Software-Co-Design
Hands-on course materials for ML engineers to master extreme model quantization and on-device...
IST-DASLab/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes...
deepreinforce-ai/CUDA-L2
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning