LLM Quantization Methods for Transformer Models
Tools and implementations for quantizing large language models using techniques like GPTQ, AWQ, and KV cache compression to reduce model size and inference costs. Does NOT include general model compression via pruning, distillation, or training optimization.
There are 75 LLM quantization projects tracked. 3 score above 70 (Verified tier). The highest-rated is ModelCloud/GPTQModel at 83/100 with 1,044 stars. 6 of the top 10 are actively maintained.
Get all 75 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-quantization-methods&limit=20"
```
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
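If you want to consume the endpoint from code rather than curl, a minimal Python sketch is below. The response schema is not documented on this page, so the field names (`name`, `score`, `tier`) and the possible `results` wrapper are assumptions to adjust against the real payload; note also that the example URL uses `limit=20`, so fetching all 75 projects presumably requires raising that parameter.

```python
# Minimal sketch: fetch the llm-quantization-methods dataset and print projects by score.
# Assumptions (not confirmed by this page): the endpoint returns JSON that is either a list
# of project records or a dict wrapping them under "results", and each record carries
# "name", "score", and "tier" fields.
import json
import urllib.request

URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=transformers&subcategory=llm-quantization-methods&limit=20"
)

with urllib.request.urlopen(URL, timeout=30) as resp:
    payload = json.load(resp)

# Handle both a bare list and a {"results": [...]} wrapper.
records = payload.get("results", payload) if isinstance(payload, dict) else payload

for project in sorted(records, key=lambda p: p.get("score", 0), reverse=True):
    print(f'{project.get("score", "?"):>3}  {project.get("tier", "?"):<12} {project.get("name", "?")}')
```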
| # | Model | Description | Tier |
|---|---|---|---|
| 1 | ModelCloud/GPTQModel | LLM model quantization (compression) toolkit with hw acceleration support... | Verified |
| 2 | intel/auto-round | 🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed... | Verified |
| 3 | pytorch/ao | PyTorch native quantization and sparsity for training and inference | Verified |
| 4 | bodaay/HuggingFaceModelDownloader | Simple go utility to download HuggingFace Models and Datasets | Established |
| 5 | NVIDIA/kvpress | LLM KV cache compression made easy | Established |
| 6 | BlinkDL/RWKV-LM | RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can... | Established |
| 7 | Picovoice/picollm | On-device LLM Inference Powered by X-Bit Quantization | Established |
| 8 | jy-yuan/KIVI | [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | Emerging |
| 9 | zackshen/gguf | a GGUF file parser | Emerging |
| 10 | back2matching/turboquant | First open-source TurboQuant KV cache compression for LLM inference. Drop-in... | Emerging |
| 11 | AutoGPTQ/AutoGPTQ | An easy-to-use LLMs quantization package with user-friendly apis, based on... | Emerging |
| 12 | laelhalawani/gguf_modeldb | A quick and optimized solution to manage llama based gguf quantized models,... | Emerging |
| 13 | livingbio/fuzzy-json | Fuzzy-JSON is a compact Python package with no dependencies, designed to... | Emerging |
| 14 | ddh0/easy-llama | Python package wrapping llama.cpp for on-device LLM inference | Emerging |
| 15 | Michael-A-Kuykendall/shimmytok | Pure Rust tokenizer for GGUF models - llama.cpp compatible | Emerging |
| 16 | zjysteven/mink-plus-plus | [ICLR'25 Spotlight] Min-K%++: Improved baseline for detecting pre-training... | Emerging |
| 17 | TencentARC/LLaMA-Pro | [ACL 2024] Progressive LLaMA with Block Expansion. | Emerging |
| 18 | calcuis/gguf-core | a simple way to interact llama with gguf | Emerging |
| 19 | SqueezeAILab/SqueezeLLM | [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization | Emerging |
| 20 | GAIR-NLP/ProX | [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality... | Emerging |
| 21 | laelhalawani/gguf_llama | Wrapper for simplified use of Llama2 GGUF quantized models. | Emerging |
| 22 | awneesht/KVShuttle | Benchmark & decision framework for KV cache transfer compression in... | Emerging |
| 23 | SqueezeAILab/LLM2LLM | [ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement | Emerging |
| 24 | ariannamethod/doe | DoE Janus Architecture: Democracy of Experts | Emerging |
| 25 | LMLK-seal/HuggingGGUF | Hugging Face Model downloader and GGUF Converter. | Emerging |
| 26 | gpustack/gguf-packer-go | Deliver LLMs of GGUF format via Dockerfile. | Emerging |
| 27 | Rishit-dagli/GLU | An easy-to-use library for GLU (Gated Linear Units) and GLU variants in TensorFlow. | Emerging |
| 28 | camenduru/alpaca-lora-colab | Alpaca Lora | Emerging |
| 29 | AaronFeng753/Ollama-Model-Dumper | Export and Backup Ollama models into GGUF and ModelFile | Emerging |
| 30 | NVlabs/RocketKV | [ICML 2025] RocketKV: Accelerating Long-Context LLM Inference via Two-Stage... | Emerging |
| 31 | monk1337/auto-ollama | run ollama & gguf easily with a single command | Emerging |
| 32 | gitctrlx/llama.cu | Llama from scratch in CUDA with Flash Attention. | Emerging |
| 33 | leliuga/cohere-configurations | Co:Here Inference configurations | Emerging |
| 34 | Zishan-Shao/FlashSVD | Welcome to the FlashSVD, an activation aware inference system for SVD-based... | Emerging |
| 35 | ModelTC/QLLM | [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate... | Emerging |
| 36 | StargazerX0/ScaleKV | [NeurIPS 2025] ScaleKV: Memory-Efficient Visual Autoregressive Modeling with... | Emerging |
| 37 | Beomi/BitNet-Transformers | 0️⃣1️⃣🤗 BitNet-Transformers: Huggingface Transformers Implementation of... | Emerging |
| 38 | SqueezeAILab/KVQuant | [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with... | Emerging |
| 39 | codewithdark-git/QuantLLM | QuantLLM is a Python library designed for developers, researchers, and teams... | Emerging |
| 40 | elephantmipt/compressors | A small library with distillation, quantization and pruning pipelines | Emerging |
| 41 | smpanaro/coreml-llm-cli | CLI to demonstrate running a large language model (LLM) on Apple Neural Engine. | Emerging |
| 42 | laelhalawani/glai | glai - GGUF LLAMA AI - Package for simplified model handling and text... | Experimental |
| 43 | calcuis/gguf-selector | GGUF selector | Experimental |
| 44 | calcuis/llama-core | solo connector core built on llama.cpp | Experimental |
| 45 | calcuis/callgg | GGUF caller | Experimental |
| 46 | lpalbou/model-quantizer | Effortlessly quantize, benchmark, and publish Hugging Face models with... | Experimental |
| 47 | petermartens98/Qwen3-LLM-Pytorch-Implementation-From-Scratch | Lightweight LLM inspired by Qwen3, built from scratch in PyTorch. Full... | Experimental |
| 48 | eliahuhorwitz/MoTHer | Official PyTorch Implementation for the "Unsupervised Model Tree Heritage... | Experimental |
| 49 | arcxteam/gguf-convert-model | Auto GGUF Converter for HuggingFace Hub Models with Multiple Quantizations... | Experimental |
| 50 | pszemraj/decoder-pytorch-template | Hackable PyTorch template for decoder-only transformer architecture... | Experimental |
| 51 | jaepil/geometric-adam | A Ray Tracing-Inspired Approach to Neural Network Optimization | Experimental |
| 52 | pecharesjoselito/chuck.optimizer | Optimize neural network training by monitoring loss, gradients, and... | Experimental |
| 53 | Keyvanhardani/kvcache-autotune | Automatic KV-Cache optimization for HuggingFace Transformers. Find the... | Experimental |
| 54 | kyegomez/open_qwen | A non-official implementation of Qwen 3.5, as there doesn’t seem to be a... | Experimental |
| 55 | SolomonB14D3/intelligent-svd | Knowledge-preserving SVD compression for large language models via... | Experimental |
| 56 | megvii-research/IntLLaMA | IntLLaMA: A fast and light quantization solution for LLaMA | Experimental |
| 57 | Evrmind-UK/evr-llama | Runtime binaries for Evrmind EVR-1 models | Experimental |
| 58 | boyazzam/kvcache-autotune | 🚀 Optimize your KVCache performance with automatic tuning for efficient... | Experimental |
| 59 | Zoclee/xojo-llama | A wrapper module to do local LLM inference on GGUF models using the... | Experimental |
| 60 | zzbright1998/SentenceKV | Official implementation of "SentenceKV: Efficient LLM Inference via... | Experimental |
| 61 | bkataru/hf-hub-zig | Zig library and CLI for interacting with the HuggingFace Hub API, with a... | Experimental |
| 62 | ambv231/tinyllama-coreml-ios18-quantization | Quantize TinyLlama-1.1B-Chat from PyTorch to CoreML (float16, int8, int4)... | Experimental |
| 63 | trifledmatter/model-engine | C++ Implementation of Meta's LLaMA v2 Engine. Credited to ggerganov/llama.cpp | Experimental |
| 64 | LiteObject/llm-quantization-playground | A hands-on demo project that compares multiple quantization methods for... | Experimental |
| 65 | Kalmantic/peakweights | Data-free discovery of critical LLM weights. One forward pass. No... | Experimental |
| 66 | GodreignElgin/llm-comparision | Jupyter Notebook for LLM compression via quantization (INT8, INT4, FP16) and... | Experimental |
| 67 | 1337hero/rx7900xtx-llama-bench-vulcan | Benchmark script for llama.cpp & results for AMD RX 7900 XTX - using Vulcan | Experimental |
| 68 | lciric/gptq-from-scratch | GPTQ post-training quantization from scratch — GPT-2, OPT, LLaMA support | Experimental |
| 69 | ScalingOpt/SGG | [ACL 2025 Main] Taming LLMs by Scaling Learning Rates with Gradient Grouping | Experimental |
| 70 | LMLK-seal/ModelQuants | Professional Model Quantization Converter for HuggingFace Transformers | Experimental |
| 71 | j341nono/LLMGusser | CLI guessing game to identify which LLM (Llama vs Gemma) generated text,... | Experimental |
| 72 | MohammadKaso/tiny_Llama_mcp_flutter | edge_flutter enables seamless on-device Large Language Model inference using... | Experimental |
| 73 | alpayariyak/LLaMATH | Improving Mathematical Capabilities of Large Language Models | Experimental |
| 74 | AntonioSabbatellaUni/nlp_llm_context_cost_optimization | Exploring Context Compression techniques for token reduction. Fine-tuning... | Experimental |
| 75 | hrishi-008/LoRA-adapter-to-GGUF-for-Ollama-with-code | I've put together a step-by-step guide to convert your LoRA model to GGUF for Ollama. | Experimental |