LLM Quantization Methods for Transformer Models
Tools and implementations for quantizing large language models using techniques like GPTQ, AWQ, and KV cache compression to reduce model size and inference costs. Does NOT include general model compression via pruning, distillation, or training optimization.
There are 75 LLM quantization projects tracked. 3 score above 70 (Verified tier). The highest-rated is ModelCloud/GPTQModel at 83/100 with 1,044 stars. 6 of the top 10 are actively maintained.
Get all 75 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-quantization-methods&limit=20"
```
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
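If you want to consume the endpoint from code rather than curl, a minimal Python sketch is below. The response schema is not documented on this page, so the field names (`name`, `score`, `tier`) and the possible `results` wrapper are assumptions to adjust against the real payload; note also that the example URL uses `limit=20`, so fetching all 75 projects presumably requires raising that parameter.

```python
# Minimal sketch: fetch the llm-quantization-methods dataset and print projects by score.
# Assumptions (not confirmed by this page): the endpoint returns JSON that is either a list
# of project records or a dict wrapping them under "results", and each record carries
# "name", "score", and "tier" fields.
import json
import urllib.request

URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=transformers&subcategory=llm-quantization-methods&limit=20"
)

with urllib.request.urlopen(URL, timeout=30) as resp:
    payload = json.load(resp)

# Handle both a bare list and a {"results": [...]} wrapper.
records = payload.get("results", payload) if isinstance(payload, dict) else payload

for project in sorted(records, key=lambda p: p.get("score", 0), reverse=True):
    print(f'{project.get("score", "?"):>3}  {project.get("tier", "?"):<12} {project.get("name", "?")}')
```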
| # | Model | Description | Tier |
|---|---|---|---|
| 1 | ModelCloud/GPTQModel | LLM model quantization (compression) toolkit with hw acceleration support... | Verified |
| 2 | intel/auto-round | 🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed... | Verified |
| 3 | pytorch/ao | PyTorch native quantization and sparsity for training and inference | Verified |
| 4 | bodaay/HuggingFaceModelDownloader | Simple go utility to download HuggingFace Models and Datasets | Established |
| 5 | NVIDIA/kvpress | LLM KV cache compression made easy | Established |
| 6 | BlinkDL/RWKV-LM | RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can... | Established |
| 7 | Picovoice/picollm | On-device LLM Inference Powered by X-Bit Quantization | Established |
| 8 | jy-yuan/KIVI | [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | Emerging |
| 9 | zackshen/gguf | a GGUF file parser | Emerging |
| 10 | back2matching/turboquant | First open-source TurboQuant KV cache compression for LLM inference. Drop-in... | Emerging |
| 11 | AutoGPTQ/AutoGPTQ | An easy-to-use LLMs quantization package with user-friendly apis, based on... | Emerging |
| 12 | laelhalawani/gguf_modeldb | A quick and optimized solution to manage llama based gguf quantized models,... | Emerging |
| 13 | livingbio/fuzzy-json | Fuzzy-JSON is a compact Python package with no dependencies, designed to... | Emerging |
| 14 | ddh0/easy-llama | Python package wrapping llama.cpp for on-device LLM inference | Emerging |
| 15 | Michael-A-Kuykendall/shimmytok | Pure Rust tokenizer for GGUF models - llama.cpp compatible | Emerging |
| 16 | zjysteven/mink-plus-plus | [ICLR'25 Spotlight] Min-K%++: Improved baseline for detecting pre-training... | Emerging |
| 17 | TencentARC/LLaMA-Pro | [ACL 2024] Progressive LLaMA with Block Expansion. | Emerging |
| 18 | calcuis/gguf-core | a simple way to interact llama with gguf | Emerging |
| 19 | SqueezeAILab/SqueezeLLM | [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization | Emerging |
| 20 | GAIR-NLP/ProX | [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality... | Emerging |
| 21 | laelhalawani/gguf_llama | Wrapper for simplified use of Llama2 GGUF quantized models. | Emerging |
| 22 | awneesht/KVShuttle | Benchmark & decision framework for KV cache transfer compression in... | Emerging |
| 23 | SqueezeAILab/LLM2LLM | [ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement | Emerging |
| 24 | ariannamethod/doe | DoE Janus Architecture: Democracy of Experts | Emerging |
| 25 | LMLK-seal/HuggingGGUF | Hugging Face Model downloader and GGUF Converter. | Emerging |
| 26 | gpustack/gguf-packer-go | Deliver LLMs of GGUF format via Dockerfile. | Emerging |
| 27 | Rishit-dagli/GLU | An easy-to-use library for GLU (Gated Linear Units) and GLU variants in TensorFlow. | Emerging |
| 28 | camenduru/alpaca-lora-colab | Alpaca Lora | Emerging |
| 29 | AaronFeng753/Ollama-Model-Dumper | Export and Backup Ollama models into GGUF and ModelFile | Emerging |
| 30 | NVlabs/RocketKV | [ICML 2025] RocketKV: Accelerating Long-Context LLM Inference via Two-Stage... | Emerging |
| 31 | monk1337/auto-ollama | run ollama & gguf easily with a single command | Emerging |
| 32 | gitctrlx/llama.cu | Llama from scratch in CUDA with Flash Attention. | Emerging |
| 33 | leliuga/cohere-configurations | Co:Here Inference configurations | Emerging |
| 34 | Zishan-Shao/FlashSVD | Welcome to the FlashSVD, an activation aware inference system for SVD-based... | Emerging |
| 35 | ModelTC/QLLM | [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate... | Emerging |
| 36 | StargazerX0/ScaleKV | [NeurIPS 2025] ScaleKV: Memory-Efficient Visual Autoregressive Modeling with... | Emerging |
| 37 | Beomi/BitNet-Transformers | 0️⃣1️⃣🤗 BitNet-Transformers: Huggingface Transformers Implementation of... | Emerging |
| 38 | SqueezeAILab/KVQuant | [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with... | Emerging |
| 39 | codewithdark-git/QuantLLM | QuantLLM is a Python library designed for developers, researchers, and teams... | Emerging |
| 40 | elephantmipt/compressors | A small library with distillation, quantization and pruning pipelines | Emerging |
| 41 | smpanaro/coreml-llm-cli | CLI to demonstrate running a large language model (LLM) on Apple Neural Engine. | Emerging |
| 42 | laelhalawani/glai | glai - GGUF LLAMA AI - Package for simplified model handling and text... | Experimental |
| 43 | calcuis/gguf-selector | GGUF selector | Experimental |
| 44 | calcuis/llama-core | solo connector core built on llama.cpp | Experimental |
| 45 | calcuis/callgg | GGUF caller | Experimental |
| 46 | lpalbou/model-quantizer | Effortlessly quantize, benchmark, and publish Hugging Face models with... | Experimental |
| 47 | petermartens98/Qwen3-LLM-Pytorch-Implementation-From-Scratch | Lightweight LLM inspired by Qwen3, built from scratch in PyTorch. Full... | Experimental |
| 48 | eliahuhorwitz/MoTHer | Official PyTorch Implementation for the "Unsupervised Model Tree Heritage... | Experimental |
| 49 | arcxteam/gguf-convert-model | Auto GGUF Converter for HuggingFace Hub Models with Multiple Quantizations... | Experimental |
| 50 | pszemraj/decoder-pytorch-template | Hackable PyTorch template for decoder-only transformer architecture... | Experimental |
| 51 | jaepil/geometric-adam | A Ray Tracing-Inspired Approach to Neural Network Optimization | Experimental |
| 52 | pecharesjoselito/chuck.optimizer | Optimize neural network training by monitoring loss, gradients, and... | Experimental |
| 53 | Keyvanhardani/kvcache-autotune | Automatic KV-Cache optimization for HuggingFace Transformers. Find the... | Experimental |
| 54 | kyegomez/open_qwen | A non-official implementation of Qwen 3.5, as there doesn’t seem to be a... | Experimental |
| 55 | SolomonB14D3/intelligent-svd | Knowledge-preserving SVD compression for large language models via... | Experimental |
| 56 | megvii-research/IntLLaMA | IntLLaMA: A fast and light quantization solution for LLaMA | Experimental |
| 57 | Evrmind-UK/evr-llama | Runtime binaries for Evrmind EVR-1 models | Experimental |
| 58 | boyazzam/kvcache-autotune | 🚀 Optimize your KVCache performance with automatic tuning for efficient... | Experimental |
| 59 | Zoclee/xojo-llama | A wrapper module to do local LLM inference on GGUF models using the... | Experimental |
| 60 | zzbright1998/SentenceKV | Official implementation of "SentenceKV: Efficient LLM Inference via... | Experimental |
| 61 | bkataru/hf-hub-zig | Zig library and CLI for interacting with the HuggingFace Hub API, with a... | Experimental |
| 62 | ambv231/tinyllama-coreml-ios18-quantization | Quantize TinyLlama-1.1B-Chat from PyTorch to CoreML (float16, int8, int4)... | Experimental |
| 63 | trifledmatter/model-engine | C++ Implementation of Meta's LLaMA v2 Engine. Credited to ggerganov/llama.cpp | Experimental |
| 64 | LiteObject/llm-quantization-playground | A hands-on demo project that compares multiple quantization methods for... | Experimental |
| 65 | Kalmantic/peakweights | Data-free discovery of critical LLM weights. One forward pass. No... | Experimental |
| 66 | GodreignElgin/llm-comparision | Jupyter Notebook for LLM compression via quantization (INT8, INT4, FP16) and... | Experimental |
| 67 | 1337hero/rx7900xtx-llama-bench-vulcan | Benchmark script for llama.cpp & results for AMD RX 7900 XTX - using Vulcan | Experimental |
| 68 | lciric/gptq-from-scratch | GPTQ post-training quantization from scratch — GPT-2, OPT, LLaMA support | Experimental |
| 69 | ScalingOpt/SGG | [ACL 2025 Main] Taming LLMs by Scaling Learning Rates with Gradient Grouping | Experimental |
| 70 | LMLK-seal/ModelQuants | Professional Model Quantization Converter for HuggingFace Transformers | Experimental |
| 71 | j341nono/LLMGusser | CLI guessing game to identify which LLM (Llama vs Gemma) generated text,... | Experimental |
| 72 | MohammadKaso/tiny_Llama_mcp_flutter | edge_flutter enables seamless on-device Large Language Model inference using... | Experimental |
| 73 | alpayariyak/LLaMATH | Improving Mathematical Capabilities of Large Language Models | Experimental |
| 74 | AntonioSabbatellaUni/nlp_llm_context_cost_optimization | Exploring Context Compression techniques for token reduction. Fine-tuning... | Experimental |
| 75 | hrishi-008/LoRA-adapter-to-GGUF-for-Ollama-with-code | I've put together a step-by-step guide to convert your LoRA model to GGUF for Ollama. | Experimental |