LLM Quantization Methods for Transformer Models

Tools and implementations for quantizing large language models using techniques like GPTQ, AWQ, and KV cache compression to reduce model size and inference costs. Does NOT include general model compression via pruning, distillation, or training optimization.
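At their core, all of these methods build on round-to-nearest integer quantization of weights; GPTQ and AWQ add error-correcting and activation-aware machinery on top of that baseline. A minimal pure-Python sketch of symmetric int8 weight quantization (illustrative only, not taken from any of the listed projects):

```python
# Symmetric per-tensor int8 quantization: map float weights to [-127, 127]
# with a single scale factor, then dequantize to see the rounding error.

def quantize_int8(weights):
    """Quantize a list of float weights to int8 with a per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 1.0, -0.97]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, round(max_err, 4))
```

Round-to-nearest bounds the per-weight error by half a quantization step (scale / 2); the methods in this list exist because naively applying this layer-by-layer at low bit-widths degrades accuracy, which GPTQ-style error compensation mitigates.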

There are 75 LLM quantization projects tracked. 3 score above 70 (Verified tier). The highest-rated is ModelCloud/GPTQModel at 83/100 with 1,044 stars. 6 of the top 10 are actively maintained.

Get all 75 projects as JSON:

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-quantization-methods&limit=75"

Open to everyone: 100 requests/day with no API key. Get a free key for 1,000 requests/day.
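The endpoint above can be consumed with a few lines of Python. The response shape used here (a list of objects with "model", "score", and "tier" fields) is an assumption inferred from the table below, not a documented schema, and the example parses an inline sample so it runs offline:

```python
# Hedged sketch of consuming the dataset endpoint. The payload schema
# (list of {"model", "score", "tier"} objects) is assumed, not documented.
import json
from urllib.parse import urlencode

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_url(domain, subcategory, limit=75):
    """Build the query URL for the quality dataset endpoint."""
    params = {"domain": domain, "subcategory": subcategory, "limit": limit}
    return BASE + "?" + urlencode(params)

def verified_models(payload):
    """Return names of projects scoring above 70 (the 'Verified' tier)."""
    return [p["model"] for p in payload if p["score"] > 70]

# Inline sample standing in for a live response (values from the table below).
sample = json.loads("""[
  {"model": "ModelCloud/GPTQModel", "score": 83, "tier": "Verified"},
  {"model": "intel/auto-round", "score": 75, "tier": "Verified"},
  {"model": "NVIDIA/kvpress", "score": 63, "tier": "Established"}
]""")

print(build_url("transformers", "llm-quantization-methods"))
print(verified_models(sample))
```

For a live request, pass the URL from build_url to any HTTP client (e.g. urllib.request.urlopen) and feed the response body to json.loads.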

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | ModelCloud/GPTQModel | LLM model quantization (compression) toolkit with hw acceleration support... | 83 | Verified |
| 2 | intel/auto-round | 🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed... | 75 | Verified |
| 3 | pytorch/ao | PyTorch native quantization and sparsity for training and inference | 71 | Verified |
| 4 | bodaay/HuggingFaceModelDownloader | Simple go utility to download HuggingFace Models and Datasets | 63 | Established |
| 5 | NVIDIA/kvpress | LLM KV cache compression made easy | 63 | Established |
| 6 | BlinkDL/RWKV-LM | RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can... | 57 | Established |
| 7 | Picovoice/picollm | On-device LLM Inference Powered by X-Bit Quantization | 57 | Established |
| 8 | jy-yuan/KIVI | [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | 49 | Emerging |
| 9 | zackshen/gguf | a GGUF file parser | 47 | Emerging |
| 10 | back2matching/turboquant | First open-source TurboQuant KV cache compression for LLM inference. Drop-in... | 47 | Emerging |
| 11 | AutoGPTQ/AutoGPTQ | An easy-to-use LLMs quantization package with user-friendly apis, based on... | 46 | Emerging |
| 12 | laelhalawani/gguf_modeldb | A quick and optimized solution to manage llama based gguf quantized models,... | 45 | Emerging |
| 13 | livingbio/fuzzy-json | Fuzzy-JSON is a compact Python package with no dependencies, designed to... | 45 | Emerging |
| 14 | ddh0/easy-llama | Python package wrapping llama.cpp for on-device LLM inference | 45 | Emerging |
| 15 | Michael-A-Kuykendall/shimmytok | Pure Rust tokenizer for GGUF models - llama.cpp compatible | 43 | Emerging |
| 16 | zjysteven/mink-plus-plus | [ICLR'25 Spotlight] Min-K%++: Improved baseline for detecting pre-training... | 41 | Emerging |
| 17 | TencentARC/LLaMA-Pro | [ACL 2024] Progressive LLaMA with Block Expansion. | 41 | Emerging |
| 18 | calcuis/gguf-core | a simple way to interact llama with gguf | 41 | Emerging |
| 19 | SqueezeAILab/SqueezeLLM | [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization | 41 | Emerging |
| 20 | GAIR-NLP/ProX | [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality... | 40 | Emerging |
| 21 | laelhalawani/gguf_llama | Wrapper for simplified use of Llama2 GGUF quantized models. | 39 | Emerging |
| 22 | awneesht/KVShuttle | Benchmark & decision framework for KV cache transfer compression in... | 38 | Emerging |
| 23 | SqueezeAILab/LLM2LLM | [ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement | 38 | Emerging |
| 24 | ariannamethod/doe | DoE Janus Architecture: Democracy of Experts | 38 | Emerging |
| 25 | LMLK-seal/HuggingGGUF | Hugging Face Model downloader and GGUF Converter. | 37 | Emerging |
| 26 | gpustack/gguf-packer-go | Deliver LLMs of GGUF format via Dockerfile. | 37 | Emerging |
| 27 | Rishit-dagli/GLU | An easy-to-use library for GLU (Gated Linear Units) and GLU variants in TensorFlow. | 37 | Emerging |
| 28 | camenduru/alpaca-lora-colab | Alpaca Lora | 36 | Emerging |
| 29 | AaronFeng753/Ollama-Model-Dumper | Export and Backup Ollama models into GGUF and ModelFile | 36 | Emerging |
| 30 | NVlabs/RocketKV | [ICML 2025] RocketKV: Accelerating Long-Context LLM Inference via Two-Stage... | 36 | Emerging |
| 31 | monk1337/auto-ollama | run ollama & gguf easily with a single command | 34 | Emerging |
| 32 | gitctrlx/llama.cu | Llama from scratch in CUDA with Flash Attention. | 34 | Emerging |
| 33 | leliuga/cohere-configurations | Co:Here Inference configurations | 34 | Emerging |
| 34 | Zishan-Shao/FlashSVD | Welcome to the FlashSVD, an activation aware inference system for SVD-based... | 34 | Emerging |
| 35 | ModelTC/QLLM | [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate... | 34 | Emerging |
| 36 | StargazerX0/ScaleKV | [NeurIPS 2025] ScaleKV: Memory-Efficient Visual Autoregressive Modeling with... | 34 | Emerging |
| 37 | Beomi/BitNet-Transformers | 0️⃣1️⃣🤗 BitNet-Transformers: Huggingface Transformers Implementation of... | 34 | Emerging |
| 38 | SqueezeAILab/KVQuant | [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with... | 34 | Emerging |
| 39 | codewithdark-git/QuantLLM | QuantLLM is a Python library designed for developers, researchers, and teams... | 33 | Emerging |
| 40 | elephantmipt/compressors | A small library with distillation, quantization and pruning pipelines | 33 | Emerging |
| 41 | smpanaro/coreml-llm-cli | CLI to demonstrate running a large language model (LLM) on Apple Neural Engine. | 31 | Emerging |
| 42 | laelhalawani/glai | glai - GGUF LLAMA AI - Package for simplified model handling and text... | 29 | Experimental |
| 43 | calcuis/gguf-selector | GGUF selector | 29 | Experimental |
| 44 | calcuis/llama-core | solo connector core built on llama.cpp | 29 | Experimental |
| 45 | calcuis/callgg | GGUF caller | 29 | Experimental |
| 46 | lpalbou/model-quantizer | Effortlessly quantize, benchmark, and publish Hugging Face models with... | 27 | Experimental |
| 47 | petermartens98/Qwen3-LLM-Pytorch-Implementation-From-Scratch | Lightweight LLM inspired by Qwen3, built from scratch in PyTorch. Full... | 26 | Experimental |
| 48 | eliahuhorwitz/MoTHer | Official PyTorch Implementation for the "Unsupervised Model Tree Heritage... | 26 | Experimental |
| 49 | arcxteam/gguf-convert-model | Auto GGUF Converter for HuggingFace Hub Models with Multiple Quantizations... | 25 | Experimental |
| 50 | pszemraj/decoder-pytorch-template | Hackable PyTorch template for decoder-only transformer architecture... | 24 | Experimental |
| 51 | jaepil/geometric-adam | A Ray Tracing-Inspired Approach to Neural Network Optimization | 23 | Experimental |
| 52 | pecharesjoselito/chuck.optimizer | Optimize neural network training by monitoring loss, gradients, and... | 22 | Experimental |
| 53 | Keyvanhardani/kvcache-autotune | Automatic KV-Cache optimization for HuggingFace Transformers. Find the... | 22 | Experimental |
| 54 | kyegomez/open_qwen | A non-official implementation of Qwen 3.5, as there doesn't seem to be a... | 22 | Experimental |
| 55 | SolomonB14D3/intelligent-svd | Knowledge-preserving SVD compression for large language models via... | 22 | Experimental |
| 56 | megvii-research/IntLLaMA | IntLLaMA: A fast and light quantization solution for LLaMA | 22 | Experimental |
| 57 | Evrmind-UK/evr-llama | Runtime binaries for Evrmind EVR-1 models | 22 | Experimental |
| 58 | boyazzam/kvcache-autotune | 🚀 Optimize your KVCache performance with automatic tuning for efficient... | 21 | Experimental |
| 59 | Zoclee/xojo-llama | A wrapper module to do local LLM inference on GGUF models using the... | 21 | Experimental |
| 60 | zzbright1998/SentenceKV | Official implementation of "SentenceKV: Efficient LLM Inference via... | 21 | Experimental |
| 61 | bkataru/hf-hub-zig | Zig library and CLI for interacting with the HuggingFace Hub API, with a... | 19 | Experimental |
| 62 | ambv231/tinyllama-coreml-ios18-quantization | Quantize TinyLlama-1.1B-Chat from PyTorch to CoreML (float16, int8, int4)... | 19 | Experimental |
| 63 | trifledmatter/model-engine | C++ Implementation of Meta's LLaMA v2 Engine. Credited to ggerganov/llama.cpp | 18 | Experimental |
| 64 | LiteObject/llm-quantization-playground | A hands-on demo project that compares multiple quantization methods for... | 17 | Experimental |
| 65 | Kalmantic/peakweights | Data-free discovery of critical LLM weights. One forward pass. No... | 17 | Experimental |
| 66 | GodreignElgin/llm-comparision | Jupyter Notebook for LLM compression via quantization (INT8, INT4, FP16) and... | 15 | Experimental |
| 67 | 1337hero/rx7900xtx-llama-bench-vulcan | Benchmark script for llama.cpp & results for AMD RX 7900 XTX - using Vulcan | 15 | Experimental |
| 68 | lciric/gptq-from-scratch | GPTQ post-training quantization from scratch: GPT-2, OPT, LLaMA support | 14 | Experimental |
| 69 | ScalingOpt/SGG | [ACL 2025 Main] Taming LLMs by Scaling Learning Rates with Gradient Grouping | 14 | Experimental |
| 70 | LMLK-seal/ModelQuants | Professional Model Quantization Converter for HuggingFace Transformers | 13 | Experimental |
| 71 | j341nono/LLMGusser | CLI guessing game to identify which LLM (Llama vs Gemma) generated text,... | 13 | Experimental |
| 72 | MohammadKaso/tiny_Llama_mcp_flutter | edge_flutter enables seamless on-device Large Language Model inference using... | 11 | Experimental |
| 73 | alpayariyak/LLaMATH | Improving Mathematical Capabilities of Large Language Models | 11 | Experimental |
| 74 | AntonioSabbatellaUni/nlp_llm_context_cost_optimization | Exploring Context Compression techniques for token reduction. Fine-tuning... | 10 | Experimental |
| 75 | hrishi-008/LoRA-adapter-to-GGUF-for-Ollama-with-code | I've put together a step-by-step guide to convert your LoRA model to GGUF for Ollama. | 10 | Experimental |