dipampaul17/KVSplit
Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.
This project helps developers working with large language models (LLMs) on Apple Silicon Macs. It lets you run bigger models with much longer text inputs by sharply reducing the memory consumed by the model's KV cache (the attention keys and values kept in memory during inference). You supply an LLM model file and gain the ability to process longer documents or run larger models without exhausting memory, often with improved speed.
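To see where the headline 59% figure comes from, here is a back-of-the-envelope sketch. It assumes llama.cpp-style block quantization, where q8_0 costs roughly 8.5 bits per element and q4_0 roughly 4.5 (each 32-element block carries an fp16 scale); the model dimensions are illustrative, not taken from the repo.

```python
# Estimate KV cache memory: fp16 baseline vs. 8-bit keys / 4-bit values.
# Effective sizes assume block quantization with one fp16 scale per 32
# elements: q8_0 ~= 8.5 bits/element, q4_0 ~= 4.5 bits/element.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, k_bits, v_bits):
    """Total KV cache size in bytes for one sequence."""
    elems_per_tensor = n_layers * n_kv_heads * head_dim * n_ctx
    return elems_per_tensor * (k_bits + v_bits) / 8

# Illustrative 7B-class model: 32 layers, 32 KV heads, head_dim 128, 8K context.
fp16 = kv_cache_bytes(32, 32, 128, 8192, 16.0, 16.0)
mixed = kv_cache_bytes(32, 32, 128, 8192, 8.5, 4.5)  # 8-bit K, 4-bit V

print(f"fp16 KV cache:  {fp16 / 2**30:.2f} GiB")   # 4.00 GiB
print(f"K8/V4 KV cache: {mixed / 2**30:.2f} GiB")
print(f"reduction: {1 - mixed / fp16:.1%}")        # ~59%, matching the claim
```

Under these assumptions the reduction is 1 − 13/32 ≈ 59.4%, which lines up with the 59% quoted above; the exact figure depends on the quantization formats actually used.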
362 stars. No commits in the last 6 months.
Use this if you are a developer building or running LLMs on an Apple Silicon Mac and are hitting memory limits when dealing with long contexts or larger models.
Not ideal if you are not a developer, or if you are running LLMs on hardware other than Apple Silicon.
Stars: 362
Forks: 13
Language: Python
License: —
Category: —
Last pushed: May 21, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/dipampaul17/KVSplit"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
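The same endpoint can be queried programmatically. A minimal sketch using only the Python standard library; the response schema is not documented on this page, so the result is just pretty-printed rather than parsed into named fields:

```python
# Fetch the quality record for a repo from the API shown above.
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def quality_url(owner: str, repo: str) -> str:
    """Build the per-repo endpoint URL."""
    return f"{BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """GET the endpoint and decode the JSON body (schema unspecified)."""
    with urllib.request.urlopen(quality_url(owner, repo), timeout=10) as resp:
        return json.load(resp)
```

Usage: `print(json.dumps(fetch_quality("dipampaul17", "KVSplit"), indent=2))`. Note the 100 requests/day limit on keyless access.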
Higher-rated alternatives
ModelEngine-Group/unified-cache-management
Persist and reuse KV cache to speed up your LLM.
reloadware/reloadium
Hot Reloading and Profiling for Python
alibaba/tair-kvcache
Alibaba Cloud's high-performance KVCache system for LLM inference, with components for global...
October2001/Awesome-KV-Cache-Compression
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
Zefan-Cai/Awesome-LLM-KV-Cache
Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes.