Siddhant-K-code/tokenvm

TokenVM is a high-performance runtime that treats LLM KV cache and activations as a virtual memory working set across GPU VRAM → pinned host RAM → NVMe storage, with intelligent paging, prefetching, and compute-copy overlap.

/ 100

Experimental

TokenVM lets developers run large language models (LLMs) on much longer sequences than GPU memory alone would allow. It pages the model's KV cache and activations between GPU VRAM, pinned host RAM, and NVMe storage, keeping the active working set on the GPU. You supply your existing LLM code, and the runtime lets that model process significantly longer contexts.
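The tiering idea described above can be illustrated with a toy least-recently-used store. This is a minimal sketch, not TokenVM's implementation or API: the class `TieredKVCache`, its methods, and the block-count capacities are all hypothetical, and the three tiers are plain in-memory dictionaries standing in for VRAM, pinned host RAM, and NVMe.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy three-tier LRU store (hypothetical; not TokenVM's API).

    Tier 0 stands in for GPU VRAM, tier 1 for pinned host RAM,
    tier 2 for NVMe. All three are ordinary dicts in this sketch.
    """

    TIER_NAMES = ("vram", "host", "nvme")

    def __init__(self, vram_blocks, host_blocks):
        # Capacities are counted in blocks; the last tier is unbounded.
        self.capacity = (vram_blocks, host_blocks, None)
        self.tiers = [OrderedDict() for _ in self.TIER_NAMES]

    def _insert(self, level, block_id, data):
        tier = self.tiers[level]
        tier[block_id] = data
        tier.move_to_end(block_id)  # mark as most recently used
        cap = self.capacity[level]
        if cap is not None and len(tier) > cap:
            # Demote the least recently used block to the next tier down.
            victim_id, victim = tier.popitem(last=False)
            self._insert(level + 1, victim_id, victim)

    def put(self, block_id, data):
        # Drop any stale copy, then place the block in the fastest tier.
        for tier in self.tiers:
            tier.pop(block_id, None)
        self._insert(0, block_id, data)

    def get(self, block_id):
        # Find the block in whichever tier holds it and promote it to tier 0.
        for tier in self.tiers:
            if block_id in tier:
                data = tier.pop(block_id)
                self._insert(0, block_id, data)
                return data
        raise KeyError(block_id)

    def location(self, block_id):
        for name, tier in zip(self.TIER_NAMES, self.tiers):
            if block_id in tier:
                return name
        return None


cache = TieredKVCache(vram_blocks=2, host_blocks=2)
for i in range(5):
    cache.put(i, f"kv-block-{i}")
print([cache.location(i) for i in range(5)])
```

A real runtime would additionally prefetch blocks before they are needed and overlap the copies with compute; this sketch only shows the demote-on-evict and promote-on-access bookkeeping.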

No commits in the last 6 months.

Use this if you are an ML engineer or researcher who needs to run LLMs with very long context windows (e.g., 32,000 to 64,000 tokens) but is constrained by GPU memory.
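To see why contexts of this length strain GPU memory, a rough KV cache size estimate helps. The model shape below (32 layers, 32 KV heads, head dimension 128, 16-bit values) is an assumed example, not a figure from the TokenVM project; the formula counts one key and one value vector per layer, per head, per token.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=32, head_dim=128, bytes_per_value=2):
    """Estimate KV cache size for a single sequence.

    The default model shape is an illustrative assumption.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


for tokens in (32_000, 64_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens} tokens: {gib:.3f} GiB")
# prints:
# 32000 tokens: 15.625 GiB
# 64000 tokens: 31.250 GiB
```

Under these assumptions the cache for a single 64,000-token sequence would exceed the memory of a 24 GB GPU before the model weights are counted, which is the situation where spilling to host RAM and NVMe becomes relevant.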

Not ideal if you are working with smaller LLMs or short context windows where memory optimization is not a primary concern, or if you require a stable, production-ready solution without further testing.

LLM-development GPU-optimization large-context-AI deep-learning-infrastructure

Stale 6m No Package No Dependents

Maintenance 2 / 25

Adoption 5 / 25

Maturity 15 / 25

Community 0 / 25



License MIT

Higher-rated alternatives

ModelEngine-Group/unified-cache-management

Persist and reuse KV Cache to speed up your LLM.

reloadware/reloadium

Hot Reloading and Profiling for Python

alibaba/tair-kvcache

Alibaba Cloud's high-performance KVCache system for LLM inference, with components for global...

October2001/Awesome-KV-Cache-Compression

📰 Must-read papers on KV Cache Compression (constantly updating 🤗).

Zefan-Cai/Awesome-LLM-KV-Cache

Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes.
