neelsomani/kv-marketplace
Cross-GPU KV Cache Marketplace
This project helps large language model (LLM) serving platforms improve efficiency by sharing common prompt prefixes across requests and GPUs. When multiple users send prompts that begin with the same text (such as a system instruction or a common question), the system reuses the attention key/value (KV) cache produced for that shared prefix instead of recomputing it for every request. You put in multiple text prompts and get out faster, more efficient text generation, making it well suited to organizations running LLM inference at scale.
Use this if you are running a large language model serving environment and want to reduce redundant computations and improve throughput by reusing common prompt prefixes across multiple GPUs on the same machine.
Not ideal if your LLM workload consists primarily of unique, short prompts with no overlapping prefixes, or if you need to share KV caches across different machines.
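To make the prefix-reuse idea concrete, here is a minimal sketch of a prefix-keyed lookup table. This is an illustration only, not the kv-marketplace API: the class name, hashing scheme, and the string standing in for KV tensors are all assumptions for the example.

```python
import hashlib

# Toy in-memory prefix cache (illustrative, NOT the kv-marketplace API).
# Keys are hashes of token-ID prefixes; values stand in for the attention
# key/value (KV) tensors computed for that prefix.
class PrefixKVCache:
    def __init__(self):
        self._cache = {}  # prefix hash -> cached KV entry

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(repr(tokens).encode("utf-8")).hexdigest()

    def put(self, tokens, kv_entry):
        self._cache[self._key(tokens)] = kv_entry

    def longest_prefix(self, tokens):
        # Scan from the full prompt down to length 1, returning the
        # length of the longest cached prefix and its KV entry.
        for n in range(len(tokens), 0, -1):
            entry = self._cache.get(self._key(tokens[:n]))
            if entry is not None:
                return n, entry
        return 0, None

cache = PrefixKVCache()
system_prefix = [101, 7592, 2088]          # hypothetical system-prompt tokens
cache.put(system_prefix, "kv-for-prefix")  # stored after the first request

# A later request starting with the same tokens can skip recomputing
# attention for the first three positions:
hit_len, kv = cache.longest_prefix(system_prefix + [999, 1000])
```

In a real serving stack the cached values would be GPU-resident tensors and the lookup would coordinate across devices, but the core mechanic is this longest-prefix match.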
Stars
22
Forks
3
Language
Python
License
MIT
Category
Last pushed
Nov 12, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/neelsomani/kv-marketplace"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
deepspeedai/DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference...
helmholtz-analytics/heat
Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
hpcaitech/ColossalAI
Making large AI models cheaper, faster and more accessible
horovod/horovod
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
bsc-wdc/dislib
The Distributed Computing library for python implemented using PyCOMPSs programming model for HPC.