AnswerDotAI/cold-compress
Cold Compress is a hackable, lightweight, open-source toolkit for creating and benchmarking cache compression methods, built on top of GPT-Fast, a simple, PyTorch-native generation codebase.
This toolkit helps machine learning engineers and researchers manage the memory footprint of large language models (LLMs) during text generation. It takes a trained LLM and lets you apply various techniques to compress its internal KV cache, which stores the attention keys and values computed for previous tokens. The result is a more memory-efficient LLM that can process longer prompts or generate longer responses with less computational overhead.
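To make the idea concrete, here is a minimal sketch of one of the simplest KV cache compression strategies, a fixed-budget sliding window that evicts the oldest entries. The class and names are hypothetical, not Cold Compress's actual API, and plain strings stand in for the key/value tensors a real model would store:

```python
from collections import deque


class SlidingWindowKVCache:
    """Illustrative KV cache with a fixed token budget.

    Hypothetical sketch: once the cache exceeds its budget, the
    oldest key/value entries are evicted so memory stays bounded
    no matter how long generation runs.
    """

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.keys = deque()
        self.values = deque()

    def append(self, k, v):
        # Store the new token's key/value, then evict from the
        # front until we are back under the budget.
        self.keys.append(k)
        self.values.append(v)
        while len(self.keys) > self.max_tokens:
            self.keys.popleft()
            self.values.popleft()


# Generate 10 tokens with a budget of 4: only the most recent
# 4 entries survive.
cache = SlidingWindowKVCache(max_tokens=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")
print(list(cache.keys))  # ['k6', 'k7', 'k8', 'k9']
```

Real methods in this space (e.g. heavy-hitter selection as in H2O, listed under alternatives below) use smarter eviction scores than recency, but the memory-bounding mechanism is the same.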
148 stars. No commits in the last 6 months.
Use this if you are a machine learning engineer or researcher looking to experiment with and benchmark different methods for reducing the memory usage of large language models during inference.
Not ideal if you are an end-user simply looking to apply LLMs without needing to delve into their underlying memory optimization or model architecture.
Stars: 148
Forks: 16
Language: Python
License: BSD-3-Clause
Category:
Last pushed: Aug 09, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/AnswerDotAI/cold-compress"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
huawei-csl/SINQ
Welcome to the official repository of SINQ! A novel, fast and high-quality quantization method...
SILX-LABS/QUASAR-SUBNET
QUASAR is a long-context foundation model and decentralized evaluation subnet built on Bittensor,
stackblogger/bitnet.js
BitNet.Js - A node.js implementation of the microsoft bitnet.cpp inference framework.
m96-chan/0xBitNet
Run BitNet b1.58 ternary LLMs with WebGPU — in browsers and native apps
FMInference/H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.