AnswerDotAI/cold-compress

Cold Compress is a hackable, lightweight, open-source toolkit, built on top of GPT-Fast (a simple, PyTorch-native generation codebase), for creating and benchmarking KV cache compression methods.

Score: 39 / 100 (Emerging)

This toolkit helps machine learning engineers and researchers manage the memory footprint of large language models (LLMs) during text generation. It takes a trained LLM and lets you apply various techniques to compress its internal KV cache, which stores the keys and values from each attention layer as tokens are generated. The result is a more memory-efficient model that can process longer prompts or generate longer responses with less memory overhead.
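To make the idea concrete, here is a minimal sketch (in plain Python, and not Cold Compress's actual API) of one family of strategies such toolkits benchmark: a fixed-budget cache that always keeps a few initial "attention sink" tokens plus a sliding window of the most recent tokens, evicting the oldest non-sink entries. The class name and budget are illustrative assumptions.

```python
class WindowedKVCache:
    """Illustrative fixed-budget KV cache: keep a few initial
    "sink" tokens plus a sliding window of recent tokens."""

    def __init__(self, budget: int, n_sink: int = 4):
        self.budget = budget   # max number of tokens kept in the cache
        self.n_sink = n_sink   # always-kept initial tokens
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.budget:
            # Evict the oldest non-sink entry to stay within budget.
            evict = self.n_sink
            del self.keys[evict]
            del self.values[evict]

cache = WindowedKVCache(budget=8)
for t in range(20):
    cache.append(f"k{t}", f"v{t}")
# The cache now holds the 4 sink tokens (k0..k3) plus the 4 most
# recent tokens (k16..k19), regardless of sequence length.
```

In a real implementation the keys and values are per-layer tensors rather than strings, and eviction policies can be smarter (e.g. dropping tokens with low attention scores), but the memory bound works the same way: cache size stays constant as the sequence grows.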

148 stars. No commits in the last 6 months.

Use this if you are a machine learning engineer or researcher looking to experiment with and benchmark different methods for reducing the memory usage of large language models during inference.

Not ideal if you are an end-user simply looking to apply LLMs without needing to delve into their underlying memory optimization or model architecture.

large-language-models LLM-inference model-optimization machine-learning-engineering AI-research
Stale (6m) · No Package · No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 13 / 25


Stars: 148
Forks: 16
Language: Python
License: BSD-3-Clause
Last pushed: Aug 09, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/AnswerDotAI/cold-compress"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
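The same endpoint can be called from Python with the standard library. The sketch below builds the endpoint URL from the category and repo shown in the curl example above; the response schema is not documented here, so the code only decodes it as generic JSON.

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, repo: str) -> str:
    """Build the quality-report endpoint URL for a repo."""
    return f"{BASE}/{category}/{repo}"

def fetch_quality(category: str, repo: str) -> dict:
    """Fetch and decode the JSON quality report (schema not documented here)."""
    with urllib.request.urlopen(quality_url(category, repo)) as resp:
        return json.load(resp)

# Example (makes a network request):
# report = fetch_quality("llm-tools", "AnswerDotAI/cold-compress")
```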