AnswerDotAI/cold-compress
Cold Compress is a hackable, lightweight, open-source toolkit for creating and benchmarking cache compression methods, built on top of GPT-Fast, a simple, PyTorch-native generation codebase.
This toolkit helps machine learning engineers and researchers manage the memory footprint of large language models (LLMs) during text generation. It takes a trained LLM and lets you apply various techniques to compress its internal KV cache, which stores the attention keys and values computed for previous tokens. The result is a more memory-efficient LLM that can process longer prompts or generate longer responses with less computational overhead.
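To make the idea concrete, here is a minimal sketch of one of the simplest KV cache compression strategies, a fixed-budget sliding window that evicts the oldest entries. The class and names are hypothetical, not Cold Compress's actual API, and plain strings stand in for the key/value tensors a real model would store:

```python
from collections import deque


class SlidingWindowKVCache:
    """Illustrative KV cache with a fixed token budget.

    Hypothetical sketch: once the cache exceeds its budget, the
    oldest key/value entries are evicted so memory stays bounded
    no matter how long generation runs.
    """

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.keys = deque()
        self.values = deque()

    def append(self, k, v):
        # Store the new token's key/value, then evict from the
        # front until we are back under the budget.
        self.keys.append(k)
        self.values.append(v)
        while len(self.keys) > self.max_tokens:
            self.keys.popleft()
            self.values.popleft()


# Generate 10 tokens with a budget of 4: only the most recent
# 4 entries survive.
cache = SlidingWindowKVCache(max_tokens=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")
print(list(cache.keys))  # ['k6', 'k7', 'k8', 'k9']
```

Real methods in this space (e.g. heavy-hitter selection as in H2O, listed under alternatives below) use smarter eviction scores than recency, but the memory-bounding mechanism is the same.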
148 stars. No commits in the last 6 months.
Use this if you are a machine learning engineer or researcher looking to experiment with and benchmark different methods for reducing the memory usage of large language models during inference.
Not ideal if you are an end-user simply looking to apply LLMs without needing to delve into their underlying memory optimization or model architecture.
Stars: 148
Forks: 16
Language: Python
License: BSD-3-Clause
Category:
Last pushed: Aug 09, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/AnswerDotAI/cold-compress"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
huawei-csl/SINQ
Welcome to the official repository of SINQ! A novel, fast and high-quality quantization method...
SILX-LABS/QUASAR-SUBNET
QUASAR is a long-context foundation model and decentralized evaluation subnet built on Bittensor,
stackblogger/bitnet.js
BitNet.Js - A node.js implementation of the microsoft bitnet.cpp inference framework.
m96-chan/0xBitNet
Run BitNet b1.58 ternary LLMs with WebGPU — in browsers and native apps
FMInference/H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.