FMInference/H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
Serving Large Language Models (LLMs) for tasks like story writing or powering chatbots is expensive largely because the key-value (KV) cache grows with every generated token. H2O shrinks this memory footprint by evicting low-importance cache entries while retaining the small set of "heavy hitter" tokens that account for most of the attention mass, which makes long content generation cheaper and faster. Developers and ML engineers deploying LLMs can use it to optimize their inference systems.
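To make the idea concrete, here is a minimal sketch of an H2O-style eviction step. It is not the repository's implementation: the function name, the single accumulated-score array, and the budget/recent-window split are illustrative assumptions.

```python
import numpy as np

def select_kept_tokens(cum_attn: np.ndarray, budget: int, recent_window: int) -> np.ndarray:
    """Pick which KV-cache positions to keep, H2O-style.

    cum_attn: per-token attention mass accumulated over decoding steps
              (a simplified stand-in for the paper's heavy-hitter score).
    budget:   total cache entries allowed (heavy hitters + recent window).
    recent_window: number of most-recent tokens that are always retained.
    """
    assert budget >= recent_window
    n = cum_attn.shape[0]
    if n <= budget:
        return np.arange(n)  # cache still fits, keep everything
    recent = np.arange(n - recent_window, n)  # local tokens, always kept
    older = np.arange(n - recent_window)
    n_heavy = budget - recent_window
    # Fill the remaining budget with the heaviest hitters among older tokens.
    heavy = older[np.argsort(cum_attn[older])[-n_heavy:]]
    return np.sort(np.concatenate([heavy, recent]))

# Example: a 10-token cache squeezed into a budget of 6 (4 heavy + 2 recent).
scores = np.array([5.0, 0.1, 3.2, 0.2, 4.1, 0.3, 0.1, 2.5, 0.2, 0.1])
print(select_kept_tokens(scores, budget=6, recent_window=2))  # [0 2 4 7 8 9]
```

A real implementation would update the scores from the new attention weights at each decoding step and apply this selection per layer and head.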
506 stars. No commits in the last 6 months.
Use this if you are a developer or ML engineer working with Large Language Models and need to reduce memory consumption and improve throughput for generative inference, especially for long content.
Not ideal if you are a business user looking for a no-code solution or if your primary concern is fine-tuning an LLM rather than optimizing its deployment efficiency.
Stars: 506
Forks: 74
Language: Python
License: —
Category: llm-tools
Last pushed: Aug 01, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/FMInference/H2O"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
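For programmatic access, a small Python client along these lines should work. The endpoint is the one from the curl example above; the shape of the JSON response is not documented on this page, so the example simply prints whatever comes back.

```python
import requests

# Endpoint taken from the curl example above; response fields are not
# documented here, so we only inspect the raw JSON.
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/FMInference/H2O"
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # surface HTTP errors (e.g., rate limiting) early
print(resp.json())
```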
Higher-rated alternatives
huawei-csl/SINQ
Welcome to the official repository of SINQ! A novel, fast and high-quality quantization method...
SILX-LABS/QUASAR-SUBNET
QUASAR is a long-context foundation model and decentralized evaluation subnet built on Bittensor...
stackblogger/bitnet.js
BitNet.Js - A node.js implementation of the microsoft bitnet.cpp inference framework.
m96-chan/0xBitNet
Run BitNet b1.58 ternary LLMs with WebGPU — in browsers and native apps
AnswerDotAI/cold-compress
Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking...