FMInference/H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
Serving Large Language Models (LLMs) for tasks like story writing or powering chatbots is expensive largely because the key-value (KV) cache grows with every generated token. H2O shrinks this memory footprint by evicting low-importance cache entries while retaining the small set of "heavy hitter" tokens that account for most of the attention mass, which makes long content generation cheaper and faster. Developers and ML engineers deploying LLMs can use it to optimize their inference systems.
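To make the idea concrete, here is a minimal sketch of an H2O-style eviction step. It is not the repository's implementation: the function name, the single accumulated-score array, and the budget/recent-window split are illustrative assumptions.

```python
import numpy as np

def select_kept_tokens(cum_attn: np.ndarray, budget: int, recent_window: int) -> np.ndarray:
    """Pick which KV-cache positions to keep, H2O-style.

    cum_attn: per-token attention mass accumulated over decoding steps
              (a simplified stand-in for the paper's heavy-hitter score).
    budget:   total cache entries allowed (heavy hitters + recent window).
    recent_window: number of most-recent tokens that are always retained.
    """
    assert budget >= recent_window
    n = cum_attn.shape[0]
    if n <= budget:
        return np.arange(n)  # cache still fits, keep everything
    recent = np.arange(n - recent_window, n)  # local tokens, always kept
    older = np.arange(n - recent_window)
    n_heavy = budget - recent_window
    # Fill the remaining budget with the heaviest hitters among older tokens.
    heavy = older[np.argsort(cum_attn[older])[-n_heavy:]]
    return np.sort(np.concatenate([heavy, recent]))

# Example: a 10-token cache squeezed into a budget of 6 (4 heavy + 2 recent).
scores = np.array([5.0, 0.1, 3.2, 0.2, 4.1, 0.3, 0.1, 2.5, 0.2, 0.1])
print(select_kept_tokens(scores, budget=6, recent_window=2))  # [0 2 4 7 8 9]
```

A real implementation would update the scores from the new attention weights at each decoding step and apply this selection per layer and head.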
506 stars. No commits in the last 6 months.
Use this if you are a developer or ML engineer working with Large Language Models and need to reduce memory consumption and improve throughput for generative inference, especially for long content.
Not ideal if you are a business user looking for a no-code solution or if your primary concern is fine-tuning an LLM rather than optimizing its deployment efficiency.
Stars: 506
Forks: 74
Language: Python
License: —
Category: llm-tools
Last pushed: Aug 01, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/FMInference/H2O"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
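For programmatic access, a small Python client along these lines should work. The endpoint is the one from the curl example above; the shape of the JSON response is not documented on this page, so the example simply prints whatever comes back.

```python
import requests

# Endpoint taken from the curl example above; response fields are not
# documented here, so we only inspect the raw JSON.
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/FMInference/H2O"
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # surface HTTP errors (e.g., rate limiting) early
print(resp.json())
```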
Higher-rated alternatives
huawei-csl/SINQ
Welcome to the official repository of SINQ! A novel, fast and high-quality quantization method...
SILX-LABS/QUASAR-SUBNET
QUASAR is a long-context foundation model and decentralized evaluation subnet built on Bittensor...
stackblogger/bitnet.js
BitNet.Js - A node.js implementation of the microsoft bitnet.cpp inference framework.
m96-chan/0xBitNet
Run BitNet b1.58 ternary LLMs with WebGPU — in browsers and native apps
AnswerDotAI/cold-compress
Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking...