CoderLSF/fast-llama
Runs LLaMA with Extremely HIGH speed
fast-llama is a C++ engine for running large language models, specifically LLaMA2, on CPU hardware at high speed. It loads LLaMA2 models, including checkpoints from HuggingFace, and the project reports noticeably faster text generation than comparable CPU inference engines. Developers building applications that need efficient, on-premise language model inference will find this tool valuable.
No commits in the last 6 months.
Use this if you are a developer deploying LLaMA2 models and need a C++ inference engine that prioritizes CPU performance and speed.
Not ideal if you are looking for a plug-and-play solution for non-developers or require extensive GPU support right now.
Stars: 95
Forks: 10
Language: C++
License: MIT
Last pushed: Nov 21, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/CoderLSF/fast-llama"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
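For scripted access, the same endpoint can be queried from Python with the standard library. This is a minimal sketch: the URL is taken from the curl example above, but the response schema is not documented here, so the code simply assumes a JSON body; the function names are illustrative, not part of any official client.

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def quality_url(owner: str, repo: str) -> str:
    """Build the quality-endpoint URL for a given GitHub repository."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the repository's quality data; assumes the API returns JSON."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Same request as the curl example, within the 100-requests/day free tier.
    print(fetch_quality("CoderLSF", "fast-llama"))
```

Adding the free API key would depend on how the service expects it to be passed (header or query parameter), which the listing does not specify, so it is omitted here.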
Higher-rated alternatives
- vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs
- sgl-project/sglang: SGLang is a high-performance serving framework for large language models and multimodal models.
- alibaba/MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
- xorbitsai/inference: Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
- tensorzero/tensorzero: TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...