AutonomicPerfectionist/PipeInfer
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
This project helps machine learning engineers and researchers accelerate text generation from large language models (LLMs). By pairing a small "speculative" (draft) model with a large "target" model, it can produce responses much faster than the target model alone. You provide your LLMs and a text prompt, and it outputs the generated text at significantly higher speed.
No commits in the last 6 months.
Use this if you need to dramatically speed up text generation from Llama, Falcon, Baichuan, or other compatible large language models running across a multi-node computing cluster.
Not ideal if you are running LLM inference on a single machine or do not have access to a distributed computing environment.
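The draft-and-verify idea behind this project can be illustrated with a toy sketch. This is not PipeInfer's API or its asynchronous pipelining across cluster nodes; it only shows the core speculative decoding loop, with two stand-in functions playing the roles of the draft and target models. All names here are hypothetical.

```python
# Toy greedy speculative decoding sketch (assumption: deterministic
# "models" that map a token sequence to the next token; real systems
# verify all draft positions in one batched target-model pass).

def draft_model(tokens):
    # Small, cheap model: usually right, sometimes wrong.
    return (sum(tokens) * 3 + 1) % 7

def target_model(tokens):
    # Large model: defines the output we must reproduce exactly.
    s = sum(tokens)
    return (s * 3 + 1) % 7 if s % 4 else (s + 2) % 7

def greedy_generate(prompt, n_new):
    # Baseline: one expensive target-model call per generated token.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_model(tokens))
    return tokens

def speculative_generate(prompt, n_new, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. Target model verifies each proposed position.
        accepted = 0
        for i in range(k):
            expected = target_model(tokens + draft[:i])
            if draft[i] == expected:
                accepted += 1
            else:
                # 3. First mismatch: keep the accepted prefix plus the
                #    target's correction, discard the rest of the draft.
                tokens.extend(draft[:accepted] + [expected])
                break
        else:
            # All k draft tokens matched the target model.
            tokens.extend(draft)
    return tokens[:len(prompt) + n_new]
```

With greedy verification, the speculative loop's output is token-for-token identical to running the target model alone; the speedup comes from accepting several draft tokens per expensive target-model pass. PipeInfer's contribution is making the verification asynchronous and pipelined across nodes, which this single-threaded sketch does not show.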
Stars: 32
Forks: 5
Language: C++
License: MIT
Category:
Last pushed: Nov 16, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/AutonomicPerfectionist/PipeInfer"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
quic/efficient-transformers
This library empowers users to seamlessly port pretrained models and checkpoints on the...
ManuelSLemos/RabbitLLM
Run 70B+ LLMs on a single 4GB GPU — no quantization required.
alpa-projects/alpa
Training and serving large-scale neural networks with auto parallelization.
arm-education/Advanced-AI-Hardware-Software-Co-Design
Hands-on course materials for ML engineers to master extreme model quantization and on-device...
IST-DASLab/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes...