James-QiuHaoran/LLM-serving-with-proxy-models

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction | A tiny BERT model can tell you the verbosity of an LLM (with low latency overhead!)

Quality score: 39 / 100 (Emerging)

This project helps operations engineers optimize the performance of large language model (LLM) serving systems. Before generation begins, a lightweight proxy model predicts the length of the LLM's response to a user query; since generation time scales with output length, this lets the system handle requests more efficiently. The input is a user query, and the output is a predicted response length, which a scheduler uses to improve overall throughput and reduce wait times. It is aimed at LLM system administrators and cloud engineers who manage such services.
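
For a sense of what the proxy model does, here is a minimal sketch of length prediction with Hugging Face Transformers and PyTorch. The checkpoint path is hypothetical and the single-output regression head is an assumption; the repo's own fine-tuned model and head may differ.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical fine-tuned BERT checkpoint with a 1-output regression head
# that maps a query to a predicted response length in tokens.
CHECKPOINT = "path/to/finetuned-length-predictor"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=1)
model.eval()

def predict_response_length(query: str) -> int:
    """Estimate how many tokens the LLM will generate for this query."""
    inputs = tokenizer(query, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, 1)
    return max(0, round(logits.item()))

print(predict_response_length("Explain the history of the Roman Empire."))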

No commits in the last 6 months.

Use this if you are running an LLM inference service and want to reduce user wait times and improve system efficiency without changing your core memory or cache management.

Not ideal if you are a data scientist looking for a general-purpose LLM or an application developer integrating LLMs into your product, as this focuses on infrastructure optimization.
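
To make the scheduling use case above concrete, this toy sketch orders queued requests shortest-predicted-first, so short queries are not stuck behind verbose ones. It is illustrative only; the repo's actual scheduling policy may differ.

import heapq
import itertools

_order = itertools.count()  # tie-breaker so equal predictions keep FIFO order
_queue = []

def submit(query: str, predicted_tokens: int) -> None:
    """Enqueue a request keyed by its predicted response length."""
    heapq.heappush(_queue, (predicted_tokens, next(_order), query))

def next_request() -> str:
    """Pop the request the proxy model expects to finish soonest."""
    _, _, query = heapq.heappop(_queue)
    return query

submit("Write a 2,000-word essay on scheduling.", 1800)
submit("Summarize this paragraph in one sentence.", 40)
print(next_request())  # the short summary is served first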

Tags: LLM-operations, cloud-infrastructure, system-optimization, latency-reduction, resource-scheduling
Status: Stale (6 months) · No package published · No dependents

Score breakdown:
Maintenance: 0 / 25
Adoption: 8 / 25
Maturity: 16 / 25
Community: 15 / 25

Stars: 49
Forks: 8
Language: Jupyter Notebook
License: Apache-2.0
Last pushed: Jun 01, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/James-QiuHaoran/LLM-serving-with-proxy-models"

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
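
The same request from Python, assuming the requests library is installed; the response's JSON schema is not documented here, so it is simply printed.

import requests

# Same endpoint as the curl command above.
URL = ("https://pt-edge.onrender.com/api/v1/quality/transformers/"
       "James-QiuHaoran/LLM-serving-with-proxy-models")

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # raises on HTTP errors (e.g. rate limiting)
print(resp.json())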