mddunlap924/LLM-Inference-Serving

This repository demonstrates running LLMs on CPUs with packages such as llamafile, emphasizing low latency, high throughput, and cost effectiveness for inference and serving.

Score: 21 / 100 (Experimental)

This project helps machine learning engineers and developers efficiently deploy and run large language models (LLMs) on standard computer processors (CPUs) instead of expensive graphics cards (GPUs). It loads an open-source LLM, often in the compact GGUF format, and produces generated text or numerical embeddings. This is ideal for those managing infrastructure for AI applications.
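As a rough illustration only (not code taken from this repository, which centers on llamafile), here is a minimal sketch of CPU-only GGUF inference using the llama-cpp-python package; the model path, prompt, and thread count are hypothetical placeholders:

from llama_cpp import Llama

# n_gpu_layers=0 keeps every layer on the CPU; n_threads tunes CPU throughput.
# The GGUF path is hypothetical; point it at any locally downloaded GGUF model.
llm = Llama(
    model_path="./model.Q4_K_M.gguf",
    n_gpu_layers=0,
    n_threads=8,
    n_ctx=2048,
)

# Text generation
out = llm("Summarize GGUF quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])

# Numerical embeddings (the model must be loaded with embedding=True)
emb = Llama(model_path="./model.Q4_K_M.gguf", embedding=True, n_gpu_layers=0)
vector = emb.embed("LLM inference and serving on CPUs")
print(len(vector))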

No commits in the last 6 months.

Use this if you need to serve LLMs with low latency and high throughput on cost-effective CPU hardware.

Not ideal if you already have dedicated GPU infrastructure and require the absolute highest performance for extremely large or complex models.

Tags: AI infrastructure · LLM deployment · edge AI · model serving · cost optimization
Badges: No License · Stale 6m · No Package · No Dependents
Maintenance 0 / 25
Adoption 5 / 25
Maturity 8 / 25
Community 8 / 25


Stars: 9
Forks: 1
Language: Jupyter Notebook
License: None
Last pushed: Dec 04, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/mddunlap924/LLM-Inference-Serving"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
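As a sketch only, the same endpoint can be queried from Python with the requests package; the response schema is not documented on this page, so the code simply prints the raw JSON:

import requests

# Same endpoint as the curl example above.
url = ("https://pt-edge.onrender.com/api/v1/quality/"
       "llm-tools/mddunlap924/LLM-Inference-Serving")
resp = requests.get(url, timeout=30)
resp.raise_for_status()  # surface HTTP errors (e.g., rate limiting)
print(resp.json())  # exact payload fields are not documented here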