mddunlap924/LLM-Inference-Serving
This repository demonstrates running LLMs on CPUs with packages such as llamafile, emphasizing low latency, high throughput, and cost-effectiveness for inference and serving.
This project helps machine learning engineers and developers efficiently deploy and run large language models (LLMs) on standard computer processors (CPUs) instead of expensive graphics cards (GPUs). It takes an open-source LLM, often in the compact GGUF format, and produces the model's text generations or numerical embeddings. It is aimed at those managing infrastructure for AI applications.
No commits in the last 6 months.
Use this if you need to serve LLMs with low latency and high throughput on cost-effective CPU hardware.
Not ideal if you already have dedicated GPU infrastructure and require the absolute highest performance for extremely large or complex models.
Stars: 9
Forks: 1
Language: Jupyter Notebook
License: —
Category:
Last pushed: Dec 04, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/mddunlap924/LLM-Inference-Serving"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
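If you prefer to fetch the same record from Python instead of curl, a minimal sketch follows. The URL path is taken from the curl command above; the shape of the returned JSON is an assumption, so the example only prints whatever the endpoint returns.

```python
import json
import urllib.request

# Base endpoint, as shown in the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository quality endpoint URL."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record for a repository.

    The JSON schema of the response is not documented here,
    so we return the parsed payload as-is.
    """
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    data = fetch_quality("mddunlap924", "LLM-Inference-Serving")
    print(json.dumps(data, indent=2))
```

For authenticated use at the higher rate limit, you would attach your key per the API's documentation (the exact header or query parameter is not specified here).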
Higher-rated alternatives
thu-pacman/chitu
High-performance inference framework for large language models, focusing on efficiency,...
NotPunchnox/rkllama
Ollama alternative for Rockchip NPU: An efficient solution for running AI and Deep learning...
sophgo/LLM-TPU
Run generative AI models in sophgo BM1684X/BM1688
Deep-Spark/DeepSparkHub
DeepSparkHub selects hundreds of application algorithms and models, covering various fields of...
howard-hou/VisualRWKV
VisualRWKV is the visual-enhanced version of the RWKV language model, enabling RWKV to handle...