mddunlap924/LLM-Inference-Serving
This repository demonstrates running LLMs on CPUs with packages such as llamafile, emphasizing low latency, high throughput, and cost-effectiveness for inference and serving.
This project helps machine learning engineers and developers efficiently deploy and run large language models (LLMs) on standard computer processors (CPUs) instead of expensive graphics cards (GPUs). It takes an open-source LLM, often in the compact GGUF format, and produces the model's text generations or numerical embeddings. It is aimed at those managing infrastructure for AI applications.
No commits in the last 6 months.
Use this if you need to serve LLMs with low latency and high throughput on cost-effective CPU hardware.
Not ideal if you already have dedicated GPU infrastructure and require the absolute highest performance for extremely large or complex models.
Stars: 9
Forks: 1
Language: Jupyter Notebook
License: —
Category:
Last pushed: Dec 04, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/mddunlap924/LLM-Inference-Serving"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
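If you prefer to fetch the same record from Python instead of curl, a minimal sketch follows. The URL path is taken from the curl command above; the shape of the returned JSON is an assumption, so the example only prints whatever the endpoint returns.

```python
import json
import urllib.request

# Base endpoint, as shown in the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository quality endpoint URL."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record for a repository.

    The JSON schema of the response is not documented here,
    so we return the parsed payload as-is.
    """
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    data = fetch_quality("mddunlap924", "LLM-Inference-Serving")
    print(json.dumps(data, indent=2))
```

For authenticated use at the higher rate limit, you would attach your key per the API's documentation (the exact header or query parameter is not specified here).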
Higher-rated alternatives
thu-pacman/chitu
High-performance inference framework for large language models, focusing on efficiency,...
NotPunchnox/rkllama
Ollama alternative for Rockchip NPU: An efficient solution for running AI and Deep learning...
sophgo/LLM-TPU
Run generative AI models in sophgo BM1684X/BM1688
Deep-Spark/DeepSparkHub
DeepSparkHub selects hundreds of application algorithms and models, covering various fields of...
howard-hou/VisualRWKV
VisualRWKV is the visual-enhanced version of the RWKV language model, enabling RWKV to handle...