thansen0/fastllm.cpp

A low-latency, fault-tolerant API for accessing LLMs, written in C++ using llama.cpp.


Experimental

This project helps developers and system architects deploy large language models (LLMs) on their own infrastructure with low-latency responses. It takes a pre-trained LLM in GGUF format and exposes it through an API service that other applications can call. It suits backend engineers and MLOps specialists building applications that depend on immediate LLM responses.

No commits in the last 6 months.

Use this if you need to integrate a private, high-speed LLM inference service directly into your applications, avoiding the latency of external cloud APIs.

Not ideal if you are looking for a pre-packaged, ready-to-use LLM without local setup or if your application can tolerate higher latency from cloud-based LLM providers.

LLM deployment, API development, low-latency systems, MLOps, backend engineering

Stale 6m, No Package, No Dependents

Maintenance 2 / 25

Adoption 5 / 25

Maturity 16 / 25

Community 0 / 25



Language: C++

License: Unlicense

Higher-rated alternatives

beehive-lab/GPULlama3.java

GPU-accelerated Llama3.java inference in pure Java using TornadoVM.

gitkaz/mlx_gguf_server

This is a FastAPI based LLM server. Load multiple LLM models (MLX or llama.cpp) simultaneously...

srgtuszy/llama-cpp-swift

Swift bindings for llama-cpp library

JackZeng0208/llama.cpp-android-tutorial

llama.cpp tutorial on Android phone

awinml/llama-cpp-python-bindings

Run fast LLM Inference using Llama.cpp in Python
