thansen0/fastllm.cpp
A low-latency, fault-tolerant API for accessing LLMs, written in C++ on top of llama.cpp.
This project helps developers and system architects deploy large language models (LLMs) on their own infrastructure with low response latency. It takes a pre-trained LLM in GGUF format and exposes it as an API service that other applications can call. It suits backend engineers and MLOps specialists building applications that depend on immediate LLM responses.
No commits in the last 6 months.
Use this if you need to integrate a private, high-speed LLM inference service directly into your applications, avoiding the latency of external cloud APIs.
Not ideal if you are looking for a pre-packaged, ready-to-use LLM without local setup or if your application can tolerate higher latency from cloud-based LLM providers.
Stars: 11
Forks: —
Language: C++
License: Unlicense
Category:
Last pushed: Jun 14, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/thansen0/fastllm.cpp"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
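The same request can be made from a script. The following is a minimal Python sketch, not an official client. It assumes the URL follows the pattern visible in the curl command above (`/api/v1/quality/<category>/<owner>/<repo>`), and the helper names `build_quality_url` and `fetch_quality` are hypothetical. The response format is not described on this page, so the sketch returns the raw response text rather than parsing it.

```python
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, owner, repo):
    """Build the endpoint URL (hypothetical helper).

    Assumes the path layout <category>/<owner>/<repo>, inferred from
    the single curl example; other layouts may exist.
    """
    parts = [quote(p, safe="") for p in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(url, timeout=10):
    """Fetch the endpoint and return the raw response body as text.

    The response schema is not documented here, so nothing is parsed.
    Unauthenticated access is limited to 100 requests/day.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_quality(build_quality_url("transformers", "thansen0", "fastllm.cpp"))` issues the same GET request as the curl command. Keeping URL construction separate from the network call makes it easy to check the URL without spending any of the daily request quota.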
Higher-rated alternatives
beehive-lab/GPULlama3.java
GPU-accelerated Llama3.java inference in pure Java using TornadoVM.
gitkaz/mlx_gguf_server
This is a FastAPI based LLM server. Load multiple LLM models (MLX or llama.cpp) simultaneously...
srgtuszy/llama-cpp-swift
Swift bindings for llama-cpp library
JackZeng0208/llama.cpp-android-tutorial
llama.cpp tutorial on Android phone
awinml/llama-cpp-python-bindings
Run fast LLM Inference using Llama.cpp in Python