gitkaz/mlx_gguf_server
This is a FastAPI-based LLM server that loads multiple LLM models (MLX or llama.cpp) simultaneously using multiprocessing.
This project helps developers serve multiple large language models (LLMs) on Apple Silicon Macs. It loads MLX- and GGUF-format models simultaneously and exposes them through a web API for text completions, chat, and audio transcription. It's designed for developers building applications that need to leverage different LLMs efficiently on macOS.
Use this if you are a developer building applications on an Apple Silicon Mac and need to host and manage multiple LLM models and transcribe audio efficiently via an API.
Not ideal if you need a production-ready, highly scalable LLM serving solution on non-Apple hardware, or if you prefer a GUI over API-driven interaction.
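As a sketch of how a client might talk to such a server: the snippet below builds a chat-style request body. Note that mlx_gguf_server is FastAPI-based per the description above, but the endpoint path, port, and payload field names here are assumptions for illustration only, not taken from the project's documentation; check the repository's README for the real API.

```python
import json

# ASSUMED default host/port for a local FastAPI server; the project may differ.
SERVER_URL = "http://localhost:8000"

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build a chat-style request body (field names are hypothetical)."""
    return {
        "model": model,  # which of the simultaneously loaded models to use
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("mlx-community/some-model", "Hello!")
body = json.dumps(payload)
# You would then POST `body` to a chat endpoint on SERVER_URL, e.g. with
# urllib.request or httpx, using the Content-Type: application/json header.
print(body)
```

The multiprocessing design means each loaded model lives in its own process, so a client-side `model` field (or equivalent routing key) is how requests would be dispatched to the right model.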
Stars
17
Forks
4
Language
Python
License
MIT
Category
Last pushed
Mar 27, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/gitkaz/mlx_gguf_server"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000 requests/day.
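The curl call above can also be scripted. The helper below builds the same quality-data URL for any owner/repo pair; the base URL is taken verbatim from the example, while the fetch itself is left as a comment since it requires network access (and the response schema is not documented here).

```python
import urllib.parse

# Base URL taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(owner: str, repo: str) -> str:
    """Build the quality-data URL for a GitHub owner/repo pair."""
    return f"{API_BASE}/{urllib.parse.quote(owner)}/{urllib.parse.quote(repo)}"

url = quality_url("gitkaz", "mlx_gguf_server")
print(url)
# To fetch the JSON (requires network):
#   import json, urllib.request
#   data = json.loads(urllib.request.urlopen(url).read())
```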
Related models
beehive-lab/GPULlama3.java
GPU-accelerated Llama3.java inference in pure Java using TornadoVM.
srgtuszy/llama-cpp-swift
Swift bindings for llama-cpp library
JackZeng0208/llama.cpp-android-tutorial
llama.cpp tutorial on Android phone
awinml/llama-cpp-python-bindings
Run fast LLM Inference using Llama.cpp in Python
RhinoDevel/mt_llm
Pure C wrapper library to use llama.cpp with Linux and Windows as simple as possible.