gitkaz/mlx_gguf_server
This is a FastAPI-based LLM server that loads multiple LLM models (MLX or llama.cpp) simultaneously using multiprocessing.
This project helps developers serve multiple large language models (LLMs) on Apple Silicon Macs. It loads MLX- and GGUF-format models simultaneously and exposes them through a web API for text completions, chat, and audio transcription. It's designed for developers building applications that need to leverage different LLMs efficiently on macOS.
Use this if you are a developer building applications on an Apple Silicon Mac and need to host and manage multiple LLM models and transcribe audio efficiently via an API.
Not ideal if you need a production-ready, highly scalable LLM serving solution on non-Apple hardware, or if you prefer a GUI over API-driven interaction.
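As a sketch of how a client might talk to such a server: the snippet below builds a chat-style request body. Note that mlx_gguf_server is FastAPI-based per the description above, but the endpoint path, port, and payload field names here are assumptions for illustration only, not taken from the project's documentation; check the repository's README for the real API.

```python
import json

# ASSUMED default host/port for a local FastAPI server; the project may differ.
SERVER_URL = "http://localhost:8000"

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build a chat-style request body (field names are hypothetical)."""
    return {
        "model": model,  # which of the simultaneously loaded models to use
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("mlx-community/some-model", "Hello!")
body = json.dumps(payload)
# You would then POST `body` to a chat endpoint on SERVER_URL, e.g. with
# urllib.request or httpx, using the Content-Type: application/json header.
print(body)
```

The multiprocessing design means each loaded model lives in its own process, so a client-side `model` field (or equivalent routing key) is how requests would be dispatched to the right model.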
Stars
17
Forks
4
Language
Python
License
MIT
Category
Last pushed
Mar 27, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/gitkaz/mlx_gguf_server"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000 requests/day.
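The curl call above can also be scripted. The helper below builds the same quality-data URL for any owner/repo pair; the base URL is taken verbatim from the example, while the fetch itself is left as a comment since it requires network access (and the response schema is not documented here).

```python
import urllib.parse

# Base URL taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(owner: str, repo: str) -> str:
    """Build the quality-data URL for a GitHub owner/repo pair."""
    return f"{API_BASE}/{urllib.parse.quote(owner)}/{urllib.parse.quote(repo)}"

url = quality_url("gitkaz", "mlx_gguf_server")
print(url)
# To fetch the JSON (requires network):
#   import json, urllib.request
#   data = json.loads(urllib.request.urlopen(url).read())
```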
Related models
beehive-lab/GPULlama3.java
GPU-accelerated Llama3.java inference in pure Java using TornadoVM.
srgtuszy/llama-cpp-swift
Swift bindings for llama-cpp library
JackZeng0208/llama.cpp-android-tutorial
llama.cpp tutorial on Android phone
awinml/llama-cpp-python-bindings
Run fast LLM Inference using Llama.cpp in Python
RhinoDevel/mt_llm
Pure C wrapper library to use llama.cpp with Linux and Windows as simple as possible.