ddickmann/vllm-factory
Production inference for encoder models - ColBERT, GLiNER, ColPali, embeddings etc. - as vLLM plugins for online and in-process deployment
This project helps data scientists and ML engineers deploy specialized encoder models such as ColBERT, GLiNER, and other structured-prediction models for real-time use. Given text or images, it returns relevant search results, extracted entities, or linked information with low latency. It targets teams running AI applications who need to serve these models at high volume.
Available on PyPI.
Use this if you need to serve encoder-based models, such as retrieval or entity-extraction models, in production with high throughput and low latency.
Not ideal if you are working with large language models (LLMs) that generate long text sequences; this project is optimized for encoder models that produce embeddings or structured outputs.
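As a sketch of what an online deployment could look like, the snippet below assembles a request for a vLLM-style OpenAI-compatible embeddings endpoint. The server URL, endpoint path, and model name are assumptions for illustration; this listing does not document vllm-factory's actual API surface.

```python
import json
from urllib import request

# Hypothetical server details; vllm-factory's real endpoint and model names may differ.
SERVER_URL = "http://localhost:8000/v1/embeddings"  # assumed OpenAI-compatible path
MODEL_NAME = "colbert-ir/colbertv2.0"               # example encoder model

def build_embedding_request(texts, model=MODEL_NAME):
    """Build the JSON body for an OpenAI-style embeddings call."""
    return {"model": model, "input": texts}

def post_embeddings(texts, url=SERVER_URL):
    """POST the payload to a running server and return the parsed JSON response."""
    body = json.dumps(build_embedding_request(texts)).encode("utf-8")
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

# Payload construction works offline; post_embeddings needs a running server.
payload = build_embedding_request(["what is ColBERT?", "late interaction retrieval"])
print(payload["model"], len(payload["input"]))
```

The same request shape would apply to other encoder tasks if the server exposes them under task-specific routes; check the project's own documentation for the actual paths.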
Stars
15
Forks
1
Language
Python
License
Apache-2.0
Category
Last pushed
Apr 02, 2026
Commits (30d)
0
Dependencies
6
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/ddickmann/vllm-factory"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
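For scripting against this quality API from Python, a small helper can assemble the per-repository URL. The `/api/v1/quality/embeddings/{owner}/{repo}` path is taken from the curl example above; the response schema is not documented here, so the fetch helper just returns raw parsed JSON.

```python
import json
from urllib import request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/embeddings"

def quality_url(owner, repo):
    """Build the per-repository quality endpoint URL."""
    return f"{API_BASE}/{owner}/{repo}"

def fetch_quality(owner, repo):
    """Fetch the quality record (subject to the 100 requests/day free limit)."""
    with request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

print(quality_url("ddickmann", "vllm-factory"))
# → https://pt-edge.onrender.com/api/v1/quality/embeddings/ddickmann/vllm-factory
```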
Higher-rated alternatives
byte5ai/palaia
Palaia — Local, crash-safe memory for AI agents. Semantic vector search...
j33pguy/magi
MAGI — Multi-Agent Graph Intelligence. Universal memory server for AI agents. MCP + gRPC + REST...
LLMSystems/TensorrtServer
A high-performance deep learning model inference server based on TensorRT, supporting fast...
abdullah85398/embedding-server
A high-performance, self-hosted, model-agnostic embedding service designed for LLM applications,...
alez007/yasha
Self-hosted, multi-model AI inference server. Run LLMs, TTS, STT, embeddings, and image...