BenChaliah/NVFP4-on-4090-vLLM
AdaLLM is an NVFP4-first inference runtime for Ada Lovelace (RTX 4090) with an FP8 KV cache and custom decode kernels. The repo targets NVFP4 weights and keeps the entire decode path in FP8.
This is an inference engine for developers running large language models (LLMs) on NVIDIA RTX 4090 GPUs. It loads pre-quantized LLMs, specifically Qwen3 and Gemma3, for faster processing and significantly reduced VRAM usage compared to standard methods. Given pre-quantized weights, it either generates text directly or exposes an OpenAI-compatible server for your applications. The primary users are developers working on LLM deployment and optimization.
Use this if you are a developer looking to maximize the efficiency and VRAM utilization of LLM inference on an NVIDIA RTX 4090 GPU, particularly with Qwen3 or Gemma3 models.
Not ideal if you lack an NVIDIA RTX 4090 GPU or want a general-purpose LLM serving solution without hardware-specific optimization.
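Because the runtime exposes an OpenAI-compatible server, standard HTTP clients work unchanged. A minimal sketch, assuming the server listens on localhost:8000 with the usual `/v1/chat/completions` path (host, port, and model name here are placeholders, not values from this repo):

```python
import json
import urllib.request

# Hypothetical local endpoint; adjust host/port to your deployment.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def post_chat(payload: dict) -> dict:
    """POST the payload to the server and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Any OpenAI SDK pointed at the same base URL should work the same way; the raw-HTTP form above just keeps the sketch dependency-free.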
Stars: 98
Forks: 3
Language: Python
License: —
Category:
Last pushed: Feb 15, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/BenChaliah/NVFP4-on-4090-vLLM"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000/day.
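The same endpoint can be queried programmatically. A minimal sketch, assuming the endpoint returns JSON; the `X-API-Key` header name for keyed access is an assumption, not documented here:

```python
import json
from typing import Optional
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(ecosystem: str, repo: str) -> str:
    """Build the quality-endpoint URL for a repo within an ecosystem (e.g. 'transformers')."""
    return f"{API_BASE}/{ecosystem}/{repo}"


def fetch_quality(ecosystem: str, repo: str, api_key: Optional[str] = None) -> dict:
    """Fetch the quality record as a dict; a key lifts the 100/day anonymous limit."""
    headers = {"X-API-Key": api_key} if api_key else {}  # header name is an assumption
    req = urllib.request.Request(quality_url(ecosystem, repo), headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `fetch_quality("transformers", "BenChaliah/NVFP4-on-4090-vLLM")` requests the same record as the curl command above.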
Higher-rated alternatives
vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs
sgl-project/sglang: SGLang is a high-performance serving framework for large language models and multimodal models.
alibaba/MNN: MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
xorbitsai/inference: Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
tensorzero/tensorzero: TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...