BenChaliah/NVFP4-on-4090-vLLM

AdaLLM is an NVFP4-first inference runtime for Ada Lovelace GPUs (RTX 4090) with an FP8 KV cache and custom decode kernels. The repo targets NVFP4 weights and keeps the entire decode path in FP8.

Score: 28 / 100 (Experimental)

This is an inference engine for developers running large language models (LLMs) on NVIDIA RTX 4090 GPUs. It loads and runs quantized LLMs, specifically Qwen3 and Gemma3, for faster processing and significantly lower VRAM usage than standard methods. It takes pre-quantized LLM weights and either generates text directly or exposes an OpenAI-compatible server for your applications. The primary users are developers working on LLM deployment and optimization.

Use this if you are a developer looking to maximize the efficiency and VRAM utilization of LLM inference on an NVIDIA RTX 4090 GPU, particularly with Qwen3 or Gemma3 models.

Not ideal if you are not a developer, do not have an NVIDIA RTX 4090 GPU, or are looking for a general-purpose LLM serving solution without specific hardware optimization needs.
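Since the engine exposes an OpenAI-compatible server, any standard OpenAI-style client can talk to it. The sketch below builds a chat-completions request with the stdlib only; the host, port, and model id are assumptions (adjust them to however you launch the server), while the request shape follows the standard `/v1/chat/completions` schema.

```python
# Minimal sketch of querying an OpenAI-compatible endpoint such as the
# server this engine exposes. Endpoint address and model name below are
# hypothetical placeholders, not values documented by this repo.
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local endpoint and model id:
req = build_chat_request("http://localhost:8000", "Qwen3-8B-NVFP4", "Hello!")
# resp = urllib.request.urlopen(req)  # uncomment with a running server
```

Keeping the request construction in a small helper makes it easy to swap the base URL between a local 4090 box and any other OpenAI-compatible backend.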

Tags: LLM deployment · GPU optimization · AI inference · model serving · quantization
No License · No Package · No Dependents
Maintenance: 10 / 25
Adoption: 9 / 25
Maturity: 3 / 25
Community: 6 / 25


Stars: 98
Forks: 3
Language: Python
License: none
Last pushed: Feb 15, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/BenChaliah/NVFP4-on-4090-vLLM"

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
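The same endpoint can be called from Python instead of curl. This sketch only composes the URL from the pattern shown in the curl example above; the JSON field names in the response are not documented here, so inspect the raw payload before relying on any of them.

```python
# Sketch of calling the quality API from Python. The URL pattern mirrors
# the curl example; uncomment the urlopen block to hit the live API.
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(platform: str, owner: str, repo: str) -> str:
    """Compose the per-repo quality endpoint URL."""
    return f"{API_BASE}/{platform}/{owner}/{repo}"


url = quality_url("transformers", "BenChaliah", "NVFP4-on-4090-vLLM")
# with urllib.request.urlopen(url) as resp:
#     data = json.load(resp)
#     print(data)
```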