BenChaliah/NVFP4-on-4090-vLLM
AdaLLM is an NVFP4-first inference runtime for Ada Lovelace (RTX 4090) with an FP8 KV cache and custom decode kernels. The repo targets NVFP4 weights and keeps the entire decode path in FP8.
This is an inference engine for developers running large language models (LLMs) on NVIDIA RTX 4090 GPUs. It loads pre-quantized LLMs, specifically Qwen3 and Gemma3, for faster processing and significantly reduced VRAM usage compared to standard methods. Given pre-quantized weights, it either generates text directly or exposes an OpenAI-compatible server for your applications. The primary users are developers working on LLM deployment and optimization.
Use this if you are a developer looking to maximize the efficiency and VRAM utilization of LLM inference on an NVIDIA RTX 4090 GPU, particularly with Qwen3 or Gemma3 models.
Not ideal if you lack an NVIDIA RTX 4090 GPU or want a general-purpose LLM serving solution without hardware-specific optimization.
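Because the runtime exposes an OpenAI-compatible server, standard HTTP clients work unchanged. A minimal sketch, assuming the server listens on localhost:8000 with the usual `/v1/chat/completions` path (host, port, and model name here are placeholders, not values from this repo):

```python
import json
import urllib.request

# Hypothetical local endpoint; adjust host/port to your deployment.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def post_chat(payload: dict) -> dict:
    """POST the payload to the server and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Any OpenAI SDK pointed at the same base URL should work the same way; the raw-HTTP form above just keeps the sketch dependency-free.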
Stars: 98
Forks: 3
Language: Python
License: —
Category:
Last pushed: Feb 15, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/BenChaliah/NVFP4-on-4090-vLLM"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000/day.
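The same endpoint can be queried programmatically. A minimal sketch, assuming the endpoint returns JSON; the `X-API-Key` header name for keyed access is an assumption, not documented here:

```python
import json
from typing import Optional
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(ecosystem: str, repo: str) -> str:
    """Build the quality-endpoint URL for a repo within an ecosystem (e.g. 'transformers')."""
    return f"{API_BASE}/{ecosystem}/{repo}"


def fetch_quality(ecosystem: str, repo: str, api_key: Optional[str] = None) -> dict:
    """Fetch the quality record as a dict; a key lifts the 100/day anonymous limit."""
    headers = {"X-API-Key": api_key} if api_key else {}  # header name is an assumption
    req = urllib.request.Request(quality_url(ecosystem, repo), headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `fetch_quality("transformers", "BenChaliah/NVFP4-on-4090-vLLM")` requests the same record as the curl command above.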
Higher-rated alternatives
vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs
sgl-project/sglang: SGLang is a high-performance serving framework for large language models and multimodal models.
alibaba/MNN: MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
xorbitsai/inference: Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
tensorzero/tensorzero: TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...