hofong428/Optimizing-GPU-Kernels

LLM Serving & Inference Optimization

/ 100

Experimental

This project helps machine learning engineers and researchers optimize the inference performance of large language models (LLMs) on GPUs. Given your LLM inference code, it provides insights and comparisons between raw CUDA, which targets maximum speed, and Triton, which makes GPU kernels easier and faster to develop. The output helps you decide which technology fits a specific optimization need.

No commits in the last 6 months.

Use this if you are a machine learning engineer or researcher looking to improve the serving and inference speed of your LLMs on NVIDIA GPUs and need to understand the trade-offs between development speed and raw performance.

Not ideal if you do not work with GPU-accelerated deep learning or LLMs, or if you need extreme, highly specialized optimizations that only deep CUDA expertise can deliver.

LLM-serving GPU-optimization deep-learning-inference machine-learning-engineering model-deployment

No License, Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 4 / 25

Maturity 8 / 25

Community 13 / 25

Higher-rated alternatives

vllm-project/vllm-ascend

Community maintained hardware plugin for vLLM on Ascend

kvcache-ai/Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

SemiAnalysisAI/InferenceX

Open Source Continuous Inference Benchmarking Qwen3.5, DeepSeek, GPTOSS - GB200 NVL72 vs MI355X...

sophgo/tpu-mlir

Machine learning compiler based on MLIR for Sophgo TPU.

uccl-project/uccl

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache...
