hofong428/Optimizing-GPU-Kernels
LLM Serving & Inference Optimization
This project helps machine learning engineers and researchers optimize large language model (LLM) inference on GPUs. It takes your LLM inference code and compares two ways of writing GPU kernels: raw CUDA for maximum speed, and Triton for easier, faster development. The output helps you decide which technology fits a specific optimization need.
No commits in the last 6 months.
Use this if you are a machine learning engineer or researcher looking to improve the serving and inference speed of your LLMs on NVIDIA GPUs and need to understand the trade-offs between development speed and raw performance.
Not ideal if you do not work with GPU-accelerated deep learning or LLMs, or if you need highly specialized optimizations that only deep CUDA expertise can deliver.
Stars: 8
Forks: 2
Language: —
License: —
Category:
Last pushed: Oct 15, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/hofong428/Optimizing-GPU-Kernels"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
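The same request can be made from Python. The sketch below uses only the standard library and assumes the endpoint returns a JSON body, which this page does not confirm. The helper names build_url and fetch_quality are hypothetical, chosen for illustration. Because the page does not say how a key is supplied, the sketch covers only the keyless tier.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def build_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (hypothetical helper).

    Each path segment is percent-encoded so unusual characters in an
    owner or repository name cannot alter the URL structure.
    """
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Send a keyless GET request and parse the response body.

    Assumption: the response is JSON. If it is not, json.loads raises
    a ValueError. HTTP errors (for example, exceeding the daily request
    limit) surface as urllib.error.HTTPError.
    """
    with urllib.request.urlopen(build_url(owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (performs a network request, so it is left commented out):
# data = fetch_quality("hofong428", "Optimizing-GPU-Kernels")
# print(data)
```

Keeping URL construction separate from the network call makes the URL logic checkable offline, and the keyless tier's 100 requests/day limit suggests caching results rather than calling fetch_quality repeatedly for the same repository.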
Higher-rated alternatives
vllm-project/vllm-ascend
Community maintained hardware plugin for vLLM on Ascend
kvcache-ai/Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
SemiAnalysisAI/InferenceX
Open Source Continuous Inference Benchmarking Qwen3.5, DeepSeek, GPTOSS - GB200 NVL72 vs MI355X...
sophgo/tpu-mlir
Machine learning compiler based on MLIR for Sophgo TPU.
uccl-project/uccl
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache...