BBuf/how-to-optim-algorithm-in-cuda

How to optimize some algorithms in CUDA.

/ 100

Established

This project provides optimized CUDA implementations of fundamental deep learning operations such as reduce, softmax, and element-wise calculations. It takes existing kernels and demonstrates how to improve their GPU performance, giving practitioners faster computation for common machine learning tasks. Researchers and engineers working with deep learning frameworks and large models may find it valuable for accelerating model training and inference.

2,863 stars. Actively maintained with 11 commits in the last 30 days.

Use this if you are a machine learning engineer or researcher looking to understand and apply advanced CUDA optimization techniques to speed up your deep learning models and operations.

Not ideal if you are a beginner looking for an introductory guide to GPU programming or if you primarily work with pre-optimized, high-level deep learning libraries without needing to delve into custom kernel development.

deep-learning-optimization GPU-computing neural-network-performance ML-infrastructure large-language-models

No License No Package No Dependents

Maintenance 17 / 25

Adoption 10 / 25

Maturity 8 / 25

Community 20 / 25


Stars: 2,863

Forks: 260

Language: Cuda

License: —

Related tools

vllm-project/vllm-ascend

Community maintained hardware plugin for vLLM on Ascend

SemiAnalysisAI/InferenceX

Open Source Continuous Inference Benchmarking Qwen3.5, DeepSeek, GPTOSS - GB200 NVL72 vs MI355X...

kvcache-ai/Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

uccl-project/uccl

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache...

sophgo/tpu-mlir

Machine learning compiler based on MLIR for Sophgo TPU.
