deepreinforce-ai/CUDA-L2
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
CUDA-L2 uses reinforcement learning to generate custom CUDA kernels for half-precision matrix multiplication (FP16 GEMM), the foundational operation of modern AI workloads on NVIDIA GPUs. Drop-in replacement kernels run your existing half-precision matrix multiplication workloads with significantly higher throughput than standard libraries such as cuBLAS. It is aimed at AI engineers, machine learning scientists, and researchers running large language models or other compute-intensive AI applications.
Use this if you are developing or deploying AI models, especially large language models, and need to accelerate half-precision matrix multiplication performance on NVIDIA A100, RTX 3090, or H100 GPUs.
Not ideal if your workload does not involve half-precision matrix multiplication or if you are using a GPU type not specifically supported (like older or non-NVIDIA GPUs).
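For context on the operation being optimized: half-precision GEMM typically takes FP16 inputs and accumulates in FP32 before casting the result back to FP16 (the usual Tensor Core pattern, and the cuBLAS default the project benchmarks against). A minimal NumPy sketch of that numeric recipe, with hypothetical example shapes, looks like this:

```python
import numpy as np

# Hypothetical LLM-style GEMM shapes (illustrative only)
M, K, N = 128, 256, 64

rng = np.random.default_rng(0)
A = rng.standard_normal((M, K)).astype(np.float16)  # half-precision inputs
B = rng.standard_normal((K, N)).astype(np.float16)

# Accumulate in float32 to limit rounding error, then cast back to
# float16 — mirroring the FP16-in / FP32-accumulate GEMM pattern.
C = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float16)
print(C.shape)  # (128, 64)
```

This is only a reference for the numerics, not the project's kernel: CUDA-L2's contribution is the RL-discovered CUDA implementation of this same operation, tuned per GPU architecture.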
Stars: 472
Forks: 25
Language: CUDA
License: MIT
Last pushed: Jan 08, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/deepreinforce-ai/CUDA-L2"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000/day.
Higher-rated alternatives
quic/efficient-transformers
This library empowers users to seamlessly port pretrained models and checkpoints on the...
ManuelSLemos/RabbitLLM
Run 70B+ LLMs on a single 4GB GPU — no quantization required.
alpa-projects/alpa
Training and serving large-scale neural networks with auto parallelization.
arm-education/Advanced-AI-Hardware-Software-Co-Design
Hands-on course materials for ML engineers to master extreme model quantization and on-device...
IST-DASLab/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes...