deepreinforce-ai/CUDA-L2
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
CUDA-L2 uses reinforcement learning to generate custom CUDA kernels for half-precision matrix multiplication (FP16 GEMM), the foundational operation of modern AI workloads on NVIDIA GPUs. Drop-in replacement kernels run your existing half-precision matrix multiplication workloads with significantly higher throughput than standard libraries such as cuBLAS. It is aimed at AI engineers, machine learning scientists, and researchers running large language models or other compute-intensive AI applications.
Use this if you are developing or deploying AI models, especially large language models, and need to accelerate half-precision matrix multiplication performance on NVIDIA A100, RTX 3090, or H100 GPUs.
Not ideal if your workload does not involve half-precision matrix multiplication or if you are using a GPU type not specifically supported (like older or non-NVIDIA GPUs).
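For context on the operation being optimized: half-precision GEMM typically takes FP16 inputs and accumulates in FP32 before casting the result back to FP16 (the usual Tensor Core pattern, and the cuBLAS default the project benchmarks against). A minimal NumPy sketch of that numeric recipe, with hypothetical example shapes, looks like this:

```python
import numpy as np

# Hypothetical LLM-style GEMM shapes (illustrative only)
M, K, N = 128, 256, 64

rng = np.random.default_rng(0)
A = rng.standard_normal((M, K)).astype(np.float16)  # half-precision inputs
B = rng.standard_normal((K, N)).astype(np.float16)

# Accumulate in float32 to limit rounding error, then cast back to
# float16 — mirroring the FP16-in / FP32-accumulate GEMM pattern.
C = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float16)
print(C.shape)  # (128, 64)
```

This is only a reference for the numerics, not the project's kernel: CUDA-L2's contribution is the RL-discovered CUDA implementation of this same operation, tuned per GPU architecture.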
Stars: 472
Forks: 25
Language: CUDA
License: MIT
Last pushed: Jan 08, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/deepreinforce-ai/CUDA-L2"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000/day.
Higher-rated alternatives
quic/efficient-transformers
This library empowers users to seamlessly port pretrained models and checkpoints on the...
ManuelSLemos/RabbitLLM
Run 70B+ LLMs on a single 4GB GPU — no quantization required.
alpa-projects/alpa
Training and serving large-scale neural networks with auto parallelization.
arm-education/Advanced-AI-Hardware-Software-Co-Design
Hands-on course materials for ML engineers to master extreme model quantization and on-device...
IST-DASLab/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes...