wangsiping97/FastGEMV
High-speed GEMV kernels, achieving up to 2.7x speedup over the PyTorch baseline.
This tool helps high-performance computing engineers and machine learning infrastructure developers optimize matrix-vector multiplication, a fundamental operation in many AI models. It takes large matrices (up to 16,384x16,384) and vectors in various data formats (fp16, int8, int4) and computes their product significantly faster on NVIDIA GPUs. It is aimed at professionals building and deploying AI systems who need to squeeze maximum performance out of their hardware.
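For reference, GEMV (general matrix-vector multiply) computes y = A·x. The sketch below is a minimal pure-Python illustration of the operation these kernels accelerate; it is not the repo's API, and FastGEMV performs this in CUDA with fp16/int8/int4 storage rather than Python floats.

```python
# Illustrative GEMV: y = A @ x, with A stored as a list of rows.
# This is only a semantic sketch of the operation FastGEMV accelerates.

def gemv(A, x):
    """Multiply an m x n matrix A (list of rows) by a length-n vector x."""
    assert all(len(row) == len(x) for row in A)
    return [sum(a * b for a, b in zip(row, x)) for row in A]

A = [[1.0, 2.0],
     [3.0, 4.0]]
x = [5.0, 6.0]
print(gemv(A, x))  # -> [17.0, 39.0]
```

Each output element is a dot product of one matrix row with the vector, which is why GEMV is memory-bandwidth-bound and benefits from tuned GPU kernels.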
128 stars. No commits in the last 6 months.
Use this if you are a developer optimizing deep learning model inference or other GPU-accelerated linear algebra tasks and need to speed up matrix-vector multiplications beyond what standard libraries offer.
Not ideal if you are a data scientist or researcher working at a higher level of abstraction and not directly optimizing CUDA kernel performance.
Stars
128
Forks
8
Language
CUDA
License
MIT
Category
ML Frameworks
Last pushed
Jul 13, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/wangsiping97/FastGEMV"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000/day.
Higher-rated alternatives
brucefan1983/GPUMD
Graphics Processing Units Molecular Dynamics
iree-org/iree
A retargetable MLIR-based machine learning compiler and runtime toolkit.
uxlfoundation/oneDAL
oneAPI Data Analytics Library (oneDAL)
rapidsai/cuml
cuML - RAPIDS Machine Learning Library
NVIDIA/cutlass
CUDA Templates and Python DSLs for High-Performance Linear Algebra