wangsiping97/FastGEMV
High-speed GEMV kernels, achieving up to 2.7x speedup over the PyTorch baseline.
This tool helps high-performance computing engineers and machine learning infrastructure developers optimize matrix-vector multiplication, a fundamental operation in many AI models. It takes large matrices (up to 16,384x16,384) and vectors in various data formats (fp16, int8, int4) and computes their product significantly faster on NVIDIA GPUs. It is aimed at professionals building and deploying AI systems who need to squeeze maximum performance out of their hardware.
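For reference, GEMV (general matrix-vector multiply) computes y = A·x. The sketch below is a minimal pure-Python illustration of the operation these kernels accelerate; it is not the repo's API, and FastGEMV performs this in CUDA with fp16/int8/int4 storage rather than Python floats.

```python
# Illustrative GEMV: y = A @ x, with A stored as a list of rows.
# This is only a semantic sketch of the operation FastGEMV accelerates.

def gemv(A, x):
    """Multiply an m x n matrix A (list of rows) by a length-n vector x."""
    assert all(len(row) == len(x) for row in A)
    return [sum(a * b for a, b in zip(row, x)) for row in A]

A = [[1.0, 2.0],
     [3.0, 4.0]]
x = [5.0, 6.0]
print(gemv(A, x))  # -> [17.0, 39.0]
```

Each output element is a dot product of one matrix row with the vector, which is why GEMV is memory-bandwidth-bound and benefits from tuned GPU kernels.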
128 stars. No commits in the last 6 months.
Use this if you are a developer optimizing deep learning model inference or other GPU-accelerated linear algebra tasks and need to speed up matrix-vector multiplications beyond what standard libraries offer.
Not ideal if you are a data scientist or researcher working at a higher level of abstraction and not directly optimizing CUDA kernel performance.
Stars
128
Forks
8
Language
CUDA
License
MIT
Category
ML Frameworks
Last pushed
Jul 13, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/wangsiping97/FastGEMV"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000/day.
Higher-rated alternatives
brucefan1983/GPUMD
Graphics Processing Units Molecular Dynamics
iree-org/iree
A retargetable MLIR-based machine learning compiler and runtime toolkit.
uxlfoundation/oneDAL
oneAPI Data Analytics Library (oneDAL)
rapidsai/cuml
cuML - RAPIDS Machine Learning Library
NVIDIA/cutlass
CUDA Templates and Python DSLs for High-Performance Linear Algebra