RAZZULLIX/fast_topk_batched
High-performance batched Top-K selection for CPU inference. Up to 80x faster than PyTorch, optimized for LLM sampling with AVX2 SIMD.
This project helps machine learning engineers accelerate the selection of the most probable next tokens when generating text with large language models (LLMs) on standard CPUs. You provide raw prediction scores (logits) for the candidate tokens of multiple input sequences, and it quickly returns the top-K most likely token IDs for each sequence. It's designed for developers building or deploying LLM inference systems who need to maximize performance without dedicated GPU hardware.
Use this if you are a machine learning engineer running LLM inference on CPU and need to significantly speed up the 'top-K' sampling step for text generation.
Not ideal if you are primarily running LLM inference on GPUs, or if your application does not involve LLM text generation.
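The operation described above can be sketched in plain NumPy. Note this is an illustration of batched top-K selection itself, not fast_topk_batched's real C++ API; the function name and signature below are invented for the example, and the library's SIMD implementation will differ.

```python
import numpy as np

def batched_topk(logits: np.ndarray, k: int):
    """Return (values, indices) of the k largest logits per row.

    logits: (batch, vocab) array of raw scores, one row per sequence.
    Illustrative only -- not the library's actual interface.
    """
    # argpartition places the k largest entries (unordered) in the last k slots
    part = np.argpartition(logits, -k, axis=-1)[:, -k:]
    vals = np.take_along_axis(logits, part, axis=-1)
    # sort those k entries descending, as sampling code usually expects
    order = np.argsort(-vals, axis=-1)
    idx = np.take_along_axis(part, order, axis=-1)
    return np.take_along_axis(logits, idx, axis=-1), idx

# Example: 2 sequences, vocabulary of 5, select the top-2 token IDs each
logits = np.array([[0.1, 2.0, 0.3, 1.5, -1.0],
                   [5.0, 0.0, 4.0, 4.5,  1.0]])
vals, ids = batched_topk(logits, k=2)
# ids -> [[1, 3], [0, 3]]
```

Using `argpartition` before sorting keeps the cost at O(vocab + k log k) per row rather than sorting the whole vocabulary; the library's speedup over such baselines comes from AVX2 SIMD and batching, per its own description.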
Stars: 16
Forks: 2
Language: C++
License: MIT
Category:
Last pushed: Jan 19, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/RAZZULLIX/fast_topk_batched"
Open to everyone: 100 requests/day with no key needed; a free key raises the limit to 1,000/day.
Higher-rated alternatives
NVIDIA/TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit...
mlcommons/inference
Reference implementations of MLPerf® inference benchmarks
mlcommons/training
Reference implementations of MLPerf® training benchmarks
datamade/usaddress
:us: a python library for parsing unstructured United States address strings into address components
GRAAL-Research/deepparse
Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning