ml-rust/splintr
A high-performance tokenizer (BPE + SentencePiece) built in Rust with Python bindings, focused on speed, safety, and resource efficiency.
This tool helps AI engineers and machine learning practitioners quickly convert large volumes of text into tokens and back. It takes raw text inputs such as prompts, documents, or training data and outputs the numerical token IDs that large language models (LLMs) operate on. It is ideal for anyone working with LLMs who needs to prepare data efficiently or process model outputs in real time.
Use this if you are an AI engineer or ML practitioner building LLM applications, training models, or processing large text datasets and need significantly faster tokenization than existing Python-based solutions.
Not ideal if you work with very small, infrequent text inputs or if tokenization speed is not a bottleneck in your workflow.
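To make the description above concrete, here is a minimal, self-contained sketch of byte-level BPE training, the core idea behind tokenizers like this one. This is an illustrative toy, not splintr's actual implementation or API: the function names and the byte-level starting vocabulary are assumptions for the example.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent id pairs and return the most common one (None if too short)."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def bpe_train(text, num_merges):
    """Learn up to `num_merges` merge rules over the UTF-8 bytes of `text`.

    Starts from the 256 raw byte values and repeatedly fuses the most
    frequent adjacent pair into a new token id, shrinking the sequence.
    """
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, 256 + num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return ids, merges
```

For example, one merge pass over `"aaabdaaabac"` fuses the most frequent byte pair `(97, 97)` (i.e. `"aa"`) into a single new token, shrinking 11 bytes to 9 tokens. A production tokenizer applies thousands of such learned merges, in Rust, across many documents in parallel.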
Stars
57
Forks
5
Language
Python
License
MIT
Category
Last pushed
Mar 12, 2026
Monthly downloads
130
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/ml-rust/splintr"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
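The same endpoint can be queried from Python with the standard library. A minimal sketch, assuming the endpoint returns a JSON object; the field names used in `summarize` (`name`, `stars`) are assumptions for illustration, since the response schema is not documented here.

```python
import json
from urllib.request import urlopen

API_URL = "https://pt-edge.onrender.com/api/v1/quality/nlp/ml-rust/splintr"

def fetch_quality(url: str = API_URL) -> dict:
    """Fetch the quality record for the repo (100 requests/day without a key)."""
    with urlopen(url) as resp:
        return json.load(resp)

def summarize(record: dict) -> str:
    """Render a one-line summary; the JSON field names here are assumed."""
    return f"{record.get('name', '?')}: {record.get('stars', '?')} stars"
```

With a key (free, 1,000 requests/day), the same call would carry the key as an auth header or query parameter, per the API's own documentation.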
Higher-rated alternatives
georg-jung/FastBertTokenizer
Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.
sanderland/script_tok
Code for the paper "BPE stays on SCRIPT"
ash-01xor/bpe.c
A simple byte-pair encoding mechanism for tokenization, written purely in C.
U4RASD/r-bpe
R-BPE: Improving BPE-Tokenizers with Token Reuse
jmaczan/bpe-tokenizer
Byte-Pair Encoding tokenizer for training large language models on huge datasets