guillaume-be/rust-tokenizers
Rust-tokenizers offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE), and Unigram (SentencePiece) models.
This is a high-performance library that helps developers prepare text for use with language models such as BERT, GPT, and RoBERTa. It converts raw text into numerical tokens, which are then fed into machine learning models. The primary users are developers building applications that process natural language, such as chatbots, sentiment analysis tools, or machine translation systems.
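To make the conversion concrete, the sketch below re-implements the greedy longest-match-first scheme used by WordPiece vocabularies (one of the tokenizer families the library supports). This is an illustrative, self-contained re-implementation with a toy vocabulary, not the rust-tokenizers API itself.

```rust
use std::collections::HashSet;

// Greedy longest-match-first WordPiece tokenization (a sketch).
// Each word is split into the longest vocabulary pieces available;
// non-initial pieces carry the "##" continuation prefix.
fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Shrink the candidate span from the right until it is in the vocab.
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{}", piece); // mark word continuation
            }
            if vocab.contains(piece.as_str()) {
                found = Some(piece);
                break;
            }
            end -= 1;
        }
        match found {
            Some(p) => {
                pieces.push(p);
                start = end;
            }
            // No prefix matched: the whole word is out-of-vocabulary.
            None => return vec!["[UNK]".to_string()],
        }
    }
    pieces
}

fn main() {
    let vocab: HashSet<&str> =
        ["un", "##aff", "##able", "play", "##ing"].into_iter().collect();
    println!("{:?}", wordpiece("unaffable", &vocab));
    // -> ["un", "##aff", "##able"]
}
```

In a real pipeline each emitted piece is then mapped to its integer id in the vocabulary; rust-tokenizers performs that step (plus special tokens, truncation, and offsets) for you.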
336 stars and 8,112 monthly downloads.
Use this if you are a developer working with Rust or Python and need to efficiently tokenize text for models such as BERT, GPT, or RoBERTa, using WordPiece, BPE, or SentencePiece vocabularies.
Not ideal if you are a non-developer seeking a ready-to-use application for text analysis or if you don't work with language models.
Stars
336
Forks
33
Language
Rust
License
Apache-2.0
Category
Last pushed
Jan 22, 2026
Monthly downloads
8,112
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/guillaume-be/rust-tokenizers"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
Related tools
sugarme/tokenizer
NLP tokenizers written in Go language
elixir-nx/tokenizers
Elixir bindings for 🤗 Tokenizers
openscilab/tocount
ToCount: Lightweight Token Estimator
reinfer/blingfire-rs
Rust wrapper for the BlingFire tokenization library
Scurrra/ubpe
Universal (general sequence) Byte-Pair Encoding