guillaume-be/rust-tokenizers

rust-tokenizers offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE), and Unigram (SentencePiece) models.

Quality score: 61 / 100 (Established)

This is a high-performance library that helps developers prepare text for use with large language models, such as BERT, GPT, and RoBERTa. It takes raw text input and converts it into numerical tokens, which are then fed into machine learning models. The primary users are developers building applications that process natural language, such as chatbots, sentiment analysis tools, or machine translation systems.

336 stars and 8,112 monthly downloads.

Use this if you are a developer working with Rust or Python and need to efficiently tokenize text for modern language models such as BERT or GPT, using schemes like WordPiece, BPE, or SentencePiece.

Not ideal if you are a non-developer looking for a ready-to-use text-analysis application, or if you don't work with language models.

natural-language-processing machine-learning-engineering text-pre-processing AI-development computational-linguistics
No package · No dependents
Maintenance 10 / 25
Adoption 19 / 25
Maturity 16 / 25
Community 16 / 25

How are scores calculated?

Stars

336

Forks

33

Language

Rust

License

Apache-2.0

Last pushed

Jan 22, 2026

Monthly downloads

8,112

Commits (30d)

0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/guillaume-be/rust-tokenizers"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
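The same endpoint can be queried from Python. A minimal sketch, assuming only the URL scheme shown in the curl command above (the shape of the JSON response is not documented here, so the example just fetches and prints the raw payload):

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, repo: str) -> str:
    # Build the quality endpoint URL for a repo, e.g.
    # category="nlp", repo="guillaume-be/rust-tokenizers".
    # quote() escapes unsafe characters; safe="/" keeps the
    # owner/name slash intact.
    return f"{BASE}/{quote(category)}/{quote(repo, safe='/')}"

url = quality_url("nlp", "guillaume-be/rust-tokenizers")
print(url)

# Uncomment to fetch live data (no API key needed, 100 requests/day):
# with urlopen(url) as resp:
#     data = json.load(resp)
#     print(json.dumps(data, indent=2))
```

The live request is left commented out so the snippet runs offline; swap in your own category and repo slug as needed.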