huggingface/tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Score: 90 / 100 · Verified

When working with large volumes of text for natural language processing, this tool converts raw text into a format that machine learning models can understand. It takes raw text documents as input and produces a vocabulary and tokens: numerical representations of words or sub-word units. This is essential for AI researchers and machine learning engineers building or fine-tuning language models.
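As a concrete illustration of raw text becoming tokens and ids, here is a minimal sketch using the library's Python bindings, training a tiny byte-pair-encoding vocabulary from an in-memory corpus (the corpus and vocabulary size are illustrative):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build an untrained byte-pair-encoding (BPE) tokenizer.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train a tiny vocabulary from an in-memory corpus (illustrative data).
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=200)
corpus = ["tokenizers turn raw text into ids", "raw text becomes tokens"]
tokenizer.train_from_iterator(corpus, trainer)

# Encode: tokens are sub-word strings, ids are their vocabulary indices.
encoding = tokenizer.encode("raw text")
print(encoding.tokens)  # sub-word pieces
print(encoding.ids)     # numerical ids a model consumes
```

In real use you would typically load a pretrained tokenizer (e.g. with `Tokenizer.from_pretrained`) rather than train one from scratch.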

10,520 stars and 1,504,044 monthly downloads. Used by 127 other packages. Actively maintained with 45 commits in the last 30 days. Available on PyPI and npm.

Use this if you need to quickly and efficiently prepare large text datasets for training or using state-of-the-art natural language processing models.

Not ideal if you only need basic text analysis and don't need to prepare input for machine learning models.

natural-language-processing machine-learning-engineering text-pre-processing AI-model-training
Maintenance 20 / 25
Adoption 25 / 25
Maturity 25 / 25
Community 20 / 25


Stars: 10,520
Forks: 1,051
Language: Rust
License: Apache-2.0
Last pushed: Feb 28, 2026
Monthly downloads: 1,504,044
Commits (30d): 45
Dependencies: 14
Reverse dependents: 127

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/huggingface/tokenizers"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
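The same endpoint can be queried from Python. A minimal sketch using the standard library; note that the response field names (`score`, `stars`) are assumptions about the JSON shape, not documented API behavior:

```python
import json
from urllib.request import urlopen

API_URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/huggingface/tokenizers"

def summarize(payload: dict) -> str:
    # The field names here ("score", "stars") are assumptions about
    # the response shape, not documented API behavior.
    return f"score={payload.get('score', '?')}, stars={payload.get('stars', '?')}"

# To actually hit the endpoint (100 requests/day without a key):
#     with urlopen(API_URL) as resp:
#         print(summarize(json.loads(resp.read())))
```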