huggingface/tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
When working with large volumes of text for natural language processing, this tool converts raw text into a format machine learning models can consume. It takes raw text documents as input and produces a vocabulary and tokens: numerical representations of words or subword units. This step is essential for AI researchers and machine learning engineers building or fine-tuning language models.
10,520 stars and 1,504,044 monthly downloads. Used by 127 other packages. Actively maintained with 45 commits in the last 30 days. Available on PyPI and npm.
Use this if you need to quickly and efficiently prepare large text datasets for training or using state-of-the-art natural language processing models.
Not ideal if your primary goal is basic text analysis without the need for advanced machine learning model input preparation.
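The conversion described above, from raw text to a vocabulary to integer IDs, can be sketched in plain Python. This is a conceptual illustration of what a tokenizer does, not the huggingface/tokenizers API itself, which implements subword algorithms such as BPE, WordPiece, and Unigram in Rust; the function names here are hypothetical.

```python
def build_vocab(corpus):
    """Assign each unique whitespace-separated word an integer ID.

    ID 0 is reserved for the unknown-word token, as many real
    tokenizers do with a special [UNK] symbol.
    """
    vocab = {"[UNK]": 0}
    for line in corpus:
        for word in line.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Map text to a list of IDs, falling back to [UNK] for unseen words."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]


corpus = ["the quick brown fox", "the lazy dog"]
vocab = build_vocab(corpus)
print(encode("the quick dog", vocab))   # → [1, 2, 6]  (all words known)
print(encode("the purple fox", vocab))  # → [1, 0, 4]  ("purple" maps to [UNK] = 0)
```

Real subword tokenizers go further: instead of whole words, they learn frequent character sequences, so unseen words decompose into known pieces rather than collapsing to a single unknown ID.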
Stars: 10,520
Forks: 1,051
Language: Rust
License: Apache-2.0
Category:
Last pushed: Feb 28, 2026
Monthly downloads: 1,504,044
Commits (30d): 45
Dependencies: 14
Reverse dependents: 127
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/huggingface/tokenizers"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
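The same endpoint can be called from Python using only the standard library. The URL pattern comes from the curl example above; the response schema is not documented on this page, so the sketch simply decodes whatever JSON comes back, and the helper names (`quality_url`, `fetch_quality`) are hypothetical.

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, repo):
    """Build the quality-API URL, following the curl example above."""
    return f"{API_BASE}/{category}/{repo}"


def fetch_quality(category, repo):
    """Fetch and decode the JSON response (schema not documented here)."""
    with urllib.request.urlopen(quality_url(category, repo)) as resp:
        return json.load(resp)


# Network call, subject to the 100 requests/day anonymous limit:
# data = fetch_quality("transformers", "huggingface/tokenizers")
# print(json.dumps(data, indent=2))
print(quality_url("transformers", "huggingface/tokenizers"))
```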
Related models
megagonlabs/ginza-transformers
Use custom tokenizers in spacy-transformers
Kaleidophon/token2index
A lightweight but powerful library to build token indices for NLP tasks, compatible with major...
Hugging-Face-Supporter/tftokenizers
Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels
NVIDIA/Cosmos-Tokenizer
A suite of image and video neural tokenizers
wangcongcong123/ttt
A package for fine-tuning Transformers with TPUs, written in Tensorflow2.0+