google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
This tool helps machine learning engineers prepare raw text for training neural text generation models. It takes raw text (sentences or documents) and segments it into smaller, consistent subword units suitable for fixed-vocabulary models. These standardized units can then be fed directly into a neural network, streamlining the text preparation pipeline for natural language processing tasks.
11,697 stars. Used by 198 other packages. Actively maintained with 12 commits in the last 30 days. Available on PyPI.
Use this if you are building text generation models and need a reliable, language-independent way to break down raw text into a fixed vocabulary of subword units.
Not ideal if you need a traditional, language-specific word tokenizer that relies on explicit word boundaries and pre-tokenization.
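To make "a fixed vocabulary of subword units" concrete: one of the segmentation algorithms SentencePiece supports is byte-pair encoding (BPE), which repeatedly merges the most frequent adjacent symbol pair in the corpus. The sketch below is a minimal, illustrative stdlib-Python version of BPE merge learning, not SentencePiece's actual implementation (which is C++ and also supports the unigram language model):

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across all words (each word is a tuple of symbols).
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    # Ties resolve to the first pair seen, so the result is deterministic.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    # Replace every occurrence of the pair with one merged symbol.
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    # Start from single characters; repeatedly merge the most frequent pair.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges

print(learn_bpe("low lower lowest low low", 3))
# → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The learned merge list is the vocabulary-building rule set: applying the merges in order to new text yields the same subword units regardless of language, which is the "language-independent" property the description refers to. (SentencePiece additionally treats whitespace as an ordinary symbol, so no language-specific pre-tokenization is needed.)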
Stars
11,697
Forks
1,333
Language
C++
License
Apache-2.0
Category
Last pushed
Mar 01, 2026
Commits (30d)
12
Reverse dependents
198
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/google/sentencepiece"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related tools
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
soaxelbrooke/python-bpe
Byte Pair Encoding for Python!