google/sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

Score: 78 / 100 · Verified

This tool helps machine learning engineers prepare raw text data for training neural network-based text generation models. It takes your raw text (like sentences or documents) and breaks it down into smaller, consistent pieces (subword units) suitable for fixed-vocabulary models. You can then feed these standardized units into your neural network, streamlining the text preparation pipeline for natural language processing tasks.

11,697 stars. Used by 198 other packages. Actively maintained with 12 commits in the last 30 days. Available on PyPI.

Use this if you are building text generation models and need a reliable, language-independent way to break down raw text into a fixed vocabulary of subword units.

Not ideal if you need a traditional, language-specific word tokenizer that relies on explicit word boundaries and pre-tokenization.

natural-language-processing machine-translation text-generation text-preparation neural-networks
Maintenance 17 / 25
Adoption 15 / 25
Maturity 25 / 25
Community 21 / 25


Stars: 11,697
Forks: 1,333
Language: C++
License: Apache-2.0
Last pushed: Mar 01, 2026
Commits (30d): 12
Reverse dependents: 198

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/google/sentencepiece"

Open to everyone: 100 requests/day with no key. A free key raises the limit to 1,000/day.