google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
This tool helps machine learning engineers prepare raw text for training neural text generation models. It takes raw text (sentences or documents) and segments it into smaller, consistent subword units suitable for fixed-vocabulary models. These standardized units can then be fed directly into a neural network, streamlining the text preparation pipeline for natural language processing tasks.
11,697 stars. Used by 198 other packages. Actively maintained with 12 commits in the last 30 days. Available on PyPI.
Use this if you are building text generation models and need a reliable, language-independent way to break down raw text into a fixed vocabulary of subword units.
Not ideal if you need a traditional, language-specific word tokenizer that relies on explicit word boundaries and pre-tokenization.
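To make "a fixed vocabulary of subword units" concrete: one of the segmentation algorithms SentencePiece supports is byte-pair encoding (BPE), which repeatedly merges the most frequent adjacent symbol pair in the corpus. The sketch below is a minimal, illustrative stdlib-Python version of BPE merge learning, not SentencePiece's actual implementation (which is C++ and also supports the unigram language model):

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across all words (each word is a tuple of symbols).
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    # Ties resolve to the first pair seen, so the result is deterministic.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    # Replace every occurrence of the pair with one merged symbol.
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    # Start from single characters; repeatedly merge the most frequent pair.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges

print(learn_bpe("low lower lowest low low", 3))
# → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The learned merge list is the vocabulary-building rule set: applying the merges in order to new text yields the same subword units regardless of language, which is the "language-independent" property the description refers to. (SentencePiece additionally treats whitespace as an ordinary symbol, so no language-specific pre-tokenization is needed.)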
Stars
11,697
Forks
1,333
Language
C++
License
Apache-2.0
Category
Last pushed
Mar 01, 2026
Commits (30d)
12
Reverse dependents
198
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/google/sentencepiece"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related tools
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
soaxelbrooke/python-bpe
Byte Pair Encoding for Python!