sentencepiece and YouTokenToMe
These are competing implementations of unsupervised subword tokenization. SentencePiece supports both unigram language-model and BPE segmentation and dominates adoption in production NLP pipelines; YouTokenToMe implements BPE only and targets use cases that prioritize raw speed over ecosystem integration.
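To ground the comparison, here is a toy sketch of the core BPE idea that YouTokenToMe implements (and that SentencePiece also offers as one of its model types): repeatedly merge the most frequent adjacent symbol pair in the corpus. This is a simplified illustration, not either library's actual implementation; tie-breaking and data structures differ in real tokenizers.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word initially split into characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
merges = []
for _ in range(3):  # learn 3 merge rules
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The learned merge list is the entire "model": encoding new text just replays these merges in order, which is what makes BPE fast and deterministic at inference time.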
About sentencepiece
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
This tool helps machine learning engineers prepare raw text for training neural text-generation models. It takes raw text (sentences or documents) and segments it into smaller, consistent subword units suitable for fixed-vocabulary models. These standardized units can then be fed directly into a neural network, streamlining the text-preparation pipeline for natural language processing tasks.
About YouTokenToMe
VKCOM/YouTokenToMe
Unsupervised text tokenizer focused on computational efficiency
Implements Byte Pair Encoding with O(N) training complexity using a multithreaded C++ backend and space-as-boundary tokenization (word boundaries are preserved via the "▁" meta-symbol). Provides Python bindings and CLI tools supporting BPE-dropout regularization and fully reversible encoding/decoding. The project reports speedups of up to 60× over Hugging Face tokenizers, fastBPE, and SentencePiece on training and inference, attributed to efficient parallel processing.
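The "▁" meta-symbol mentioned above is what makes encoding reversible: spaces are turned into an explicit symbol before tokenization, so decoding is plain concatenation. A toy pure-Python sketch of just that space-handling step (real tokenizers additionally split each marked word into subwords, and handle edge cases this toy version ignores):

```python
META = "\u2581"  # "▁", the meta-symbol used to mark word boundaries

def encode_spaces(text):
    """Make whitespace explicit: prefix every word with the meta-symbol."""
    return [META + word for word in text.split(" ")]

def decode_spaces(tokens):
    """Invert exactly: concatenate, then turn each meta-symbol back into a space."""
    return "".join(tokens).replace(META, " ").lstrip(" ")

tokens = encode_spaces("the quick fox")
print(tokens)                                     # ['▁the', '▁quick', '▁fox']
print(decode_spaces(tokens) == "the quick fox")   # True
```

Because no information is thrown away, `decode(encode(text)) == text` holds for any run of tokens, which is the "reversible encoding/decoding" property both libraries advertise.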