Tokenizer and YouTokenToMe
These are **competitors** — both provide standalone, general-purpose subword tokenization for the same use case of preprocessing text, with no integration points between them. Both implement Byte Pair Encoding (an unsupervised method); Tokenizer additionally wraps SentencePiece.
About Tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
This is a versatile tool for natural language processing engineers, machine learning scientists, and data scientists who need to prepare raw text for analysis or model training. It takes raw text as input and breaks it into words or subword units (tokens), the building blocks for natural language processing tasks, giving you precise control over how text is segmented before it is fed into your algorithms.
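The subword segmentation described above can be sketched in a few lines. This is a toy illustration of how learned BPE merges split a word into subword units, not Tokenizer's actual API; the merge table here is hypothetical, not a trained vocabulary.

```python
# Toy sketch of BPE subword segmentation.
# The merge table is hypothetical, not a real trained vocabulary.

def bpe_segment(word, merges):
    """Greedily apply learned BPE merges, in learned order, to one word."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i : i + 2] = [a + b]  # fuse the adjacent pair
            else:
                i += 1
    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_segment("lower", merges))   # → ['low', 'er']
print(bpe_segment("lowest", merges))  # → ['low', 'e', 's', 't']
```

Frequent character pairs collapse into single tokens, so common stems and suffixes ("low", "er") emerge without any supervised word list.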
About YouTokenToMe
VKCOM/YouTokenToMe
Unsupervised text tokenizer focused on computational efficiency
Implements Byte Pair Encoding with O(N) complexity using a multithreaded C++ backend and space-as-boundary tokenization (preserving word boundaries via the "▁" meta-symbol). Provides Python bindings and CLI tools supporting BPE-dropout regularization and reversible encoding/decoding. The authors report speedups of up to 60× over Hugging Face Tokenizers, fastBPE, and SentencePiece on training and inference, attributed to efficient parallel processing.
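The "▁" meta-symbol mentioned above is what makes encoding reversible: spaces are folded into tokens rather than discarded. A minimal sketch of the idea, not YouTokenToMe's exact convention or API:

```python
# Sketch of space-as-boundary encoding with the "▁" meta-symbol.
# Simplified for illustration; YouTokenToMe's real normalization differs.

META = "\u2581"  # the "▁" lower-one-eighth-block character

def mark_boundaries(text):
    """Fold each word-leading space into the next token as ▁."""
    words = text.split(" ")
    return [META + w if i > 0 else w for i, w in enumerate(words)]

def restore_text(tokens):
    """Invert mark_boundaries: every ▁ becomes a space again."""
    return "".join(t.replace(META, " ") for t in tokens)

text = "fast unsupervised tokenization"
tokens = mark_boundaries(text)
print(tokens)                        # → ['fast', '▁unsupervised', '▁tokenization']
print(restore_text(tokens) == text)  # → True (round trip is lossless)
```

Because the boundary marker travels with the subwords through any further BPE merges, decoding remains an exact inverse of encoding.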