Tokenizer and YouTokenToMe

These are **competitors**: both provide standalone, general-purpose subword tokenization for the same text-preprocessing use case (Tokenizer wraps BPE and SentencePiece models; YouTokenToMe is a speed-focused BPE implementation), with no integration points between them.

| | Tokenizer (Established) | YouTokenToMe (Emerging) |
|---|---|---|
| Overall score | 55 | 46 |
| Maintenance | 6/25 | 0/25 |
| Adoption | 10/25 | 10/25 |
| Maturity | 16/25 | 16/25 |
| Community | 23/25 | 20/25 |
| Stars | 330 | 975 |
| Forks | 80 | 109 |
| Downloads | | |
| Commits (30d) | 0 | 0 |
| Language | C++ | C++ |
| License | MIT | MIT |
| Flags | No package, no dependents | Archived, stale 6 months, no package, no dependents |

About Tokenizer

OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

This is a versatile tool for language processing engineers, machine learning scientists, and data scientists who need to prepare raw text for analysis or model training. It takes raw text as input and breaks it down into individual words or subword units (tokens), which are the building blocks for natural language processing tasks. This allows you to precisely control how text is segmented and processed before it’s fed into your algorithms.

natural-language-processing machine-translation text-preprocessing language-modeling data-preparation
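The subword segmentation described above is typically learned with Byte Pair Encoding. As a rough illustration of how BPE derives subword units from raw text, here is a minimal textbook-style sketch in plain Python; the function names (`learn_bpe`, `merge_pair`) are ours and this is not OpenNMT/Tokenizer's actual API:

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Merge every adjacent occurrence of `pair` in a symbol sequence."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from {word: frequency} counts.

    Words start as character sequences; each round merges the most
    frequent adjacent symbol pair and records it as a rule.
    """
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {tuple(merge_pair(s, best)): f for s, f in vocab.items()}
    return merges
```

For example, `learn_bpe({"low": 5, "lowest": 2}, 2)` first merges `("l", "o")` and then `("lo", "w")`, so frequent character sequences gradually become single subword tokens.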

About YouTokenToMe

VKCOM/YouTokenToMe

Unsupervised text tokenizer focused on computational efficiency

Implements Byte Pair Encoding with O(N) complexity using multithreaded C++ backend and space-as-boundary tokenization (preserving word boundaries via "▁" meta-symbol). Provides Python bindings and CLI tools supporting BPE-dropout regularization and reversible encoding/decoding. Outperforms Hugging Face tokenizers, fastBPE, and SentencePiece by up to 60× on training and inference through efficient parallel processing.
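The space-as-boundary convention mentioned above can be sketched in a few lines: words are prefixed with the "▁" meta-symbol before subword splitting, which makes decoding a simple, lossless string operation. This is an illustrative sketch of the idea, not YouTokenToMe's actual encode/decode functions:

```python
META = "\u2581"  # "▁" word-boundary meta-symbol

def mark_boundaries(text):
    """Prefix each word with the meta-symbol before subword splitting."""
    return [META + word for word in text.split(" ")]

def restore(tokens):
    """Invert tokenization: join subwords, turn meta-symbols back into spaces."""
    return "".join(tokens).replace(META, " ").lstrip(" ")
```

Because every boundary survives inside the tokens themselves, any further splitting of a word into subwords (e.g. `["▁hel", "lo", "▁wor", "ld"]`) still decodes back to the exact original string.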

Scores updated daily from GitHub, PyPI, and npm data.