Tokenization Algorithms NLP Tools
Tools and libraries for implementing tokenization algorithms (BPE, WordPiece, SentencePiece, Unigram, byte-level) across various programming languages. Includes tokenizer implementations, benchmarks, and algorithm variants. Does NOT include downstream NLP tasks, language models, or applications that use tokenizers.
61 tokenization-algorithm tools are tracked. One scores above 70 (Verified tier). The highest-rated is google/sentencepiece at 78/100 with 11,697 stars. One of the top 10 is actively maintained.
Get all 61 projects as JSON (note that the sample request sets limit=20):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=tokenization-algorithms&limit=20"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
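The same request can be made from a script. The following is a minimal Python sketch using only the standard library. The helper names `build_url` and `fetch` are hypothetical, and because the response schema is not described here, the payload is returned as-is rather than parsed into specific fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="tokenization-algorithms", limit=20):
    """Return the request URL with the query string encoded.

    The parameter names mirror the curl example; whether larger
    `limit` values are accepted is an assumption to verify.
    """
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch(url, timeout=10):
    """Send a GET request and return the decoded JSON payload unchanged."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch(build_url())` issues the request shown in the curl command and counts against the daily request quota, so results are worth caching locally.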
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | google/sentencepiece | Unsupervised text tokenizer for Neural Network-based text generation. | | Verified |
| 2 | daac-tools/vibrato | 🎤 vibrato: Viterbi-based accelerated tokenizer | | Established |
| 3 | OpenNMT/Tokenizer | Fast and customizable text tokenization library with BPE and SentencePiece support | | Established |
| 4 | Systemcluster/kitoken | Fast and versatile tokenizer for language models, compatible with... | | Established |
| 5 | soaxelbrooke/python-bpe | Byte Pair Encoding for Python! | | Established |
| 6 | daac-tools/vaporetto | 🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer | | Established |
| 7 | LanguageMachines/ucto | Unicode tokeniser. Ucto tokenizes text files: it separates words from... | | Established |
| 8 | taishi-i/toiro | A tool for comparing tokenizers | | Established |
| 9 | bnosac/sentencepiece | R package for Byte Pair Encoding / Unigram modelling based on Sentencepiece | | Emerging |
| 10 | proycon/python-ucto | This is a Python binding to the tokenizer Ucto. Tokenisation is one of the... | | Emerging |
| 11 | VKCOM/YouTokenToMe | Unsupervised text tokenizer focused on computational efficiency | | Emerging |
| 12 | jorge-menjivar/tekken-rs | Rust implementation of the Mistral Tekken tokenizer | | Emerging |
| 13 | JuliaText/WordTokenizers.jl | High performance tokenizers for natural language processing and other related tasks | | Emerging |
| 14 | ropensci/tokenizers | Fast, Consistent Tokenization of Natural Language Text | | Emerging |
| 15 | dariush-bahrami/character-tokenizer | A character tokenizer for Hugging Face Transformers | | Emerging |
| 16 | arbox/tokenizer | A simple tokenizer in Ruby for NLP tasks. | | Emerging |
| 17 | levyfan/sentencepiece-jni | Java JNI wrapper for SentencePiece: unsupervised text tokenizer for Neural... | | Emerging |
| 18 | Moshe-ship/artok | Arabic Token Tax Calculator - see how much more Arabic costs across LLM tokenizers | | Emerging |
| 19 | JuliaStrings/TinySegmenter.jl | Julia version of TinySegmenter, compact Japanese tokenizer | | Emerging |
| 20 | dustalov/greeb | Greeb is a simple Unicode-aware regexp-based tokenizer. | | Emerging |
| 21 | daac-tools/python-vaporetto | 🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer.... | | Emerging |
| 22 | chengchingwen/BytePairEncoding.jl | Julia implementation of Byte Pair Encoding for NLP | | Emerging |
| 23 | skorani/tokenizer | An open source High level Persian Tokenizer | | Emerging |
| 24 | 10-OASIS-01/BPEtokenizer | This project implements a tokenizer based on the Byte Pair Encoding (BPE)... | | Emerging |
| 25 | zencephalon/Tactful_Tokenizer | Accurate Bayesian sentence tokenizer in Ruby. | | Emerging |
| 26 | thisiscetin/textoken | Simple and customizable text tokenization gem. | | Emerging |
| 27 | gbenson/dom-tokenizers | DOM-aware tokenization for Hugging Face language models | | Emerging |
| 28 | ImadSaddik/DarijaTokenizers | Free to use tokenizers trained on the Darija language. | | Emerging |
| 29 | ztjhz/word-piece-tokenizer | A Lightweight Word Piece Tokenizer | | Experimental |
| 30 | pranav271103/Ultra-Tokenizer | This project implements a state-of-the-art tokenizer from scratch in Python,... | | Experimental |
| 31 | scientist-labs/tokenkit | Fast, Rust-backed word-level tokenization for Ruby. Unlike subword... | | Experimental |
| 32 | daac-tools/python-vibrato | Viterbi-based accelerated tokenizer (Python wrapper) | | Experimental |
| 33 | AddyDelaCruz/swift-tiktoken | 🎉 Implement a lightweight, pure Swift tokenizer for OpenAI's tiktoken,... | | Experimental |
| 34 | savannstm/language-tokenizer | Text tokenizer for linguistic purposes, such as text matching. Supports more... | | Experimental |
| 35 | North-Shore-AI/tiktoken_ex | Pure Elixir TikToken-style byte-level BPE tokenizer (Kimi K2 compatible). | | Experimental |
| 36 | chaablo69/rustbpe | 🔧 Train efficient BPE tokenizers in Rust with simple Python bindings,... | | Experimental |
| 37 | hppRC/saku | A Japanese Sentence Tokenizer written in Rust. | | Experimental |
| 38 | dongjinleekr/beanpiece | A Java binding to Google SentencePiece | | Experimental |
| 39 | designer-coderajay/bpe-tokenizer-scratch | Byte-Pair Encoding tokenizer built from scratch in Python. The same... | | Experimental |
| 40 | michaelnmmeyer/mascara | A natural language tokenizer | | Experimental |
| 41 | tommasofacchin/ft-tokenize | Small C++ tokenizer with support for word-level and BPE tokenization,... | | Experimental |
| 42 | yenniejun/tokenizers-languages | Comparing LLM tokenizers in multiple languages | | Experimental |
| 43 | CarolinElsner/Speech-Tokenization | The tokenisation of spoken text. Received by the Watson STT and sent to the... | | Experimental |
| 44 | SeanLee97/BertWordPieceTokenizer.jl | WordPiece Tokenizer for BERT models. | | Experimental |
| 45 | kiarashrahmani/English-Persian-Tokenizer | This project is a simple tokenizer for text processing that can tokenize... | | Experimental |
| 46 | victor-iyi/wikitext | Train and perform NLP tasks on the wikitext-103 dataset in Rust | | Experimental |
| 47 | hscspring/bytepiece-rs | The Bytepiece Tokenizer Implemented in Rust. | | Experimental |
| 48 | delph-in/repp | Regular Expression Preprocessor | | Experimental |
| 49 | Textualization/RophertaTokenizer | BPE Tokenizer for Ropherta (subclass of GPT3Tokenizer) | | Experimental |
| 50 | shivendrra/shredword-trainer | BPE & Unigram Vocab Training library | | Experimental |
| 51 | UtkarshTheDev/tokenizer | Interactive BPE (Byte-Pair Encoding) tokenizer and CLI utility for... | | Experimental |
| 52 | DolbyUUU/byte_pair_encoding_BPE_subword_tokenization_implementation_python | Byte-Pair Encoding (BPE) (subword-based tokenization) algorithm... | | Experimental |
| 53 | rraghavkaushik/smol-bpe-tokenizer | A lightweight, from-scratch implementation of Byte Pair Encoding (BPE)... | | Experimental |
| 54 | maxim-saplin/tiktoken-bench | Comparing OpenAI tokeniser (tiktoken) performance - stock Python/Rust vs JS/WASM | | Experimental |
| 55 | justinamiller/BPEngine | Pure C# implementation of GPT-style Byte Pair Encoding tokenizer and tiny... | | Experimental |
| 56 | teleprint-me/byte-pair | Byte Pair Encoder (BPE) for Natural Language Processing. | | Experimental |
| 57 | jonasliendl/bpe_tokenizer | ✨ BPE-Tokenizer for university module Foundational Generative Models. | | Experimental |
| 58 | riyad-derguini/End-to-End-NLP-Systems | Modular toolkit for End-to-End NLP: Implementing advanced subword... | | Experimental |
| 59 | edoardosignoroni/hftoks-eval | High Frequency Tokenizer - Evaluation | | Experimental |
| 60 | sulaihasubi/tokenization-spaCy | 🌶 A tokenizer for oil and gas documents @sulaihasubi | | Experimental |
| 61 | jonasknobloch/tokenizers-mbpe | Morphologically biased byte-pair encoding pre-tokenization | | Experimental |