Tokenization Algorithms NLP Tools

Tools and libraries for implementing tokenization algorithms (BPE, WordPiece, SentencePiece, Unigram, byte-level) across various programming languages. Includes tokenizer implementations, benchmarks, and algorithm variants. Does NOT include downstream NLP tasks, language models, or applications that use tokenizers.

There are 61 tokenization algorithm tools tracked. 1 scores above 70 (Verified tier). The highest-rated is google/sentencepiece at 78/100 with 11,697 stars. 1 of the top 10 is actively maintained.

Get all 61 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=tokenization-algorithms&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
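The same request can be sketched in Python using only the standard library. The endpoint and the three query parameters are copied from the curl command above; the function names `build_url` and `fetch` are hypothetical, and the shape of the returned JSON is not documented here, so the sketch returns the parsed payload without assuming any field names. Note that the URL above uses limit=20; whether a larger limit returns all 61 projects in a single response is an assumption to verify against the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="tokenization-algorithms", limit=20):
    """Build the request URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch(url, timeout=10):
    """Send a GET request and parse the response body as JSON.

    The response schema is not described in the source, so the parsed
    payload is returned as-is for the caller to inspect.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example (performs a network request and counts against the daily quota):
#   payload = fetch(build_url())
#   print(json.dumps(payload, indent=2)[:500])
```

Keeping URL construction separate from the network call makes it easy to check the query string without spending any of the 100 keyless requests per day.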

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | google/sentencepiece | Unsupervised text tokenizer for Neural Network-based text generation. | 78 | Verified |
| 2 | daac-tools/vibrato | 🎤 vibrato: Viterbi-based accelerated tokenizer | 57 | Established |
| 3 | OpenNMT/Tokenizer | Fast and customizable text tokenization library with BPE and SentencePiece support | 55 | Established |
| 4 | Systemcluster/kitoken | Fast and versatile tokenizer for language models, compatible with... | 55 | Established |
| 5 | soaxelbrooke/python-bpe | Byte Pair Encoding for Python! | 54 | Established |
| 6 | daac-tools/vaporetto | 🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer | 54 | Established |
| 7 | LanguageMachines/ucto | Unicode tokeniser. Ucto tokenizes text files: it separates words from... | 53 | Established |
| 8 | taishi-i/toiro | A tool for comparing tokenizers | 52 | Established |
| 9 | bnosac/sentencepiece | R package for Byte Pair Encoding / Unigram modelling based on Sentencepiece | 49 | Emerging |
| 10 | proycon/python-ucto | This is a Python binding to the tokenizer Ucto. Tokenisation is one of the... | 47 | Emerging |
| 11 | VKCOM/YouTokenToMe | Unsupervised text tokenizer focused on computational efficiency | 46 | Emerging |
| 12 | jorge-menjivar/tekken-rs | Rust implementation of the Mistral Tekken tokenizer | 46 | Emerging |
| 13 | JuliaText/WordTokenizers.jl | High performance tokenizers for natural language processing and other related tasks | 45 | Emerging |
| 14 | ropensci/tokenizers | Fast, Consistent Tokenization of Natural Language Text | 42 | Emerging |
| 15 | dariush-bahrami/character-tokenizer | A character tokenizer for Hugging Face Transformers | 41 | Emerging |
| 16 | arbox/tokenizer | A simple tokenizer in Ruby for NLP tasks. | 41 | Emerging |
| 17 | levyfan/sentencepiece-jni | Java JNI wrapper for SentencePiece: unsupervised text tokenizer for Neural... | 41 | Emerging |
| 18 | Moshe-ship/artok | Arabic Token Tax Calculator - see how much more Arabic costs across LLM tokenizers | 39 | Emerging |
| 19 | JuliaStrings/TinySegmenter.jl | Julia version of TinySegmenter, compact Japanese tokenizer | 39 | Emerging |
| 20 | dustalov/greeb | Greeb is a simple Unicode-aware regexp-based tokenizer. | 38 | Emerging |
| 21 | daac-tools/python-vaporetto | 🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer.... | 37 | Emerging |
| 22 | chengchingwen/BytePairEncoding.jl | Julia implementation of Byte Pair Encoding for NLP | 35 | Emerging |
| 23 | skorani/tokenizer | An open source High level Persian Tokenizer | 34 | Emerging |
| 24 | 10-OASIS-01/BPEtokenizer | This project implements a tokenizer based on the Byte Pair Encoding (BPE)... | 33 | Emerging |
| 25 | zencephalon/Tactful_Tokenizer | Accurate Bayesian sentence tokenizer in Ruby. | 33 | Emerging |
| 26 | thisiscetin/textoken | Simple and customizable text tokenization gem. | 32 | Emerging |
| 27 | gbenson/dom-tokenizers | DOM-aware tokenization for Hugging Face language models | 32 | Emerging |
| 28 | ImadSaddik/DarijaTokenizers | Free to use tokenizers trained on the Darija language. | 31 | Emerging |
| 29 | ztjhz/word-piece-tokenizer | A Lightweight Word Piece Tokenizer | 29 | Experimental |
| 30 | pranav271103/Ultra-Tokenizer | This project implements a state-of-the-art tokenizer from scratch in Python,... | 28 | Experimental |
| 31 | scientist-labs/tokenkit | Fast, Rust-backed word-level tokenization for Ruby. Unlike subword... | 28 | Experimental |
| 32 | daac-tools/python-vibrato | Viterbi-based accelerated tokenizer (Python wrapper) | 27 | Experimental |
| 33 | AddyDelaCruz/swift-tiktoken | 🎉 Implement a lightweight, pure Swift tokenizer for OpenAI's tiktoken,... | 27 | Experimental |
| 34 | savannstm/language-tokenizer | Text tokenizer for linguistic purposes, such as text matching. Supports more... | 26 | Experimental |
| 35 | North-Shore-AI/tiktoken_ex | Pure Elixir TikToken-style byte-level BPE tokenizer (Kimi K2 compatible). | 24 | Experimental |
| 36 | chaablo69/rustbpe | 🔧 Train efficient BPE tokenizers in Rust with simple Python bindings,... | 22 | Experimental |
| 37 | hppRC/saku | A Japanese Sentence Tokenizer written in Rust. | 21 | Experimental |
| 38 | dongjinleekr/beanpiece | A Java binding to Google SentencePiece | 20 | Experimental |
| 39 | designer-coderajay/bpe-tokenizer-scratch | Byte-Pair Encoding tokenizer built from scratch in Python. The same... | 20 | Experimental |
| 40 | michaelnmmeyer/mascara | A natural language tokenizer | 20 | Experimental |
| 41 | tommasofacchin/ft-tokenize | Small C++ tokenizer with support for word-level and BPE tokenization,... | 20 | Experimental |
| 42 | yenniejun/tokenizers-languages | Comparing LLM tokenizers in multiple languages | 20 | Experimental |
| 43 | CarolinElsner/Speech-Tokenization | The tokenisation of spoken text. Received by the Watson STT and sent to the... | 19 | Experimental |
| 44 | SeanLee97/BertWordPieceTokenizer.jl | WordPiece Tokenizer for BERT models. | 19 | Experimental |
| 45 | kiarashrahmani/English-Persian-Tokenizer | This project is a simple tokenizer for text processing that can tokenize... | 19 | Experimental |
| 46 | victor-iyi/wikitext | Train and perform NLP tasks on the wikitext-103 dataset in Rust | 17 | Experimental |
| 47 | hscspring/bytepiece-rs | The Bytepiece Tokenizer Implemented in Rust. | 17 | Experimental |
| 48 | delph-in/repp | Regular Expression Preprocessor | 17 | Experimental |
| 49 | Textualization/RophertaTokenizer | BPE Tokenizer for Ropherta (subclass of GPT3Tokenizer) | 17 | Experimental |
| 50 | shivendrra/shredword-trainer | BPE & Unigram Vocab Training library | 15 | Experimental |
| 51 | UtkarshTheDev/tokenizer | Interactive BPE (Byte-Pair Encoding) tokenizer and CLI utility for... | 15 | Experimental |
| 52 | DolbyUUU/byte_pair_encoding_BPE_subword_tokenization_implementation_python | Byte-Pair Encoding (BPE) (subword-based tokenization) algorithm... | 14 | Experimental |
| 53 | rraghavkaushik/smol-bpe-tokenizer | A lightweight, from-scratch implementation of Byte Pair Encoding (BPE)... | 12 | Experimental |
| 54 | maxim-saplin/tiktoken-bench | Comparing OpenAI tokeniser (tiktoken) performance - stock Python/Rust vs JS/WASM | 11 | Experimental |
| 55 | justinamiller/BPEngine | Pure C# implementation of GPT-style Byte Pair Encoding tokenizer and tiny... | 11 | Experimental |
| 56 | teleprint-me/byte-pair | Byte Pair Encoder (BPE) for Natural Language Processing. | 11 | Experimental |
| 57 | jonasliendl/bpe_tokenizer | ✨ BPE-Tokenizer for university module Foundational Generative Models. | 11 | Experimental |
| 58 | riyad-derguini/End-to-End-NLP-Systems | Modular toolkit for End-to-End NLP: Implementing advanced subword... | 11 | Experimental |
| 59 | edoardosignoroni/hftoks-eval | High Frequency Tokenizer - Evaluation | 11 | Experimental |
| 60 | sulaihasubi/tokenization-spaCy | 🌶 A tokenizer for oil and gas documents @sulaihasubi | 10 | Experimental |
| 61 | jonasknobloch/tokenizers-mbpe | Morphologically biased byte-pair encoding pre-tokenization | 10 | Experimental |