Text Tokenization Libraries (ML Frameworks)

Language processing tools that convert text into tokens for NLP and ML models. Includes tokenizers across multiple programming languages and implementations. Does NOT include general text processing, speech tokenization, or vectorization/embedding systems.

15 text tokenization libraries are tracked. The highest-rated is SauravP97/hf-tokenizer-visualizer at 32/100 with 2 stars.

Get all 15 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=text-tokenization-libraries&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
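The same request can be made from a script. The following is a minimal Python sketch using only the standard library; it assumes the endpoint returns a JSON body, and it makes no assumption about the fields inside it, since the response schema is not documented here. The helper names `build_url` and `fetch_projects` are hypothetical and chosen for illustration only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="ml-frameworks",
              subcategory="text-tokenization-libraries",
              limit=20):
    """Build the query URL (hypothetical helper, not part of any official client)."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Fetch the URL and parse the body as JSON.

    Assumption: the response is valid JSON. Its structure is not
    documented on this page, so the parsed value is returned as-is.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects(build_url())` performs the same request as the curl command and counts against the unauthenticated 100 requests/day limit, so caching the result locally is a sensible choice for repeated runs.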

| # | Framework | Description | Score (of 100) | Tier |
|---|-----------|-------------|----------------|------|
| 1 | SauravP97/hf-tokenizer-visualizer | Visualize HuggingFace Byte-Pair Encoding (BPE) Tokenizer encoding process | 32 | Emerging |
| 2 | twinnydotdev/toxe | SentencePiece tokenizer for cross-encoders | 30 | Emerging |
| 3 | jawrainey/hfta | Reference implementation: run any huggingface tokenizer in Android (rust). | 26 | Experimental |
| 4 | andikaseptiadi/local-code-model | 🛠️ Build a pure Go GPT-style transformer from scratch to grasp the... | 23 | Experimental |
| 5 | Example69420/splintr | 🚀 Boost text processing speed with Splintr, a high-performance BPE tokenizer... | 22 | Experimental |
| 6 | jrajath94/bpe-tokenizer | BPE and WordPiece tokenization from scratch - clean implementations that... | 22 | Experimental |
| 7 | DHRUVCHARNE/bpe-tokenizer-ts | From-scratch Byte Pair Encoding (BPE) tokenizer in TypeScript using Bun | 21 | Experimental |
| 8 | C4AI/token-counter | Python library + CLI to count dataset tokens with HF tokenizers and export... | 21 | Experimental |
| 9 | Catmono/bpe-tokenizer-ts | 🧠 Build and explore a minimal Byte Pair Encoding tokenizer in TypeScript,... | 21 | Experimental |
| 10 | unixpickle/tweetenc | An auto-encoder for tweets | 16 | Experimental |
| 11 | toprakdeviren/gpu-bpe | GPU-accelerated Byte Pair Encoding in the browser via WebGPU compute shaders | 13 | Experimental |
| 12 | sumony2j/Simple-BPE-Tokenizer | A pure Python implementation of Byte Pair Encoding (BPE) tokenizer. Train on... | 13 | Experimental |
| 13 | mridulsaklani/My_Tokenizer | It is a small model of tokenizer also used by every AI GPT's model to... | 11 | Experimental |
| 14 | b0o/tiktoken-bench | A small Node.js benchmark suite for the tiktoken WASM port. | 11 | Experimental |
| 15 | pointlander/txt | A natural language model based on context mixing | 11 | Experimental |