Tokenizer Libraries Transformer Models

Libraries and implementations for tokenization across programming languages and frameworks. Includes tokenizer training, conversion, alignment, and optimization. Does NOT include higher-level NLP tasks, token classification, or downstream language model applications.

There are 18 tokenizer libraries tracked. 1 scores above 70 (verified tier). The highest-rated is huggingface/tokenizers at 90/100 with 10,520 stars and 1,504,044 monthly downloads. 1 of the top 10 is actively maintained.

Get all 18 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=tokenizer-libraries&limit=20"

Open to everyone: 100 requests/day with no key needed, or 1,000 requests/day with a free key.
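The same request can be made from a script. The following is a minimal Python sketch that uses only the standard library and assumes the endpoint and query parameters shown in the curl command above. The helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the returned JSON is not documented on this page, so the sketch returns the parsed payload without assuming any field names.

```python
import json
import urllib.parse
import urllib.request

# Endpoint copied from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="transformers", subcategory="tokenizer-libraries", limit=20):
    """Build the query URL; defaults mirror the curl example."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(timeout=30.0):
    """Fetch and parse the JSON payload.

    The response schema is an unknown here, so the parsed object is
    returned as-is rather than indexed by guessed keys.
    """
    with urllib.request.urlopen(build_url(), timeout=timeout) as resp:
        return json.load(resp)


# Example (performs a network request, counts against the daily quota):
# data = fetch_projects()
# print(type(data))
```

Inspecting the returned object once (for example by printing its type and top-level keys) is a reasonable first step before writing code that depends on specific fields.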

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | huggingface/tokenizers | 💥 Fast State-of-the-Art Tokenizers optimized for Research and Production | 90 | Verified |
| 2 | megagonlabs/ginza-transformers | Use custom tokenizers in spacy-transformers | 46 | Emerging |
| 3 | Kaleidophon/token2index | A lightweight but powerful library to build token indices for NLP tasks,... | 45 | Emerging |
| 4 | Hugging-Face-Supporter/tftokenizers | Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels | 45 | Emerging |
| 5 | NVIDIA/Cosmos-Tokenizer | A suite of image and video neural tokenizers | 42 | Emerging |
| 6 | wangcongcong123/ttt | A package for fine-tuning Transformers with TPUs, written in Tensorflow2.0+ | 35 | Emerging |
| 7 | nlpodyssey/gotokenizers | Go implementation of today's most used tokenizers | 35 | Emerging |
| 8 | Beomi/megatronlm_dataset_autotokenizer | Megatron-LM/GPT-NeoX compatible Text Encoder with 🤗Transformers AutoTokenizer. | 30 | Emerging |
| 9 | technion-cs-nlp/BiologicalTokenizers | Effect of tokenization on transformers for biological sequence | 28 | Experimental |
| 10 | dnbaker/bioseq | Tokenizers and Machine Learning Models for biological sequence data | 28 | Experimental |
| 11 | symanto-research/merge-tokenizers | Package to align tokens from different tokenizations. | 22 | Experimental |
| 12 | mazebrr/language-tokenizer | 🧩 Tokenize text efficiently across multiple languages using our robust... | 21 | Experimental |
| 13 | muna-ai/libtokenizers | C/C++ bindings from Huggingface Tokenizers. | 21 | Experimental |
| 14 | Mecanik/Tiny-BPE-Trainer | Lightweight, header-only Byte Pair Encoding (BPE) trainer in modern C++17.... | 20 | Experimental |
| 15 | hikmatazimzade/azerbaijani-tokenizer | High-Performance Azerbaijani Tokenizers (30% fewer tokens, 40% faster than... | 20 | Experimental |
| 16 | JaydenTeoh/beyond-next-token-prediction | Curated collection of research on the limitations of next-token prediction... | 19 | Experimental |
| 17 | Systemcluster/tokenizer | General tokenizer library for the Web and Node. Supports Huggingface and... | 18 | Experimental |
| 18 | pegasus-lynx/mwe-bpe | BPE beyond Word Boundary: How NOT to use Multi‑Word Expressions in NMT | 11 | Experimental |