BPE Tokenizers NLP Tools

There are 12 BPE tokenizer tools tracked. The highest-rated is georg-jung/FastBertTokenizer, with a score of 47/100 and 53 stars.

Get all 12 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=bpe-tokenizers&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000 requests/day.
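The same request can be made from a script. The following is a minimal Python sketch using only the standard library; the function names `build_url` and `fetch_dataset` and the 30-second timeout are illustrative assumptions, not part of the API. The response schema is not documented on this page, so the sketch returns the parsed JSON as-is and does not assume any field names.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="bpe-tokenizers", limit=20):
    """Build the dataset URL with the same query parameters as the curl command."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_dataset(url, timeout=30):
    """Fetch the URL and parse the body as JSON.

    The timeout is an assumed value. The parsed JSON is returned unchanged
    because the response structure is not described on this page.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_dataset(build_url())` issues one request, which counts against the daily limit described above.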

| # | Tool | Description | Score (/100) | Tier |
|---|------|-------------|--------------|------|
| 1 | georg-jung/FastBertTokenizer | Fast and memory-efficient library for WordPiece tokenization as it is used by BERT. | 47 | Emerging |
| 2 | ml-rust/splintr | A high-performance tokenizer (BPE + SentencePiece) built with Rust with... | 45 | Emerging |
| 3 | sanderland/script_tok | Code for the paper "BPE stays on SCRIPT" | 44 | Emerging |
| 4 | ash-01xor/bpe.c | Simple Byte pair Encoding mechanism used for tokenization process. written... | 32 | Emerging |
| 5 | U4RASD/r-bpe | R-BPE: Improving BPE-Tokenizers with Token Reuse | 30 | Emerging |
| 6 | jmaczan/bpe-tokenizer | Byte-Pair Encoding tokenizer for training large language models on huge datasets | 30 | Emerging |
| 7 | vforteli/WordPieceTokenizer | WordPiece tokenizer for dotnet (eg with ML.Net) | 29 | Experimental |
| 8 | deepanprabhu/fastbpe | Java library implementing Byte-Pair Encoding Tokenization | 21 | Experimental |
| 9 | BlackNinjaKR/BPE_BytePairEncoding | An implementation of Byte Pair Encoding (BPE) | 20 | Experimental |
| 10 | jmaczan/bpe.c | High performance Byte-Pair Encoding tokenizer for large language models | 19 | Experimental |
| 11 | swanshiv/varna_marathi_tokenizer | From-scratch Marathi BPE tokenizer with Flask API and web interface for... | 13 | Experimental |
| 12 | burcgokden/Sentencepiece-Tokenizer-Wrapper-for-PLDR-LLM | A framework for building Sentencepiece tokenizer from a dataset | 11 | Experimental |