Mecanik/Modern-Text-Tokenizer
Modern UTF-8 aware C++ tokenizer with vocabulary support, ideal for NLP and transformer models. Header-only and zero-dependency.
This C++ library converts raw text into tokens and numerical IDs, the input format required by modern Natural Language Processing (NLP) models. It takes UTF-8 text in any language, including non-ASCII scripts, together with a custom or pre-existing vocabulary file, and outputs sequences of tokens or numerical IDs ready for machine learning models. Machine learning engineers and NLP practitioners working in C++ can use it to prepare text data for models such as BERT or DistilBERT.
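To make the text-to-IDs flow described above concrete, here is a minimal C++17 sketch of UTF-8 aware tokenization followed by a vocabulary lookup. It is not the library's API: the names utf8_length, tokenize, encode, Vocab, and unk_id are hypothetical and chosen for illustration only. The sketch assumes tokens are separated by ASCII whitespace, that each ASCII punctuation mark becomes its own token, and that unknown tokens map to a caller-supplied ID. It does no lowercasing and no subword (WordPiece) splitting, both of which a BERT-style tokenizer would normally add.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical vocabulary type: token text -> numerical ID.
using Vocab = std::unordered_map<std::string, int>;

// Byte length of the UTF-8 sequence that starts with lead byte `b`.
// Invalid lead bytes count as length 1 so the scan always advances.
inline std::size_t utf8_length(unsigned char b) {
    if (b < 0x80) return 1;          // 0xxxxxxx (ASCII)
    if ((b >> 5) == 0x06) return 2;  // 110xxxxx
    if ((b >> 4) == 0x0E) return 3;  // 1110xxxx
    if ((b >> 3) == 0x1E) return 4;  // 11110xxx
    return 1;
}

inline bool is_ascii_space(unsigned char b) {
    return b == ' ' || b == '\t' || b == '\n' || b == '\r';
}

inline bool is_ascii_punct(unsigned char b) {
    return (b >= 33 && b <= 47) || (b >= 58 && b <= 64) ||
           (b >= 91 && b <= 96) || (b >= 123 && b <= 126);
}

// Splits on ASCII whitespace, emits each ASCII punctuation mark as its own
// token, and never cuts a multi-byte code point in half.
inline std::vector<std::string> tokenize(std::string_view text) {
    std::vector<std::string> tokens;
    std::string current;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char b = static_cast<unsigned char>(text[i]);
        std::size_t n = utf8_length(b);
        if (i + n > text.size()) n = text.size() - i;  // truncated tail
        if (n == 1 && (is_ascii_space(b) || is_ascii_punct(b))) {
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
            if (is_ascii_punct(b)) {
                tokens.push_back(std::string(1, static_cast<char>(b)));
            }
        } else {
            current.append(text.substr(i, n));  // copy the whole code point
        }
        i += n;
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Maps each token to its vocabulary ID, falling back to `unk_id`.
inline std::vector<int> encode(std::string_view text, const Vocab& vocab,
                               int unk_id) {
    std::vector<int> ids;
    for (const std::string& token : tokenize(text)) {
        const auto it = vocab.find(token);
        ids.push_back(it != vocab.end() ? it->second : unk_id);
    }
    return ids;
}
```

Calling encode with a vocabulary loaded from a file and the ID of the unknown token yields one ID per token. Advancing by whole code points is what keeps accented or non-Latin words intact as single tokens.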
No commits in the last 6 months.
Use this if you are a C++ developer or ML engineer needing a fast, dependency-free text tokenizer for NLP model pre-processing.
Not ideal if you primarily work in Python, prefer high-level APIs like HuggingFace Tokenizers, or need a full-fledged NLP pipeline beyond just tokenization.
Stars: 12
Forks: —
Language: C++
License: MIT
Category:
Last pushed: Aug 07, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/Mecanik/Modern-Text-Tokenizer"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
guillaume-be/rust-tokenizers
Rust-tokenizer offers high-performance tokenizers for modern language models, including...
sugarme/tokenizer
NLP tokenizers written in Go language
elixir-nx/tokenizers
Elixir bindings for 🤗 Tokenizers
openscilab/tocount
ToCount: Lightweight Token Estimator
reinfer/blingfire-rs
Rust wrapper for the BlingFire tokenization library