Mecanik/Modern-Text-Tokenizer

Modern UTF-8 aware C++ tokenizer with vocabulary support, ideal for NLP and transformer models. Header-only and zero-dependency.

Score: 22 / 100 (Experimental)

This C++ library converts raw text into structured tokens and numerical IDs, the first step in most modern Natural Language Processing (NLP) applications. It takes UTF-8 text in any language, including non-ASCII scripts, together with a custom or pre-existing vocabulary file, and outputs sequences of tokens or numerical IDs ready for machine learning models. Machine learning engineers and NLP practitioners working in C++ will find it useful for preparing text data for models like BERT or DistilBERT.
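To make the text-to-IDs step concrete, here is a minimal, self-contained sketch of vocabulary lookup in plain C++. It does not use this library's API: `build_vocab`, `split_whitespace`, and `encode` are hypothetical helper names invented for illustration, and the sketch assumes a BERT-style vocabulary where each token's ID is its zero-based position and `[UNK]` stands in for unknown words. A real tokenizer would also handle punctuation, casing, and subword splitting, which are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not the library's API): map each vocabulary entry
// to its zero-based position, as in a BERT-style vocab.txt file.
std::unordered_map<std::string, int> build_vocab(const std::vector<std::string>& entries) {
    std::unordered_map<std::string, int> vocab;
    for (std::size_t i = 0; i < entries.size(); ++i) {
        vocab.emplace(entries[i], static_cast<int>(i));  // first occurrence wins
    }
    return vocab;
}

// Hypothetical helper: split on ASCII whitespace only. Every byte of a
// multi-byte UTF-8 sequence is >= 0x80, so non-ASCII characters are never
// cut in the middle.
std::vector<std::string> split_whitespace(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

// Hypothetical helper: convert text to IDs, falling back to the ID of the
// unknown token for words that are not in the vocabulary.
std::vector<int> encode(const std::string& text,
                        const std::unordered_map<std::string, int>& vocab,
                        const std::string& unk_token = "[UNK]") {
    const auto unk_it = vocab.find(unk_token);
    assert(unk_it != vocab.end() && "vocabulary must contain the unknown token");
    std::vector<int> ids;
    for (const std::string& token : split_whitespace(text)) {
        const auto it = vocab.find(token);
        ids.push_back(it != vocab.end() ? it->second : unk_it->second);
    }
    return ids;
}
```

Under these assumptions, calling `encode("hello there world", build_vocab({"[UNK]", "hello", "world"}))` maps `hello` and `world` to their positions in the list and `there` to the ID of `[UNK]`.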

No commits in the last 6 months.

Use this if you are a C++ developer or ML engineer needing a fast, dependency-free text tokenizer for NLP model pre-processing.

Not ideal if you primarily work in Python, prefer high-level APIs like HuggingFace Tokenizers, or need a full-fledged NLP pipeline beyond just tokenization.

Tags: natural-language-processing, machine-learning-engineering, text-pre-processing, computational-linguistics
Stale 6m · No Package · No Dependents
Maintenance 2 / 25
Adoption 5 / 25
Maturity 15 / 25
Community 0 / 25

Stars: 12
Forks: (no value shown)
Language: C++
License: MIT
Last pushed: Aug 07, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/Mecanik/Modern-Text-Tokenizer"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.