Mecanik/Modern-Text-Tokenizer

Modern UTF-8 aware C++ tokenizer with vocabulary support, ideal for NLP and transformer models. Header-only and zero-dependency.

/ 100

Experimental

This C++ library converts raw text into tokens and numerical IDs, the input format that modern Natural Language Processing (NLP) models expect. It takes UTF-8 text in any language, including non-ASCII scripts, together with a custom or pre-existing vocabulary file, and outputs sequences of tokens or numerical IDs ready for machine learning models. It is aimed at machine learning engineers and NLP practitioners working in C++ who need to prepare text for models such as BERT or DistilBERT.
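To make the text-to-IDs flow concrete, here is a minimal sketch of the general technique. It is not this library's API: `utf8_len`, `tokenize`, `encode`, and `Vocab` are hypothetical names invented for illustration. It assumes whitespace and ASCII-punctuation splitting with whole-token vocabulary lookup, which is simpler than the WordPiece subword splitting that BERT-style vocabularies normally use.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: byte length of the UTF-8 sequence starting at lead byte b.
// Invalid lead bytes are treated as one-byte sequences so the loop always advances.
inline std::size_t utf8_len(unsigned char b) {
    if (b < 0x80) return 1;          // 0xxxxxxx: ASCII
    if ((b >> 5) == 0x06) return 2;  // 110xxxxx
    if ((b >> 4) == 0x0E) return 3;  // 1110xxxx
    if ((b >> 3) == 0x1E) return 4;  // 11110xxx
    return 1;
}

inline bool is_ascii_space(unsigned char b) {
    return b == ' ' || b == '\t' || b == '\n' || b == '\r';
}

inline bool is_ascii_punct(unsigned char b) {
    return (b >= 33 && b <= 47) || (b >= 58 && b <= 64) ||
           (b >= 91 && b <= 96) || (b >= 123 && b <= 126);
}

// Split on ASCII whitespace, emit each ASCII punctuation mark as its own token,
// and keep multi-byte UTF-8 sequences intact inside words.
inline std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    auto flush = [&] {
        if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    };
    for (std::size_t i = 0; i < text.size();) {
        const unsigned char b = static_cast<unsigned char>(text[i]);
        std::size_t n = utf8_len(b);
        if (i + n > text.size()) n = text.size() - i;  // truncated trailing sequence
        if (n == 1 && is_ascii_space(b)) {
            flush();
        } else if (n == 1 && is_ascii_punct(b)) {
            flush();
            tokens.push_back(std::string(1, text[i]));
        } else {
            current.append(text, i, n);  // copy the whole code point
        }
        i += n;
    }
    flush();
    return tokens;
}

// Assumed vocabulary shape: token string -> integer ID.
using Vocab = std::unordered_map<std::string, int>;

// Map each token to its ID; tokens missing from the vocabulary get unk_id.
inline std::vector<int> encode(const std::vector<std::string>& tokens,
                               const Vocab& vocab, int unk_id) {
    std::vector<int> ids;
    ids.reserve(tokens.size());
    for (const auto& t : tokens) {
        const auto it = vocab.find(t);
        ids.push_back(it != vocab.end() ? it->second : unk_id);
    }
    return ids;
}
```

Calling `tokenize` on a UTF-8 string and passing the result to `encode` with a vocabulary map and an unknown-token ID yields the integer sequence a model would consume. A production tokenizer would also need Unicode-aware whitespace and punctuation classes, case handling, and subword fallback, all of which this sketch leaves out.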

No commits in the last 6 months.

Use this if you are a C++ developer or ML engineer needing a fast, dependency-free text tokenizer for NLP model pre-processing.

Not ideal if you primarily work in Python, prefer high-level APIs like HuggingFace Tokenizers, or need a full-fledged NLP pipeline beyond just tokenization.

natural-language-processing machine-learning-engineering text-pre-processing computational-linguistics

Stale 6m, No Package, No Dependents

Maintenance 2 / 25

Adoption 5 / 25

Maturity 15 / 25

Community 0 / 25



Language: C++

License: MIT

Higher-rated alternatives

guillaume-be/rust-tokenizers

Rust-tokenizer offers high-performance tokenizers for modern language models, including...

sugarme/tokenizer

NLP tokenizers written in Go language

elixir-nx/tokenizers

Elixir bindings for 🤗 Tokenizers

openscilab/tocount

ToCount: Lightweight Token Estimator

reinfer/blingfire-rs

Rust wrapper for the BlingFire tokenization library
