arbox/tokenizer

A simple tokenizer in Ruby for NLP tasks.

/ 100

Emerging

This tool helps linguists and language technology practitioners split written text into sentences and words for analysis. It takes raw German, English, or Dutch text and returns a list of tokens (words and punctuation marks) that can be passed on to further linguistic processing. It is suited to text preparation in natural language processing and computational linguistics.
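To make the idea of tokenization concrete, here is a minimal sketch in plain Ruby. It is not the implementation or the API of arbox/tokenizer: the helper names `simple_tokenize` and `split_sentences` are hypothetical, and the regular expressions are deliberately naive (for example, abbreviations such as "z.B." would be split incorrectly).

```ruby
# Hypothetical helpers for illustration only; not part of arbox/tokenizer.

# Splits text into word and punctuation tokens.
# \p{L} matches any Unicode letter (including umlauts), \p{N} any digit.
# Words may contain inner apostrophes or hyphens; every other
# non-space character becomes a token of its own.
def simple_tokenize(text)
  text.scan(/[\p{L}\p{N}]+(?:['-][\p{L}\p{N}]+)*|[^\s\p{L}\p{N}]/)
end

# Naively splits text into sentences at runs of ., ! or ?.
# A trailing fragment without end punctuation is kept as a sentence.
def split_sentences(text)
  text.scan(/[^.!?]+[.!?]+|[^.!?]+\z/).map(&:strip).reject(&:empty?)
end

tokens = simple_tokenize("Ich gehe in die Schule!")
# tokens is ["Ich", "gehe", "in", "die", "Schule", "!"]

sentences = split_sentences("Hallo Welt. Wie geht's?")
# sentences is ["Hallo Welt.", "Wie geht's?"]
```

A real tokenizer usually adds language-specific rules (abbreviation lists, clitics, number formats) on top of a pattern like this, which is where per-language support for German, English, and Dutch comes in.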

No commits in the last 6 months.

Use this if you need to precisely segment text into its constituent linguistic units (sentences and words) for tasks like sentiment analysis, machine translation, or text classification.

Not ideal if you need advanced linguistic features beyond basic tokenization, as some features are still under development.

linguistics natural-language-processing text-analysis computational-linguistics language-technology

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 8 / 25

Maturity 16 / 25

Community 17 / 25

Language

Ruby

License

—

Higher-rated alternatives

google/sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

daac-tools/vibrato

🎤 vibrato: Viterbi-based accelerated tokenizer

OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

Systemcluster/kitoken

Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...

daac-tools/vaporetto

🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
