JuliaText/WordTokenizers.jl
High performance tokenizers for natural language processing and other related tasks
This package breaks raw text into meaningful units such as individual words and sentences, the usual first step in any text analysis. It takes a block of text as input and returns a structured list of words or sentences. Anyone working with text data for research, content analysis, or language processing would find it useful.
100 stars. No commits in the last 6 months.
Use this if you need to precisely segment text into words or sentences for further analysis, especially if you're working with diverse languages or specific text formats like social media posts.
Not ideal if you're looking for a complete natural language understanding solution, as this tool focuses solely on text segmentation and not on deeper linguistic analysis like part-of-speech tagging or sentiment analysis.
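The word/sentence segmentation described above can be sketched with the package's exported functions. This is a minimal usage sketch, assuming the `tokenize` and `split_sentences` entry points from WordTokenizers.jl; exact token boundaries depend on which of the bundled tokenizers is active.

```julia
using WordTokenizers

# Split a block of text into word-level tokens; punctuation
# is separated into its own tokens by the default tokenizer.
words = tokenize("Brown foxes jump, don't they?")

# Split a block of text into sentences using the rule-based
# sentence splitter.
sentences = split_sentences("One sentence. And another!")
```

The package also ships alternative tokenizers (e.g. for tweets), and `set_tokenizer` lets you swap which one `tokenize` dispatches to, so downstream code stays unchanged.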
Stars: 100
Forks: 25
Language: Julia
License: —
Category:
Last pushed: Dec 30, 2021
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/JuliaText/WordTokenizers.jl"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer