JuliaText/WordTokenizers.jl
High performance tokenizers for natural language processing and other related tasks
This package breaks raw text into meaningful units such as individual words and sentences, the usual first step in any text analysis. It takes a block of text as input and returns a structured list of words or sentences. Anyone working with text data for research, content analysis, or language processing would find it useful.
100 stars. No commits in the last 6 months.
Use this if you need to precisely segment text into words or sentences for further analysis, especially if you're working with diverse languages or specific text formats like social media posts.
Not ideal if you're looking for a complete natural language understanding solution, as this tool focuses solely on text segmentation and not on deeper linguistic analysis like part-of-speech tagging or sentiment analysis.
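The word/sentence segmentation described above can be sketched with the package's exported functions. This is a minimal usage sketch, assuming the `tokenize` and `split_sentences` entry points from WordTokenizers.jl; exact token boundaries depend on which of the bundled tokenizers is active.

```julia
using WordTokenizers

# Split a block of text into word-level tokens; punctuation
# is separated into its own tokens by the default tokenizer.
words = tokenize("Brown foxes jump, don't they?")

# Split a block of text into sentences using the rule-based
# sentence splitter.
sentences = split_sentences("One sentence. And another!")
```

The package also ships alternative tokenizers (e.g. for tweets), and `set_tokenizer` lets you swap which one `tokenize` dispatches to, so downstream code stays unchanged.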
Stars: 100
Forks: 25
Language: Julia
License: —
Category:
Last pushed: Dec 30, 2021
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/JuliaText/WordTokenizers.jl"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer