daac-tools/python-vibrato
Viterbi-based accelerated tokenizer (Python wrapper)
This tool helps analyze Japanese text by breaking it down into individual words or meaningful units, a process called tokenization or morphological analysis. You provide raw Japanese text, and it outputs a list of tokens, each with its surface form and grammatical features. It's designed for natural language processing engineers, computational linguists, or data scientists working with Japanese text data.
No commits in the last 6 months.
Use this if you need to quickly and accurately segment Japanese sentences into their constituent words and understand their grammatical roles for tasks like text analysis, search, or machine translation.
Not ideal if you're working with languages other than Japanese, or if you do not have pre-trained tokenization models ready to use.
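The input/output flow described above (raw Japanese text in, tokens with a surface form and features out) can be sketched as follows. This is a minimal sketch, not the project's reference usage: it assumes the `vibrato` module exposes a `Vibrato(dict_bytes)` constructor, a `tokenize(text)` method, and tokens with `surface()` and `feature()` accessors. The dictionary path and the helper names `load_tokenizer` and `describe` are hypothetical.

```python
from typing import List, Sequence, Tuple


def load_tokenizer(dict_path: str):
    """Build a tokenizer from an already-decompressed system dictionary.

    Assumption: `vibrato.Vibrato(dict_bytes)` accepts the raw dictionary
    bytes. `dict_path` is a hypothetical local file; a compressed dictionary
    would need to be decompressed first.
    """
    # Imported lazily so that `describe` below works without the package.
    import vibrato

    with open(dict_path, "rb") as fp:
        return vibrato.Vibrato(fp.read())


def describe(tokens: Sequence) -> List[Tuple[str, str]]:
    """Return (surface, feature) pairs for a token sequence.

    Only `len()`, integer indexing, and the assumed `surface()` and
    `feature()` accessors are used on the tokens.
    """
    return [
        (tokens[i].surface(), tokens[i].feature())
        for i in range(len(tokens))
    ]
```

With a dictionary on disk, usage would look like `describe(load_tokenizer("system.dic").tokenize("社長は火星猫だ"))`, where `system.dic` is a placeholder for a pre-trained dictionary file. Keeping the formatting helper separate from dictionary loading makes it easy to check without a model.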
Stars: 43
Forks: 1
Language: Rust
License: Apache-2.0
Category:
Last pushed: Sep 04, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/daac-tools/python-vibrato"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
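The same request can be made from Python with only the standard library. This is a sketch under stated assumptions: the `category/owner/repo` path layout is inferred from the single URL shown above, and the response is assumed to be JSON, which the page does not confirm. The function names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (path layout inferred from the one example)."""
    return f"{API_BASE}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for one repository.

    Assumption: the endpoint returns a JSON body. If it does not,
    json.loads raises a ValueError.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "daac-tools", "python-vibrato")` issues the same GET request as the curl command. Unauthenticated calls count against the 100 requests/day limit.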
Higher-rated alternatives
google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato: 🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer: Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken: Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
daac-tools/vaporetto: 🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer