daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
Vaporetto is a tokenizer for Japanese text, aimed at researchers and computational linguists who need to split raw Japanese into individual words or other meaningful units. You feed it unsegmented Japanese sentences, and it outputs the same text with spaces inserted between the identified tokens, ready for further linguistic analysis or other natural language processing tasks.
254 stars and 13,274 monthly downloads.
Use this if you need to quickly and accurately segment Japanese text into words, either by using pre-trained models or by training your own custom models based on specific linguistic data.
Not ideal if you are working with languages other than Japanese, or if your primary need is general-purpose text processing rather than specific linguistic segmentation.
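To make the input/output behavior described above concrete, here is a minimal sketch of pointwise boundary prediction in Rust. It is not Vaporetto's API or its model format: the names `predict_boundaries`, `segment`, and `toy_weights` are hypothetical, and the weight table is hand-made for this one sentence. A real model would learn such scores from annotated data and would use richer features than a single pair of adjacent characters. The sketch only shows the core idea: each gap between two characters is judged independently, and a space is inserted wherever the score is positive.

```rust
use std::collections::HashMap;

/// Decide every character boundary independently (the "pointwise" idea).
/// Boundary i sits between chars[i] and chars[i + 1].
/// Pairs missing from the table score 0, which means "no split".
fn predict_boundaries(chars: &[char], weights: &HashMap<(char, char), i32>) -> Vec<bool> {
    chars
        .windows(2)
        .map(|w| weights.get(&(w[0], w[1])).copied().unwrap_or(0) > 0)
        .collect()
}

/// Rebuild the text, inserting a space at every predicted boundary.
fn segment(text: &str, weights: &HashMap<(char, char), i32>) -> String {
    let chars: Vec<char> = text.chars().collect();
    let boundaries = predict_boundaries(&chars, weights);
    let mut out = String::with_capacity(text.len() + boundaries.len());
    for (i, c) in chars.iter().enumerate() {
        out.push(*c);
        // The last character has no boundary after it.
        if i < boundaries.len() && boundaries[i] {
            out.push(' ');
        }
    }
    out
}

/// Hand-made toy weights (an assumption for illustration, not trained values).
/// Positive means "split here", negative means "keep together".
fn toy_weights() -> HashMap<(char, char), i32> {
    HashMap::from([
        (('東', '京'), -4),
        (('京', 'に'), 3),
        (('に', '行'), 2),
        (('行', 'く'), -1),
    ])
}

fn main() {
    let weights = toy_weights();
    // Prints: 東京 に 行く
    println!("{}", segment("東京に行く", &weights));
}
```

Because each boundary is scored on its own, the decisions do not depend on one another, which is the property that distinguishes pointwise prediction from lattice-based approaches such as Viterbi decoding.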
Stars: 254
Forks: 10
Language: Rust
License: Apache-2.0
Category:
Last pushed: Feb 07, 2026
Monthly downloads: 13,274
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/daac-tools/vaporetto"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
Related tools
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
soaxelbrooke/python-bpe
Byte Pair Encoding for Python!