daac-tools/python-vaporetto
🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
This tool breaks Japanese text into individual words or meaningful units, much as spaces separate words in written English. It takes a block of Japanese text as input and outputs a list of tokens, optionally with part-of-speech tags and pronunciations. It's designed for natural language processing engineers and researchers working on Japanese text analysis.
No commits in the last 6 months. Available on PyPI.
Use this if you need a fast and lightweight way to segment Japanese sentences into words for tasks like text mining, sentiment analysis, or machine translation.
Not ideal if you're not a developer and are looking for a ready-to-use application with a graphical interface for Japanese text segmentation.
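A minimal usage sketch, based on the API shown in the project's README: the `Vaporetto` constructor takes raw (zstd-decompressed) model bytes, and `tokenize()` returns tokens exposing `surface()` and `tag()` accessors. The model filename below is a placeholder; pretrained models are distributed separately and this will not run without one.

```python
# Sketch only: assumes python-vaporetto's README API and a locally
# downloaded pretrained model (the .zst path here is a placeholder).
import zstandard  # pretrained models are distributed zstd-compressed
import vaporetto

with open("path/to/model.zst", "rb") as fp:
    model = zstandard.ZstdDecompressor().stream_reader(fp).read()

# predict_tags=True also returns POS tags / pronunciations
# when the model supports them.
tokenizer = vaporetto.Vaporetto(model, predict_tags=True)

tokens = tokenizer.tokenize("まぁ社長は火星猫だ")
for token in tokens:
    print(token.surface(), token.tag(0))
```

Tokenization itself is pointwise classification over character positions, which is what makes it fast relative to lattice-based tokenizers.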
Stars: 21
Forks: 1
Language: Rust
License: Apache-2.0
Category:
Last pushed: Jun 01, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/daac-tools/python-vaporetto"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer