taishi-i/toiro
A tool for comparing tokenizers
When working with Japanese text for natural language processing, this tool helps you evaluate and select the best tokenizer for your specific needs. Given raw Japanese text, it shows how each supported tokenization method segments the text into words, how fast each one runs, and how each performs on downstream tasks such as text classification. This makes it valuable for data scientists and NLP engineers focused on Japanese language processing.
121 stars. Available on PyPI.
Use this if you need to choose the most suitable Japanese tokenizer for a new project and want to compare options based on speed, segmentation accuracy, or downstream application performance.
Not ideal if your work exclusively involves languages other than Japanese, as its primary focus and features are tailored for Japanese text processing.
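A typical comparison looks like the sketch below, which assumes toiro's documented `tokenizers` module (`pip install toiro` plus at least one backend, e.g. `pip install toiro[janome]`). The import is guarded so the snippet degrades gracefully when toiro is not installed.

```python
# Minimal sketch of comparing Japanese tokenizers with toiro.
# Assumes toiro and at least one tokenizer backend are installed.
try:
    from toiro import tokenizers
    toiro_available = True
except ImportError:
    toiro_available = False

if toiro_available:
    text = "彼女はペンを持っています。"  # "She is holding a pen."
    # List the tokenizer backends toiro detects in this environment.
    print(tokenizers.available_tokenizers())
    # Segment the text with every available backend and report timings.
    print(tokenizers.compare(text))
else:
    print("toiro is not installed; install it to run the comparison")
```

Each backend's segmentation and elapsed time appear side by side, which is the quickest way to shortlist candidates before running a downstream-task benchmark.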
Stars: 121
Forks: 9
Language: Python
License: Apache-2.0
Category:
Last pushed: Nov 09, 2025
Commits (30d): 0
Dependencies: 6
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/taishi-i/toiro"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
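The same endpoint can be queried from Python. The helper below is a sketch assuming only the URL shape shown in the curl example; the `build_url` and `fetch_quality` names and the `category` parameter are illustrative, not part of the API.

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def build_url(repo: str, category: str = "nlp") -> str:
    """Compose the endpoint URL for a repo such as 'taishi-i/toiro'."""
    return f"{BASE}/{category}/{repo}"

def fetch_quality(repo: str, category: str = "nlp") -> dict:
    """Fetch the quality record as parsed JSON (100 requests/day without a key)."""
    with urllib.request.urlopen(build_url(repo, category)) as resp:
        return json.load(resp)

print(build_url("taishi-i/toiro"))
```

Only the standard library is used, so the helper drops into scripts or CI checks without extra dependencies.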
Related tools
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
soaxelbrooke/python-bpe
Byte Pair Encoding for Python!