skorani/tokenizer
An open-source, high-level Persian tokenizer
If you're working with Persian text, this tool breaks sentences or documents into meaningful units, or "tokens": it takes raw Persian text and outputs a list of its constituent semantic tokens. This is essential for tasks like text analysis and search, and the tool is designed for anyone who needs to process Persian language data.
No commits in the last 6 months.
Use this if you need to prepare Persian text for any kind of computational analysis or natural language processing.
Not ideal if your Persian text has not already been cleaned of punctuation and extra spaces, as this tool requires normalized input.
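Since the tokenizer expects normalized input, a small pre-cleaning pass is usually needed first. The sketch below is illustrative only: the repo does not document its exact normalization rules, so the punctuation set and whitespace handling here are assumptions.

```python
import re

# Punctuation to strip: common Persian marks (،؛؟«») plus Latin
# punctuation. This set is an assumption, not the repo's spec.
_PUNCT = re.compile(r"[«»؟،؛!?.,:;'\"()\[\]]")

def normalize_fa(text: str) -> str:
    """Minimal pre-cleaning sketch: drop punctuation, collapse runs of
    whitespace, and trim the ends."""
    text = _PUNCT.sub(" ", text)      # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text)  # collapse whitespace runs
    return text.strip()

print(normalize_fa("سلام،  دنیا!"))  # → "سلام دنیا"
```

A pass like this would run before handing text to the tokenizer; adjust the punctuation class to whatever the tool actually rejects.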
Stars: 10
Forks: 2
Language: Jupyter Notebook
License: MIT
Category: NLP
Last pushed: Feb 20, 2020
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/skorani/tokenizer"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
Higher-rated alternatives
google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato: 🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer: Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken: Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
soaxelbrooke/python-bpe: Byte Pair Encoding for Python!