skorani/tokenizer
An open-source, high-level Persian tokenizer
If you're working with Persian text, this tool breaks sentences or documents into meaningful units, or "tokens": it takes raw Persian text and outputs a list of its constituent semantic tokens. This is essential for tasks like text analysis and search, and the tool is designed for anyone who needs to process Persian language data.
No commits in the last 6 months.
Use this if you need to prepare Persian text for any kind of computational analysis or natural language processing.
Not ideal if your Persian text has not already been cleaned of punctuation and extra spaces, as this tool requires normalized input.
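Since the tokenizer expects normalized input, a small pre-cleaning pass is usually needed first. The sketch below is illustrative only: the repo does not document its exact normalization rules, so the punctuation set and whitespace handling here are assumptions.

```python
import re

# Punctuation to strip: common Persian marks (،؛؟«») plus Latin
# punctuation. This set is an assumption, not the repo's spec.
_PUNCT = re.compile(r"[«»؟،؛!?.,:;'\"()\[\]]")

def normalize_fa(text: str) -> str:
    """Minimal pre-cleaning sketch: drop punctuation, collapse runs of
    whitespace, and trim the ends."""
    text = _PUNCT.sub(" ", text)      # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text)  # collapse whitespace runs
    return text.strip()

print(normalize_fa("سلام،  دنیا!"))  # → "سلام دنیا"
```

A pass like this would run before handing text to the tokenizer; adjust the punctuation class to whatever the tool actually rejects.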
Stars: 10
Forks: 2
Language: Jupyter Notebook
License: MIT
Category: NLP
Last pushed: Feb 20, 2020
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/skorani/tokenizer"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
Higher-rated alternatives
google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato: 🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer: Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken: Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
soaxelbrooke/python-bpe: Byte Pair Encoding for Python!