natasha/razdel
Rule-based token and sentence segmentation for the Russian language
This tool helps anyone working with Russian text split sentences into individual tokens (words and punctuation marks) and longer texts into separate sentences. You provide raw Russian text, and it returns a list of its constituent parts. It is well suited to linguists, researchers, and data analysts processing large volumes of Russian-language content.
279 stars. Used by 4 other packages. No commits in the last 6 months. Available on PyPI.
Use this if you need to accurately split Russian news articles, fiction, or similar formal texts into words and sentences for further analysis.
Not ideal if your Russian text comes from social media, scientific papers, or legal documents, as its rules are optimized for news and fiction.
Stars
279
Forks
34
Language
Python
License
MIT
Category
NLP
Last pushed
Jul 24, 2023
Commits (30d)
0
Reverse dependents
4
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/natasha/razdel"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related tools
EmilStenstrom/conllu
A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested python dictionary.
OpenPecha/Botok
🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
taishi-i/nagisa
A Japanese tokenizer based on recurrent neural networks
zaemyung/sentsplit
A flexible sentence segmentation library using CRF model and regex rules
polm/cutlet
Japanese to romaji converter in Python