natasha/nerus

Large silver-standard Russian corpus with NER, morphology, and syntax markup


Emerging

This project provides a large corpus of Russian news articles from Lenta.ru with automatically generated (silver-standard) linguistic annotation. Each text carries markup for morphology (parts of speech and grammatical features), syntax, and named entities such as people, locations, and organizations. Researchers and language model developers working with Russian text can use the dataset to train and evaluate natural language processing systems.
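To make the three annotation layers concrete, here is a minimal sketch of reading one annotated sentence. It assumes a CoNLL-U-like layout (ten tab-separated columns per token) with the entity label stored as a BIO tag in the last column in the form `Tag=B-PER`. That layout, the hand-written sample sentence, and the helper names `parse_tokens` and `extract_entities` are illustrative assumptions, not taken from the corpus or its loader.

```python
# Illustrative sample only: hand-written, not copied from the corpus.
# Assumed columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
SAMPLE = "\n".join([
    "# text = Анна живёт в Москве",
    "1\tАнна\t_\tPROPN\t_\tCase=Nom|Gender=Fem|Number=Sing\t2\tnsubj\t_\tTag=B-PER",
    "2\tживёт\t_\tVERB\t_\tNumber=Sing|Person=3|Tense=Pres\t0\troot\t_\tTag=O",
    "3\tв\t_\tADP\t_\t_\t4\tcase\t_\tTag=O",
    "4\tМоскве\t_\tPROPN\t_\tCase=Loc|Gender=Fem|Number=Sing\t2\tobl\t_\tTag=B-LOC",
])


def parse_tokens(block):
    """Turn the token lines of one sentence into dicts (hypothetical helper)."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        # MISC column holds key=value pairs; the NER tag is assumed to be "Tag"
        misc = dict(kv.split("=", 1) for kv in cols[9].split("|") if "=" in kv)
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "pos": cols[3],       # morphology: part of speech
            "feats": cols[5],     # morphology: grammatical features
            "head": int(cols[6]), # syntax: index of the head token (0 = root)
            "rel": cols[7],       # syntax: dependency relation
            "tag": misc.get("Tag", "O"),
        })
    return tokens


def extract_entities(tokens):
    """Group BIO tags into (type, text) spans (hypothetical helper)."""
    spans = []
    current_type, words = None, []
    for tok in tokens:
        tag = tok["tag"]
        if tag.startswith("B-"):
            if words:
                spans.append((current_type, " ".join(words)))
            current_type, words = tag[2:], [tok["form"]]
        elif tag.startswith("I-") and words and tag[2:] == current_type:
            words.append(tok["form"])
        else:
            if words:
                spans.append((current_type, " ".join(words)))
            current_type, words = None, []
    if words:
        spans.append((current_type, " ".join(words)))
    return spans


tokens = parse_tokens(SAMPLE)
entities = extract_entities(tokens)
print(entities)
```

Each token dict exposes the morphology (`pos`, `feats`), syntax (`head`, `rel`), and entity (`tag`) layers side by side, so a tagger, parser, or NER model can be evaluated against the same tokens.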

No commits in the last 6 months. Available on PyPI.

Use this if you need a pre-annotated, large-scale Russian text corpus for developing or testing linguistic analysis tools, especially for named entity recognition, morphology, and syntax.

Not ideal if you're looking for a simple tool to analyze individual Russian texts without needing a large dataset for model training or evaluation.

Russian Language Processing, Natural Language Understanding, Corpus Linguistics, Named Entity Recognition, Morphological Analysis

Stale (6 months), no dependents

Maintenance 0 / 25

Adoption 9 / 25

Maturity 25 / 25

Community 14 / 25


Language: Python

License: MIT

Higher-rated alternatives

Helsinki-NLP/OpusFilter

OpusFilter - Parallel corpus processing toolkit

natasha/corus

Links to Russian corpora + Python functions for loading and parsing

darija-open-dataset/dataset

darija <-> english dataset

omicsNLP/Auto-CORPus

Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...

SergeyShk/ruTS

A library for extracting statistics from Russian-language texts.
