maxoodf/russian_news_corpus
Russian mass media stemmed texts corpus / Corpus of lemmatized (morphologically normalized) texts from Russian mass media
This corpus provides a collection of approximately 1.5 million news articles from 27 top Russian online media sources, covering April 2016 to March 2017. It offers the articles in a 'stemmed' or morphologically normalized format, making them ready for text analysis. Researchers, linguists, or data scientists studying Russian language trends, media content, or developing language models would find this useful.
No commits in the last 6 months.
Use this if you need a large, pre-processed dataset of Russian news articles for linguistic research, natural language processing projects, or training machine learning models.
Not ideal if you require a live feed of news, an un-stemmed corpus, or data beyond the specified 2016-2017 timeframe.
Stars: 93
Forks: 8
Language: —
License: Apache-2.0
Category:
Last pushed: Apr 04, 2017
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/maxoodf/russian_news_corpus"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
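The same request can be made from a script. The following is a minimal Python sketch using only the standard library, not an official client. It makes two assumptions that the page does not confirm: that the URL path is laid out as category/owner/repository (inferred from the single curl example above), and that the response body is JSON. The names `build_quality_url` and `fetch_quality` are hypothetical helpers introduced for illustration.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(owner: str, repo: str, category: str = "nlp") -> str:
    """Build the endpoint URL.

    Assumption: the path is <category>/<owner>/<repo>, as in the curl example.
    Each segment is percent-encoded so unusual characters cannot break the path.
    """
    path = "/".join(quote(part, safe="") for part in (category, owner, repo))
    return f"{BASE_URL}/{path}"


def fetch_quality(owner: str, repo: str, category: str = "nlp", timeout: float = 10.0):
    """Send a GET request and decode the body.

    Assumption: the service answers with JSON; the schema is not documented here,
    so the decoded object is returned as-is.
    """
    request = urllib.request.Request(
        build_quality_url(owner, repo, category),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (performs a network request and counts against the daily limit):
#   data = fetch_quality("maxoodf", "russian_news_corpus")
#   print(json.dumps(data, indent=2, ensure_ascii=False))
```

Since unauthenticated access is limited to 100 requests/day, it may be worth caching the decoded response locally rather than calling the endpoint repeatedly.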
Higher-rated alternatives
Helsinki-NLP/OpusFilter
OpusFilter - Parallel corpus processing toolkit
natasha/corus
Links to Russian corpora + Python functions for loading and parsing
SergeyShk/ruTS
A library for extracting statistics from Russian-language texts.
darija-open-dataset/dataset
darija <-> english dataset
omicsNLP/Auto-CORPus
Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...