Koziev/NLP_Datasets
My NLP datasets for Russian language
This project provides pre-collected and pre-processed Russian language datasets, primarily for developing conversational AI. It offers large collections of dialogues from various sources like imageboards, movie subtitles, and literature, along with paraphrased sentences and short sentence patterns. These datasets are ideal for developers, researchers, or data scientists working on Russian natural language processing models, especially for building chatbots or dialogue systems.
386 stars. No commits in the last 6 months.
Use this if you need extensive, ready-to-use Russian text data for training or evaluating conversational AI, text generation, or natural language understanding models.
Not ideal if you require datasets in languages other than Russian, or if your NLP task is highly specialized and requires domain-specific data not covered here.
Stars: 386
Forks: 55
Language: C#
License: CC0-1.0
Category:
Last pushed: Feb 18, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/Koziev/NLP_Datasets"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
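The curl request above can also be issued from a script. The following is a minimal Python sketch, not an official client. It assumes that the three trailing path segments are a category, a repository owner, and a repository name (inferred only from the shape of the URL), and that the response body is JSON; neither is confirmed by this page. The names `build_quality_url` and `fetch_quality` are hypothetical helpers introduced for illustration.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build an endpoint URL with the same shape as the curl example.

    ASSUMPTION: the path segments mean category/owner/repo.
    """
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for one repository.

    ASSUMPTION: the response body is JSON; the page does not document it.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (performs a network request, so it is left commented out):
# record = fetch_quality("nlp", "Koziev", "NLP_Datasets")
```

With the keyless limit of 100 requests/day stated above, a script that polls many repositories would likely need to cache results or use a key.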
Higher-rated alternatives
Helsinki-NLP/OpusFilter
OpusFilter - Parallel corpus processing toolkit
natasha/corus
Links to Russian corpora + Python functions for loading and parsing
SergeyShk/ruTS
A library for extracting statistics from Russian-language texts.
darija-open-dataset/dataset
darija <-> english dataset
omicsNLP/Auto-CORPus
Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...