somosnlp/corpus-es

List of Spanish NLP corpora ✨ #Somos600M: Help develop inclusive AI that understands the different varieties of our languages ✨ English-speaking contributors welcome!

/ 100

Emerging

This project helps natural language processing (NLP) practitioners, researchers, and AI developers build more inclusive AI models that accurately understand and generate the diverse varieties of Spanish spoken by 600 million people. It gathers Spanish text, audio, and image datasets into a centralized collection that supports the development of Spanish NLP applications. Anyone creating or improving AI that interacts with Spanish speakers would benefit.

No commits in the last 6 months.

Use this if you are a data scientist, AI researcher, or machine learning engineer working on natural language processing for the Spanish language and need diverse, high-quality datasets covering different regional dialects, registers, and domains.

Not ideal if you are looking for ready-to-use, pre-trained models or a tool for immediate application of NLP without building or fine-tuning models.

Spanish-language AI, NLP dataset curation, linguistic diversity, machine learning data, AI model training

No License, Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 7 / 25

Maturity 8 / 25

Community 15 / 25


Stars

Forks

Language: Python

License: —

Higher-rated alternatives

Helsinki-NLP/OpusFilter

OpusFilter - Parallel corpus processing toolkit

natasha/corus

Links to Russian corpora + Python functions for loading and parsing

SergeyShk/ruTS

A library for extracting statistics from Russian-language texts.

darija-open-dataset/dataset

darija <-> english dataset

omicsNLP/Auto-CORPus

Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...
