microsoft/Clandestino
Repository for the Clandestino corpus
Clandestino is a Spanish toxic language dataset for training AI models to identify harmful speech across different Spanish-speaking regions. It pairs raw Spanish text with labels indicating various forms of toxicity, accounting for regional nuances and informal spellings. This resource is for data scientists, machine learning engineers, and researchers building or evaluating content moderation systems for Spanish-language platforms.
No commits in the last 6 months.
Use this if you need a diverse, locale-aware dataset to improve the accuracy of Spanish toxic language detection in AI models, especially for a global Spanish-speaking audience.
Not ideal if your project requires an exhaustive dataset of all possible problematic language, as it may not capture every nuance, or if you need a dataset for languages other than Spanish.
Stars: 10
Forks: 3
Language: not specified
License: MIT
Category:
Last pushed: Jul 02, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/microsoft/Clandestino"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
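The same request can be made from a script. The following is a minimal Python sketch, not an official client: the helper names `build_quality_url` and `fetch_quality` are hypothetical, the reading of the path segments as category, owner, and repository is inferred from the curl example above, and the response is assumed (not confirmed here) to be JSON.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL, assuming the path is /<category>/<owner>/<repo>."""
    # Percent-encode each segment so unusual characters cannot break the path.
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for one repository without an API key.

    Assumption: the body is JSON. If it is not, json.loads raises ValueError.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return json.loads(body)


if __name__ == "__main__":
    # Prints the URL only; call fetch_quality(...) to perform the request.
    print(build_quality_url("nlp", "microsoft", "Clandestino"))
```

Keyless requests count against the 100 requests/day limit, so a script that looks up many repositories may want to cache results locally.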
Higher-rated alternatives
Helsinki-NLP/OpusFilter
OpusFilter - Parallel corpus processing toolkit
natasha/corus
Links to Russian corpora + Python functions for loading and parsing
darija-open-dataset/dataset
darija <-> english dataset
omicsNLP/Auto-CORPus
Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...
SergeyShk/ruTS
Library for extracting statistics from Russian-language texts.