microsoft/Clandestino

Repository for the Clandestino corpus

Emerging

Clandestino is a Spanish toxic-language dataset for training AI models to identify harmful speech across different Spanish-speaking regions. It pairs raw Spanish text with labels indicating various forms of toxicity, accounting for regional nuances and informal spellings. The resource is aimed at data scientists, machine learning engineers, and researchers building or evaluating content moderation systems for Spanish-language platforms.

No commits in the last 6 months.

Use this if you need a diverse, locale-aware dataset to improve the accuracy of Spanish toxic language detection in AI models, especially for a global Spanish-speaking audience.

Not ideal if your project requires an exhaustive dataset of all possible problematic language (Clandestino may not capture every nuance) or if you need a dataset for languages other than Spanish.

content-moderation natural-language-processing hispanic-studies social-media-analysis ai-safety

Stale 6m No Package No Dependents

Maintenance 2 / 25

Adoption 5 / 25

Maturity 16 / 25

Community 14 / 25

License

MIT

Higher-rated alternatives

Helsinki-NLP/OpusFilter

OpusFilter - Parallel corpus processing toolkit

natasha/corus

Links to Russian corpora + Python functions for loading and parsing

darija-open-dataset/dataset

darija <-> english dataset

omicsNLP/Auto-CORPus

Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...

SergeyShk/ruTS

A library for extracting statistics from Russian-language texts.
