microsoft/Clandestino

Repository for the Clandestino corpus

37 / 100 (Emerging)

Clandestino is a Spanish toxic-language dataset for training AI models to identify harmful speech across different Spanish-speaking regions. It pairs raw Spanish text with labels indicating various forms of toxicity, accounting for regional nuances and informal spellings. It is intended for data scientists, machine learning engineers, and researchers building or evaluating content moderation systems for Spanish-language platforms.

No commits in the last 6 months.

Use this if you need a diverse, locale-aware dataset to improve the accuracy of Spanish toxic language detection in AI models, especially for a global Spanish-speaking audience.

Not ideal if your project requires an exhaustive dataset of all possible problematic language (the corpus may not capture every nuance) or a dataset for languages other than Spanish.

content-moderation, natural-language-processing, hispanic-studies, social-media-analysis, ai-safety
Stale 6m · No Package · No Dependents
Maintenance 2 / 25
Adoption 5 / 25
Maturity 16 / 25
Community 14 / 25
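
The page does not state how the overall score is derived. As a quick sanity check, the sketch below assumes the overall score is the plain sum of the four category scores, each out of 25; that assumption matches the numbers shown here but is not confirmed by the source.

```python
# Category scores as listed on this page, each out of 25.
category_scores = {
    "Maintenance": 2,
    "Adoption": 5,
    "Maturity": 16,
    "Community": 14,
}
MAX_PER_CATEGORY = 25

# Assumption: the overall score is the unweighted sum of the categories.
overall = sum(category_scores.values())
overall_max = MAX_PER_CATEGORY * len(category_scores)

print(f"{overall} / {overall_max}")  # prints "37 / 100"
```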


Stars: 10
Forks: 3
Language: (not listed)
License: MIT
Last pushed: Jul 02, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/microsoft/Clandestino"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
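
For use from a script rather than the command line, the same request can be sketched in Python with only the standard library. The helper names `quality_url` and `fetch_quality` are hypothetical. The sketch assumes the URL follows a category/owner/repository pattern (inferred from the single curl example above) and that the response body is JSON, neither of which is documented here.

```python
import json
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL.

    Assumption: the path is <category>/<owner>/<repo>, generalized from
    the one example URL shown on this page.
    """
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record for a repository.

    Assumption: the endpoint returns a JSON body. No API key is sent,
    so the keyless daily request limit applies.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "microsoft", "Clandestino")` issues the same GET request as the curl command. The response schema is not described on this page, so inspect the returned object before relying on specific fields.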