s-nlp/parallel_detoxification_dataset
Data from the paper "Crowdsourcing of Parallel Corpora: the Case of Style Transfer for Detoxification".
This dataset helps content moderators, online community managers, and social media platforms automatically identify and rephrase toxic online comments into civil language. It provides pairs of original toxic sentences and their human-generated, detoxified counterparts. The data can be used to train AI models that can automatically transform harmful user-generated content into acceptable text.
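As a rough illustration of how toxic/detoxified pairs might be prepared for training, the sketch below parses tab-separated text into (source, target) tuples. The column names `toxic` and `neutral`, the TSV layout, and the sample row are all assumptions made for this example; check the repository's files for the actual format.

```python
import csv
import io

# Hypothetical column names; the real files may use different headers.
TOXIC_COL = "toxic"
NEUTRAL_COL = "neutral"


def read_pairs(tsv_text):
    """Parse tab-separated text with a header row into (toxic, neutral) pairs.

    Rows where either side is empty are skipped.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    pairs = []
    for row in reader:
        toxic = (row.get(TOXIC_COL) or "").strip()
        neutral = (row.get(NEUTRAL_COL) or "").strip()
        if toxic and neutral:
            pairs.append((toxic, neutral))
    return pairs


# Made-up sample data, not taken from the dataset.
SAMPLE = (
    "toxic\tneutral\n"
    "this idea is damn stupid\tthis idea is not good\n"
    "\tmissing source\n"
)

pairs = read_pairs(SAMPLE)
for source, target in pairs:
    print(f"{source!r} -> {target!r}")
```

Each resulting tuple can serve as one source/target example for a sequence-to-sequence model, with the toxic sentence as input and the rewritten sentence as the reference output.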
No commits in the last 6 months.
Use this if you need to build or evaluate systems that automatically detect and rewrite toxic user comments into neutral, polite versions.
Not ideal if you're looking for a tool that performs the detoxification directly, as this is a dataset for training models, not a ready-to-use application.
Stars: 14
Forks: 2
Language: not specified
License: not specified
Category:
Last pushed: Apr 03, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/s-nlp/parallel_detoxification_dataset"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
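The same request can be made from Python with only the standard library. This is a minimal sketch: it assumes the endpoint returns JSON and that the URL pattern generalizes to other `owner/repo` pairs, neither of which the page states. The helper names `quality_url` and `fetch_quality` are made up for this example, and since the page does not say how a key is passed, the sketch makes a keyless request.

```python
import json
import urllib.request

# Base path taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/nlp"


def quality_url(owner, repo):
    """Build the endpoint URL for a repository (pattern assumed from the one example)."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner, repo, timeout=10.0):
    """Send a GET request and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(quality_url(owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


url = quality_url("s-nlp", "parallel_detoxification_dataset")
print(url)
```

Calling `fetch_quality("s-nlp", "parallel_detoxification_dataset")` performs the actual network request, which counts against the daily request limit described above.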
Higher-rated alternatives
unitaryai/detoxify
Trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. Built...
kensk8er/chicksexer
A Python package for gender classification.
Infinitode/ValX
ValX is an open-source Python package for text cleaning tasks, including profanity detection and...
PavelOstyakov/toxic
Toxic Comment Classification Challenge
minerva-ml/open-solution-toxic-comments
Open solution to the Toxic Comment Classification Challenge