s-nlp/parallel_detoxification_dataset

Data from the paper "Crowdsourcing of Parallel Corpora: the Case of Style Transfer for Detoxification"

23 / 100
Experimental

This dataset helps content moderators, online community managers, and social media platforms automatically identify and rephrase toxic online comments into civil language. It provides pairs of original toxic sentences and their human-generated, detoxified counterparts. The data can be used to train AI models that can automatically transform harmful user-generated content into acceptable text.

No commits in the last 6 months.

Use this if you need to build or evaluate systems that automatically detect and rewrite toxic user comments into neutral, polite versions.

Not ideal if you're looking for a tool that performs the detoxification directly, as this is a dataset for training models, not a ready-to-use application.

content-moderation online-safety natural-language-processing community-management social-media-management
No License · Stale 6m · No Package · No Dependents
Maintenance 0 / 25
Adoption 5 / 25
Maturity 8 / 25
Community 10 / 25
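
The four subscores above add up to the overall 23 / 100. The short check below shows that arithmetic. Whether the site actually computes the total as a plain sum is an assumption, not something the page states.

```python
# Subscores copied from the listing; each category is out of 25.
subscores = {"Maintenance": 0, "Adoption": 5, "Maturity": 8, "Community": 10}
MAX_PER_CATEGORY = 25

# Assumption: the overall score is the plain sum of the four subscores.
total = sum(subscores.values())
maximum = MAX_PER_CATEGORY * len(subscores)

print(f"{total} / {maximum}")
```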


Stars: 14
Forks: 2
Language: (not listed)
License: none
Last pushed: Apr 03, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/s-nlp/parallel_detoxification_dataset"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
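
The same request can be made from Python using only the standard library. This is a minimal sketch: the helper names `quality_url` and `fetch_quality` are hypothetical, and reading the three path segments as category, owner, and repository is an assumption based on the curl URL above. The response format is not documented here, so the body is returned as raw text rather than parsed.

```python
import urllib.request
from urllib.parse import quote

# Base path taken from the curl command in the listing.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (hypothetical helper).

    Assumption: the path is <category>/<owner>/<repo>, as the curl URL suggests.
    """
    path = "/".join(quote(part, safe="") for part in (category, owner, repo))
    return f"{BASE_URL}/{path}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the quality record and return the response body as text.

    The body is not parsed because its format is not stated in the listing.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_quality("nlp", "s-nlp", "parallel_detoxification_dataset")` issues the same GET request as the curl command. Unauthenticated calls count against the 100 requests/day limit.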