ChenghaoMou/text-dedup

All-in-one text de-duplication

Quality score: 63 / 100 (Established)

This tool helps content creators, data analysts, and researchers clean large collections of text by identifying and removing duplicate or near-duplicate entries. You provide a text dataset (such as articles, social media posts, or documents), and it outputs a cleaned version containing only unique entries, which improves data quality for tasks like training language models or competitive analysis. It is designed for anyone who works with substantial amounts of text and needs to keep repeated content out of the data.

746 stars. Available on PyPI.

Use this if you need to eliminate exact or near-duplicate text entries from large datasets, such as news articles, academic papers, or user-generated content.

Not ideal if you're dealing with small text batches or if your primary goal is nuanced semantic analysis rather than direct content similarity.
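To make the distinction between exact and near-duplicate entries concrete, here is a minimal sketch in plain Python. It is not text-dedup's API or implementation: the function names, the word 3-gram shingling, and the 0.7 Jaccard threshold are all assumptions chosen for illustration. It compares every document against every kept document, which is only practical for small inputs; tools built for large datasets generally rely on sketching techniques such as MinHash with locality-sensitive hashing instead.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of word n-grams (shingles) of the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs: list[str], threshold: float = 0.7) -> list[str]:
    """Keep the first occurrence of each document.

    A document is dropped if it is an exact duplicate (same hash after
    normalization) or a near duplicate (Jaccard similarity of shingle sets
    at or above the threshold) of a document that was already kept.
    The threshold value is an illustrative assumption.
    """
    seen_hashes: set[str] = set()
    kept: list[str] = []
    kept_shingles: list[set[str]] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, other) >= threshold for other in kept_shingles):
            continue  # near duplicate
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog",
        "the quick  brown fox jumps over the lazy dog",       # exact after normalization
        "The quick brown fox jumps over the lazy dog today",  # near duplicate
        "An entirely different sentence about databases",
    ]
    for doc in deduplicate(sample):
        print(doc)
```

Because the comparison is based on shared word sequences, two texts that say the same thing in different words are not treated as duplicates, which matches the caveat above about semantic analysis.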

Tags: data-cleaning, content-moderation, text-analysis, research-data-preparation
Maintenance 10 / 25
Adoption 10 / 25
Maturity 25 / 25
Community 18 / 25


Stars: 746
Forks: 75
Language: Python
License: Apache-2.0
Last pushed: Mar 09, 2026
Commits (30d): 0
Dependencies: 15

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/ChenghaoMou/text-dedup"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
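The same request can be made from Python using only the standard library. This is a sketch under stated assumptions: the URL pattern (category, then owner, then repository) is inferred from the single example URL above, the response is assumed to be JSON, and the helper names `quality_url` and `fetch_quality` are hypothetical. The response fields are not documented here, so the sketch returns the parsed payload without interpreting it.

```python
import json
import urllib.request

# Base path taken from the example URL; the category/owner/repo layout is an
# assumption inferred from that one example.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the request URL for one repository (hypothetical helper)."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch and parse the response, assuming the endpoint returns JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    data = fetch_quality("nlp", "ChenghaoMou", "text-dedup")
    print(json.dumps(data, indent=2))
```

With the keyless limit of 100 requests/day, it is worth caching the result locally rather than calling the endpoint repeatedly.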