ChenghaoMou/text-dedup

All-in-one text de-duplication

/ 100

Established

This tool helps content creators, data analysts, and researchers clean large collections of text by identifying and removing duplicate or near-duplicate entries. You provide a text dataset (such as articles, social media posts, or documents), and it outputs a deduplicated version, improving data quality for tasks such as training language models or competitive analysis. It is designed for anyone who works with large volumes of text and needs to remove redundant entries.

746 stars. Available on PyPI.

Use this if you need to eliminate exact or near-duplicate text entries from large datasets, such as news articles, academic papers, or user-generated content.

Not ideal if you're dealing with small text batches or if your primary goal is nuanced semantic analysis rather than direct content similarity.
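To make the distinction between exact and near-duplicate removal concrete, here is a minimal sketch using only the Python standard library. It is not the text-dedup API: the function names (`normalize`, `shingles`, `jaccard`, `deduplicate`), the 3-word shingle size, and the 0.7 similarity threshold are all assumptions chosen for illustration. Exact duplicates are caught by hashing normalized text; near-duplicates are caught by Jaccard similarity over word n-grams.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams (shingles) of the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        # Too short for a full n-gram: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold=0.7, n=3):
    """Keep the first occurrence of each document; drop exact and near copies.

    The threshold and shingle size are illustrative assumptions, not
    defaults taken from any library.
    """
    seen_hashes = set()
    kept = []  # (document, shingle set) pairs retained so far
    for doc in docs:
        # Exact-duplicate check: hash of the normalized text.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        # Near-duplicate check: compare against every kept document.
        doc_shingles = shingles(doc, n)
        if any(jaccard(doc_shingles, other) >= threshold for _, other in kept):
            continue
        seen_hashes.add(digest)
        kept.append((doc, doc_shingles))
    return [doc for doc, _ in kept]


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog",
        "the quick  brown fox jumps over the lazy dog",
        "The quick brown fox jumps over the lazy dog today",
        "An entirely different sentence about text cleaning",
    ]
    for doc in deduplicate(sample):
        print(doc)
```

This sketch compares each document against every document already kept, so its cost grows quadratically with dataset size. For large datasets, near-duplicate detection usually relies on approximate techniques such as MinHash with locality-sensitive hashing instead of pairwise comparison.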

data-cleaning content-moderation text-analysis research-data-preparation

Maintenance 10 / 25

Adoption 10 / 25

Maturity 25 / 25

Community 18 / 25


Stars: 746

Forks

Language: Python

License: Apache-2.0

Related tools

loretoparisi/fasttext.js

FastText for Node.js

messense/fasttext-serving

fastText model serving service

gagan3012/PolyDeDupe

PolyDeDupe: Multi-Lingual Data Deduplication

vrasneur/pyfasttext

Yet another Python binding for fastText

olegtarasov/FastText.NetWrapper

.NET Standard wrapper for fastText library. Now works on Windows, Linux and MacOs!
