onesuper/HuggingFace-Datasets-Text-Quality-Analysis
Retrieves parquet files from Hugging Face and uses pandas to identify and quantify junk data, duplication, contamination, and biased content in a dataset.
This tool helps you quickly assess the quality of text datasets available on Hugging Face. It takes a Hugging Face dataset as input and generates an analysis report detailing junk data, duplicates, contamination, and biased content. Anyone preparing a text dataset for machine learning model training would find this useful.
No commits in the last 6 months.
Use this if you need to quickly understand the inherent quality issues in a Hugging Face text dataset before using it for your machine learning project.
Not ideal if you are working with extremely large datasets on a standard machine, as it may encounter memory issues.
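To illustrate the kind of check described above, here is a minimal pandas sketch that computes a duplicate ratio and a junk ratio for a text column. It is not the repository's actual implementation: the function name `quality_summary`, the column name `text`, and the junk rule (fewer than `min_chars` characters, or no letters at all) are assumptions chosen for illustration. Contamination and bias checks are not covered.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame, text_col: str = "text", min_chars: int = 10) -> dict:
    """Return simple duplicate and junk ratios for one text column.

    Hypothetical helper for illustration; the thresholds are assumptions.
    """
    # Normalise missing values to empty strings and trim whitespace.
    texts = df[text_col].fillna("").astype(str).str.strip()
    total = len(texts)
    if total == 0:
        return {"rows": 0, "duplicate_ratio": 0.0, "junk_ratio": 0.0}

    # Rows whose text exactly repeats an earlier row.
    duplicates = int(texts.duplicated().sum())

    # Assumed junk rule: too short, or containing no ASCII letters.
    too_short = texts.str.len() < min_chars
    no_letters = ~texts.str.contains(r"[A-Za-z]", regex=True)
    junk = int((too_short | no_letters).sum())

    return {
        "rows": total,
        "duplicate_ratio": duplicates / total,
        "junk_ratio": junk / total,
    }


# Small in-memory sample; a real dataset could be loaded with
# pd.read_parquet(path_to_parquet_file) instead.
sample = pd.DataFrame({"text": [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",
    "12345 !!!",
    "ok",
    None,
]})
summary = quality_summary(sample)
print(summary)
```

Because the whole column is held in memory, a sketch like this runs into the same limits on very large datasets that the note above mentions.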
Stars: 53
Forks: 3
Language: Python
License: —
Category:
Last pushed: Jul 06, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/onesuper/HuggingFace-Datasets-Text-Quality-Analysis"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
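The same request can be sketched in Python with the standard library. The URL layout (`/quality/<category>/<owner>/<repo>`) is inferred from the single curl example above, and the assumption that the endpoint returns JSON is not confirmed by this page; `quality_url` and `fetch_quality` are hypothetical helper names.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; the rest of the layout is an assumption.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the request URL for one repository (path layout assumed)."""
    return f"{BASE_URL}/{quote(category)}/{quote(owner)}/{quote(repo)}"


def fetch_quality(url: str, timeout: float = 10.0):
    """Fetch and decode the response, assuming the body is JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


url = quality_url("nlp", "onesuper", "HuggingFace-Datasets-Text-Quality-Analysis")
print(url)
# To perform the request (network access required):
# data = fetch_quality(url)
```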