SulRash/huggingface-text-data-analyzer
Analyzes text datasets from Hugging Face for training LLMs.
This tool helps AI developers and researchers understand the characteristics of text datasets from Hugging Face before using them to train large language models. It takes a dataset, optionally with a tokenizer, and outputs detailed reports on text length, word distribution, junk content, part-of-speech tags, named entities, language, and sentiment. This helps you quickly assess data quality and relevance for your specific model training goals.
No commits in the last 6 months. Available on PyPI.
Use this if you need to thoroughly inspect and profile a Hugging Face text dataset to ensure its suitability for training a large language model, or to identify areas for data cleaning and preprocessing.
Not ideal if you are looking for a general-purpose text analysis tool for small, non-Hugging Face datasets or for deep qualitative research that requires nuanced manual interpretation.
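To make the text-length and word-distribution reports described above more concrete, here is a minimal sketch of that kind of profiling using only the Python standard library. It is not the tool's own code or API: the function name `profile_texts`, its `top_k` parameter, and the whitespace-based word splitting are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, median


def profile_texts(texts, top_k=3):
    """Hypothetical helper: basic length and word-frequency stats for a list of strings."""
    # Naive whitespace tokenization (an assumption; a real tokenizer would differ).
    word_lists = [t.lower().split() for t in texts]
    lengths = [len(words) for words in word_lists]
    counts = Counter(w for words in word_lists for w in words)
    return {
        "num_texts": len(texts),
        "mean_words": mean(lengths) if lengths else 0.0,
        "median_words": median(lengths) if lengths else 0.0,
        "min_words": min(lengths, default=0),
        "max_words": max(lengths, default=0),
        # Empty rows are one simple signal of junk content.
        "empty_texts": sum(1 for n in lengths if n == 0),
        "top_words": counts.most_common(top_k),
    }


sample = ["the cat sat", "the dog", ""]
report = profile_texts(sample)
print(report)
```

With the Hugging Face `datasets` library, the same function could be fed a text column from a loaded dataset, for example `profile_texts(ds["text"])`, assuming the dataset has a column named `text`.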
Stars: 8
Forks: (not shown)
Language: Python
License: Apache-2.0
Category: (not shown)
Last pushed: Dec 06, 2024
Commits (30d): 0
Dependencies: 9
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/SulRash/huggingface-text-data-analyzer"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
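The same request can be made from Python. The sketch below uses only the standard library and the URL from the curl command above. Two things are assumptions: that other repositories follow the same `owner/repo` URL pattern, and that the endpoint returns JSON (the response format is not documented here). The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/nlp"


def quality_url(owner, repo):
    """Build the endpoint URL (assumes the owner/repo pattern shown in the curl example)."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner, repo, timeout=10):
    """Fetch the record and parse it as JSON (assumption: the API responds with JSON)."""
    with urllib.request.urlopen(quality_url(owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


url = quality_url("SulRash", "huggingface-text-data-analyzer")
print(url)
# To perform the request (counts against the daily limit):
# data = fetch_quality("SulRash", "huggingface-text-data-analyzer")
```

Keeping URL construction separate from the network call makes it easy to check the URL without spending any of the daily request quota.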
Higher-rated alternatives
ryanjgallagher/shifterator
Interpretable data visualizations for understanding how texts differ at the word level
HLasse/TextDescriptives
A Python library for calculating a large variety of metrics from text
jboynyc/textnets
Text analysis with networks.
DemetersSon83/Quantitative-Discursive-Analysis
A tool for quantitatively measuring discursive similarity between bodies of text.
sciknoworg/tib-sid
TIB-SID: A bilingual (English/German) dataset of library catalog records with GND subject...