nihaljn/datahawk
Viewer for text datasets in formats like HuggingFace, JSONL, etc.
This tool helps researchers, data scientists, and analysts quickly review and understand large text datasets stored in formats like HuggingFace or JSONL. You can load and explore raw text, code snippets, and associated metadata without downloading the entire dataset, then filter and sort the records to find patterns. It is designed for anyone who needs to browse and analyze textual data efficiently.
No commits in the last 6 months. Available on PyPI.
Use this if you need to visually explore and filter large text datasets, especially those containing code, to find specific examples or understand overall content trends without first downloading the full dataset.
Not ideal if you need a full suite of machine learning model training or complex statistical analysis capabilities, as this tool focuses purely on data viewing and initial exploration.
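The browsing pattern described above (reading records lazily, filtering them, and looking at only a few) can be illustrated with a minimal standard-library sketch. This is not datahawk's own API or implementation; the helper names `iter_jsonl` and `preview` and the sample records are hypothetical and exist only for illustration.

```python
import io
import json
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, reading lazily."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def preview(lines: Iterable[str], keep: Callable[[dict], bool], limit: int = 5) -> List[dict]:
    """Return at most `limit` records that satisfy `keep`, stopping early."""
    matches = (record for record in iter_jsonl(lines) if keep(record))
    return list(islice(matches, limit))


# Hypothetical in-memory sample standing in for a large JSONL file.
sample = io.StringIO(
    '{"text": "def add(a, b): return a + b", "lang": "python"}\n'
    '{"text": "hello world", "lang": "en"}\n'
    '{"text": "print(1)", "lang": "python"}\n'
)

python_rows = preview(sample, lambda r: r["lang"] == "python", limit=2)
for row in python_rows:
    print(row["text"])
```

Because `preview` consumes a generator through `islice`, it stops reading as soon as it has enough matches, so the same approach works on an open file handle without holding the whole file in memory.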
Stars: 15
Forks: 1
Language: Python
License: MIT
Category:
Last pushed: Feb 25, 2025
Commits (30d): 0
Dependencies: 3
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/nihaljn/datahawk"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
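The same request can be sketched in Python with only the standard library. Two things here are assumptions rather than documented behavior: that the path follows a category/owner/repo pattern (inferred from the single URL above) and that the endpoint returns JSON. The function names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (assumed pattern: category/owner/repo)."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch and decode the response, assuming the body is JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


url = quality_url("nlp", "nihaljn", "datahawk")
print(url)
```

Calling `fetch_quality("nlp", "nihaljn", "datahawk")` performs the network request; it counts against the daily request limit mentioned above.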
Higher-rated alternatives
ryanjgallagher/shifterator
Interpretable data visualizations for understanding how texts differ at the word level
HLasse/TextDescriptives
A Python library for calculating a large variety of metrics from text
jboynyc/textnets
Text analysis with networks.
DemetersSon83/Quantitative-Discursive-Analysis
A tool for quantitatively measuring discursive similarity between bodies of text.
sciknoworg/tib-sid
TIB-SID: A bilingual (English/German) dataset of library catalog records with GND subject...