minnesotanlp/infoVerse
Code for Jaehyung Kim et al.'s ACL 2023 paper, "infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information"
This tool helps machine learning engineers and researchers deeply understand and characterize their natural language processing (NLP) datasets. It takes your existing text datasets, processes them through various classifiers, and generates a 'meta-information' profile. This profile provides insights into dataset characteristics like complexity and diversity, which can then be used to inform decisions about data quality and model training.
No commits in the last 6 months.
Use this if you need to comprehensively analyze the properties of your NLP datasets to make informed decisions about data pruning, active learning strategies, or data annotation efforts.
Not ideal if you are looking for a simple data cleaning tool or if your primary goal is to train a model without needing deep insights into dataset characteristics.
Stars: 16
Forks: 1
Language: Python
License: MIT
Category
Last pushed: Jun 28, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/minnesotanlp/infoVerse"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
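The same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_quality_url` and `fetch_quality` are hypothetical, the idea that the path segments are category, owner, and repository is inferred from the single URL shown above, and the response format is not documented here, so the body is returned as raw text rather than parsed.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (hypothetical helper).

    Assumption: the path layout is <category>/<owner>/<repo>, inferred
    from the one example URL; it is not confirmed by any documentation.
    """
    parts = (category, owner, repo)
    return BASE_URL + "/" + "/".join(quote(part, safe="") for part in parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the endpoint and return the response body as text (hypothetical helper).

    The response format is an unknown here, so no parsing is attempted.
    """
    url = build_quality_url(category, owner, repo)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Example call (performs a network request, so it is left commented out):
# print(fetch_quality("nlp", "minnesotanlp", "infoVerse"))
```

With the keyless limit of 100 requests/day noted above, it may be worth caching the returned text locally rather than re-fetching on every run.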
Higher-rated alternatives
ymcui/cmrc2018
A Span-Extraction Dataset for Chinese Machine Reading Comprehension (CMRC 2018)
princeton-nlp/DensePhrases
[ACL 2021] Learning Dense Representations of Phrases at Scale; EMNLP'2021: Phrase Retrieval...
thunlp/MultiRD
Code and data of the AAAI-20 paper "Multi-channel Reverse Dictionary Model"
IndexFziQ/KMRC-Papers
A list of recent papers regarding knowledge-based machine reading comprehension.
danqi/rc-cnn-dailymail
CNN/Daily Mail Reading Comprehension Task