minnesotanlp/infoVerse

Jaehyung Kim et al.'s ACL 2023 paper, "infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information"

Experimental

This tool helps machine learning engineers and researchers deeply understand and characterize their natural language processing (NLP) datasets. It takes your existing text datasets, processes them through various classifiers, and generates a 'meta-information' profile. This profile provides insights into dataset characteristics like complexity and diversity, which can then be used to inform decisions about data quality and model training.
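As a rough illustration of what a per-sample 'meta-information' profile could look like, the sketch below summarizes the predicted probabilities of several classifiers for one example. It is not infoVerse's own API: the function name `meta_information`, the three chosen statistics (confidence, variability, entropy), and the sample probabilities are all assumptions made for this example.

```python
import math
from statistics import mean, pstdev


def meta_information(probs_per_classifier, gold_label):
    """Summarize one sample from several classifiers' predicted distributions.

    probs_per_classifier: list of probability vectors, one per classifier
                          (hypothetical input format).
    gold_label: index of the annotated class.
    """
    # Probability each classifier assigns to the annotated label
    gold_probs = [p[gold_label] for p in probs_per_classifier]
    # Shannon entropy (in nats) of each classifier's predicted distribution
    entropies = [
        -sum(q * math.log(q) for q in p if q > 0)
        for p in probs_per_classifier
    ]
    return {
        "confidence": mean(gold_probs),     # average belief in the gold label
        "variability": pstdev(gold_probs),  # disagreement across classifiers
        "entropy": mean(entropies),         # average predictive uncertainty
    }


# Made-up outputs of three classifiers on one binary-labelled sample
sample = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
profile = meta_information(sample, gold_label=0)
print(profile)
```

Samples with low confidence and high variability are typically the ones worth a second look when pruning or re-annotating data, which is the kind of decision the description above refers to.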

No commits in the last 6 months.

Use this if you need to comprehensively analyze the properties of your NLP datasets to make informed decisions about data pruning, active learning strategies, or data annotation efforts.

Not ideal if you are looking for a simple data cleaning tool or if your primary goal is to train a model without needing deep insights into dataset characteristics.

NLP dataset analysis, Machine learning engineering, Data quality assessment, Text data characterization, AI research

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 6 / 25

Maturity 16 / 25

Community 5 / 25

Language: Python

License: MIT

Higher-rated alternatives

ymcui/cmrc2018

A Span-Extraction Dataset for Chinese Machine Reading Comprehension (CMRC 2018)

princeton-nlp/DensePhrases

[ACL 2021] Learning Dense Representations of Phrases at Scale; EMNLP'2021: Phrase Retrieval...

thunlp/MultiRD

Code and data of the AAAI-20 paper "Multi-channel Reverse Dictionary Model"

IndexFziQ/KMRC-Papers

A list of recent papers regarding knowledge-based machine reading comprehension.

danqi/rc-cnn-dailymail

CNN/Daily Mail Reading Comprehension Task
