YangLinyi/GLUE-X

We leverage 14 datasets as OOD test data and evaluate 21 widely used models on 8 NLU tasks. Our findings confirm that OOD accuracy in NLP tasks deserves more attention, since a significant performance drop relative to ID accuracy was observed in all settings.

Score: 21 / 100 (Experimental)

This project helps machine learning engineers and NLP researchers evaluate the robustness of their natural language understanding models. It takes existing models and tests them against 14 diverse datasets that represent out-of-domain scenarios. The output reveals how well a model generalizes to new, unseen text, highlighting potential performance drops compared to in-domain accuracy.

No commits in the last 6 months.

Use this if you are a machine learning engineer or NLP researcher concerned about your model's real-world reliability and generalization beyond its original training data.

Not ideal if you are looking for a tool to train or fine-tune a language model, as this project focuses solely on out-of-distribution evaluation.

Tags: Natural Language Processing · Machine Learning Evaluation · Model Robustness · Out-of-Distribution Generalization · NLP Research
Badges: No License · Stale (6m) · No Package · No Dependents
Maintenance 0 / 25
Adoption 9 / 25
Maturity 8 / 25
Community 4 / 25


Stars: 93
Forks: 2
Language: Python
License: none
Last pushed: Aug 15, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/YangLinyi/GLUE-X"

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
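The same endpoint can be queried from Python. The sketch below is a minimal example assuming the endpoint returns a JSON payload; the `fetch_quality` helper and any response field names are illustrative, not a documented client API.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the quality-data endpoint URL for a repository."""
    return f"{API_BASE}/{quote(category)}/{quote(owner)}/{quote(repo)}"


def fetch_quality(category: str, owner: str, repo: str) -> dict:
    """Fetch and decode the JSON payload (schema is an assumption)."""
    with urlopen(quality_url(category, owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Same request as the curl example above:
    print(quality_url("nlp", "YangLinyi", "GLUE-X"))
```

Remember the anonymous tier is limited to 100 requests/day, so cache responses if you poll many repositories.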