nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
This evaluation framework helps AI researchers and developers assess how well multilingual large language models (LLMs) understand and answer questions across languages. You supply an LLM, and the framework reports its performance on three question-answering benchmarks (ARC, HellaSwag, MMLU) translated into 26 languages, showing how well the model generalizes beyond English. It is aimed at researchers building or fine-tuning LLMs for global applications.
132 stars. No commits in the last 6 months.
Use this if you are developing or fine-tuning large language models and need to rigorously benchmark their comprehension and reasoning abilities across a wide range of non-English languages.
Not ideal if you are a general user looking to apply an existing LLM for a specific task, rather than evaluating the model's underlying linguistic capabilities.
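To illustrate, here is a minimal Python sketch of working with the kind of per-language, per-task scores the benchmark produces. The numbers and the dictionary layout are hypothetical, not the framework's actual output format:

import statistics

# Hypothetical per-language accuracies on the three tasks (illustrative values only).
scores = {
    "vi": {"arc": 0.41, "hellaswag": 0.52, "mmlu": 0.38},
    "de": {"arc": 0.55, "hellaswag": 0.63, "mmlu": 0.47},
}

# Macro-averaging across tasks is a common way to compare languages at a glance.
for lang, tasks in scores.items():
    print(f"{lang}: {statistics.mean(tasks.values()):.3f}")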
Stars: 132
Forks: 18
Language: Python
License: Apache-2.0
Category: transformers
Last pushed: Aug 21, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/nlp-uoregon/mlmm-evaluation"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
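The same call from Python, as a minimal sketch using the requests library. Passing the key via an "X-API-Key" header is an assumption; check the API docs for the actual authentication scheme:

import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/nlp-uoregon/mlmm-evaluation"

api_key = None  # set to your free key for the 1,000 requests/day tier
# NOTE: the "X-API-Key" header name is an assumption, not confirmed here.
headers = {"X-API-Key": api_key} if api_key else {}

resp = requests.get(URL, headers=headers, timeout=10)
resp.raise_for_status()  # raise on 4xx/5xx responses
print(resp.json())       # repository quality metrics as JSON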
Higher-rated alternatives
eth-sri/matharena
Evaluation of LLMs on latest math competitions
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation
ShuntaroOkuma/adapt-gauge-core
Measure LLM adaptation efficiency — how fast models learn from few examples