waltonfuture/Diff-eRank

[NeurIPS 2024] A Novel Rank-Based Metric for Evaluating Large Language Models

Quality score: 31 / 100 (Emerging)

This project offers a new way to evaluate Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs). It compares the internal representations of a trained LLM against those of an untrained (randomly initialized) version of the same model, producing a "Diff-eRank" score that quantifies how efficiently the model has learned to discard redundant information. It's for researchers, data scientists, and AI evaluators who need to assess the quality and efficiency of LLMs and MLLMs.
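As a rough illustration of the idea, the effective rank (eRank) of a representation matrix can be computed as the exponential of the Shannon entropy of its normalized covariance eigenvalues, with Diff-eRank taken as the drop in eRank from the untrained to the trained model. The sketch below is an assumption based on that description, not the repository's actual API; function names and the sign convention are illustrative:

```python
import numpy as np

def erank(reps: np.ndarray) -> float:
    """Effective rank of a (num_tokens x hidden_dim) representation matrix:
    exp of the Shannon entropy of the normalized covariance eigenvalues."""
    centered = reps - reps.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / reps.shape[0]
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard tiny negatives
    p = eig / eig.sum()                                # eigenvalue distribution
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))

def diff_erank(untrained_reps: np.ndarray, trained_reps: np.ndarray) -> float:
    """Diff-eRank sketch: how much effective rank training removed
    (higher = more redundancy discarded)."""
    return erank(untrained_reps) - erank(trained_reps)

# Synthetic stand-ins: near-isotropic "untrained" features vs. a version
# with a heavily skewed spectrum, mimicking a trained model's compression.
rng = np.random.default_rng(0)
raw = rng.standard_normal((256, 64))
compressed = (raw @ rng.standard_normal((64, 64))) * np.linspace(1.0, 0.01, 64)
print(diff_erank(raw, compressed))
```

In practice the representation matrices would come from a model's hidden states over a text corpus (e.g. the last-layer hidden states of a Transformers model); the synthetic matrices above only demonstrate the arithmetic.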

No commits in the last 6 months.

Use this if you need an alternative, information-theory-based metric to quantify how well an LLM or MLLM processes and compresses information during training, especially when traditional metrics like loss and accuracy don't fully capture what you need.

Not ideal if you are looking for metrics related to the model's external performance on specific tasks, like response quality or factual accuracy, rather than its internal representational efficiency.

LLM-evaluation AI-model-assessment natural-language-processing-research machine-learning-engineering multi-modal-AI
Stale (6m) · No Package · No Dependents
Maintenance 2 / 25
Adoption 8 / 25
Maturity 16 / 25
Community 5 / 25


Stars: 57
Forks: 2
Language: Python
License: Apache-2.0
Last pushed: May 28, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/waltonfuture/Diff-eRank"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.