SeekingDream/Static-to-Dynamic-LLMEval
The official GitHub repository of the paper "Recent Advances in Large Language Model Benchmarks Against Data Contamination: From Static to Dynamic Evaluation"
This survey helps AI researchers and practitioners understand and mitigate data contamination when evaluating large language models (LLMs). It analyzes existing static and dynamic benchmarking methods designed to prevent inflated performance scores, and distills them into a guide and proposed design principles for building more reliable LLM evaluations.
Use this if you are developing or evaluating large language models and need to ensure your benchmark results accurately reflect the model's capabilities without bias from contaminated training data.
Not ideal if you are looking for an off-the-shelf tool to directly run benchmarks; this project is a research survey providing insights and guidelines rather than executable code for immediate evaluation.
Stars: 547
Forks: 45
Language: —
License: —
Category:
Last pushed: Mar 03, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/SeekingDream/Static-to-Dynamic-LLMEval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
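For scripted access, the same endpoint can be piped into a JSON pretty-printer. A minimal sketch, assuming the endpoint returns a JSON body (the response schema is not documented on this page):

# Fetch the repo's quality data and pretty-print it.
# Assumes a JSON response; adjust if the API returns another format.
curl -s "https://pt-edge.onrender.com/api/v1/quality/llm-tools/SeekingDream/Static-to-Dynamic-LLMEval" \
  | python3 -m json.tool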
Higher-rated alternatives
EvolvingLMMs-Lab/lmms-eval: One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas: Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit: Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
EuroEval/EuroEval: The robust European language model benchmark.
Giskard-AI/giskard-oss: 🐢 Open-Source Evaluation & Testing library for LLM Agents