onejune2018/Awesome-LLM-Eval

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of LLMs, aiming to explore the technical boundaries of generative AI.

Score: 48 / 100 (Emerging)

This project helps AI researchers, machine learning engineers, and data scientists understand and benchmark the performance of large language models (LLMs). It brings together a comprehensive list of tools, datasets, benchmarks, and leaderboards. Practitioners can use this to identify relevant resources for evaluating LLMs, whether for general capabilities, domain-specific tasks, or specific attributes like inference speed or factuality.

616 stars.

Use this if you need a centralized, up-to-date resource to find tools, datasets, and methodologies for evaluating large language models across various criteria and domains.

Not ideal if you are looking for an off-the-shelf software library to run evaluations directly, as this is primarily a curated list of external resources.

LLM evaluation, Generative AI research, NLP benchmarking, AI model assessment, Machine learning engineering
No Package · No Dependents
Maintenance: 6 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 16 / 25

How are scores calculated? Each of the four categories above is scored out of 25, and the category scores sum to the overall score: 6 + 10 + 16 + 16 = 48 / 100.

Stars: 616
Forks: 51
Language: not specified
License: MIT
Last pushed: Nov 24, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/onejune2018/Awesome-LLM-Eval"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
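For programmatic use, here is a minimal Python sketch of the same request. It assumes the endpoint returns JSON and simply prints the raw payload rather than guessing at undocumented field names; the helper name fetch_quality_report is hypothetical.

import json
import urllib.request

# Endpoint copied from the curl example above; no API key is needed
# for up to 100 requests per day.
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/onejune2018/Awesome-LLM-Eval"

def fetch_quality_report(url: str = URL) -> dict:
    """Fetch the quality report and decode it as JSON (assumes a JSON response)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Print the raw payload; the exact field names (e.g. stars, score)
    # are not documented here, so inspect the output before relying on them.
    print(json.dumps(fetch_quality_report(), indent=2))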