Joinn99/RocketEval-ICLR
🚀 [ICLR '25] RocketEval: Efficient Automated LLM Evaluation via Grading Checklist
Quickly and automatically assess how well different large language models (LLMs) respond to your prompts. You supply a list of questions or prompts plus each model's responses; the tool generates a grading checklist per prompt, scores every response against it, and returns a ranking of the models. It is aimed at AI researchers and developers who need to systematically compare LLMs and select the best performer for their application.
No commits in the last 6 months.
Use this if you need an efficient, automated way to evaluate the quality of multiple LLM responses against a set of criteria without extensive manual review.
Not ideal if you only need to evaluate a single LLM or if your evaluation criteria are too nuanced for checklist-based grading.
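To make the checklist idea concrete, here is a minimal, self-contained Python sketch of checklist-based grading. It is an illustration only: the names (ChecklistItem, score) and the sample verdicts are hypothetical, not RocketEval's actual API; in the real pipeline the yes/no verdicts would come from a judge LLM applying the generated checklist to each response.

from dataclasses import dataclass
from statistics import mean

@dataclass
class ChecklistItem:
    criterion: str   # one yes/no grading criterion for a prompt
    passed: bool     # whether the response satisfies it

def score(items):
    # A response's score is the fraction of checklist items it passes.
    return mean(item.passed for item in items)

# Hypothetical verdicts for two models on a single prompt.
verdicts = {
    "model-a": [ChecklistItem("Answers the question directly", True),
                ChecklistItem("Gives a worked example", False)],
    "model-b": [ChecklistItem("Answers the question directly", True),
                ChecklistItem("Gives a worked example", True)],
}

# Rank models by checklist score (higher is better).
for model in sorted(verdicts, key=lambda m: score(verdicts[m]), reverse=True):
    print(f"{model}: {score(verdicts[model]):.2f}")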
Stars: 15
Forks: 8
Language: Python
License: MIT
Category: NLP
Last pushed: Aug 21, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/Joinn99/RocketEval-ICLR"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
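For scripted access, the same data can be fetched from Python. A minimal sketch, assuming only that the endpoint returns JSON (its exact schema is not documented here):

import requests  # third-party HTTP client: pip install requests

URL = "https://pt-edge.onrender.com/api/v1/quality/nlp/Joinn99/RocketEval-ICLR"

resp = requests.get(URL, timeout=10)  # no API key needed up to 100 requests/day
resp.raise_for_status()               # raise on HTTP errors
data = resp.json()                    # assumed JSON; schema not documented here
print(data)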
Higher-rated alternatives
google/langfun: OO for LLMs
tanaos/artifex: Small Language Model Inference, Fine-Tuning and Observability. No GPU, no labeled data needed.
preligens-lab/textnoisr: Add random noise to a text dataset while precisely controlling the quality of the result.
vulnerability-lookup/VulnTrain: A tool to generate datasets and models based on vulnerability descriptions from @Vulnerability-Lookup.
masakhane-io/masakhane-mt: Machine Translation for Africa