IAAR-Shanghai/xVerify
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
This tool helps researchers, educators, and evaluators quickly and accurately assess the correctness of answers generated by AI reasoning models. It takes the original question, the known correct answer, and the AI's generated reasoning process and final answer as input. It then determines if the AI's answer is correct, even when the formatting or language differs, outputting a judgment of 'Correct' or 'Incorrect'. This is ideal for anyone who needs to systematically evaluate the performance of large language models on objective tasks.
Use this if you need to reliably evaluate the accuracy of AI-generated answers for objective questions, especially when responses include complex reasoning, various mathematical notations, or natural language variations.
Not ideal if your questions are open-ended, subjective, or require nuanced human judgment beyond clear-cut objective correctness.
Stars: 144
Forks: 7
Language: Jupyter Notebook
License: —
Category: —
Last pushed: Nov 13, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/IAAR-Shanghai/xVerify"
Open to everyone: 100 requests/day with no key needed. Get a free API key for 1,000 requests/day.
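The same endpoint can be called programmatically. Below is a minimal Python sketch using only the standard library; the structure of the returned JSON is an assumption here, so inspect the actual response before relying on specific fields.

```python
# Sketch of querying the quality API for a repo.
# The response schema is NOT documented here -- treat fields as assumptions.
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for a given GitHub owner/repo pair."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record as a dict; raises HTTPError on failure
    (e.g. when the 100 requests/day keyless limit is exceeded)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)
```

Usage would look like `fetch_quality("IAAR-Shanghai", "xVerify")`, which mirrors the `curl` command above; the free-tier rate limit applies per the note above.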
Higher-rated alternatives
MMMU-Benchmark/MMMU
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal...
pat-jj/DeepRetrieval
[COLM’25] DeepRetrieval — 🔥 Training Search Agent by RLVR with Retrieval Outcome
lupantech/MathVista
MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts
x66ccff/liveideabench
[𝐍𝐚𝐭𝐮𝐫𝐞 𝐂𝐨𝐦𝐦𝐮𝐧𝐢𝐜𝐚𝐭𝐢𝐨𝐧𝐬] 🤖💡 LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea...
ise-uiuc/magicoder
[ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct