IAAR-Shanghai/xVerify
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
This tool helps researchers, educators, and evaluators quickly and accurately assess the correctness of answers generated by AI reasoning models. It takes the original question, the known correct answer, and the AI's generated reasoning process and final answer as input. It then determines if the AI's answer is correct, even when the formatting or language differs, outputting a judgment of 'Correct' or 'Incorrect'. This is ideal for anyone who needs to systematically evaluate the performance of large language models on objective tasks.
Use this if you need to reliably evaluate the accuracy of AI-generated answers for objective questions, especially when responses include complex reasoning, various mathematical notations, or natural language variations.
Not ideal if your questions are open-ended, subjective, or require nuanced human judgment beyond clear-cut objective correctness.
Stars: 144
Forks: 7
Language: Jupyter Notebook
License: —
Category: —
Last pushed: Nov 13, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/IAAR-Shanghai/xVerify"
Open to everyone: 100 requests/day with no key needed. Get a free API key for 1,000 requests/day.
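The same endpoint can be called programmatically. Below is a minimal Python sketch using only the standard library; the structure of the returned JSON is an assumption here, so inspect the actual response before relying on specific fields.

```python
# Sketch of querying the quality API for a repo.
# The response schema is NOT documented here -- treat fields as assumptions.
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for a given GitHub owner/repo pair."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record as a dict; raises HTTPError on failure
    (e.g. when the 100 requests/day keyless limit is exceeded)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)
```

Usage would look like `fetch_quality("IAAR-Shanghai", "xVerify")`, which mirrors the `curl` command above; the free-tier rate limit applies per the note above.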
Higher-rated alternatives
MMMU-Benchmark/MMMU
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal...
pat-jj/DeepRetrieval
[COLM’25] DeepRetrieval — 🔥 Training Search Agent by RLVR with Retrieval Outcome
lupantech/MathVista
MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts
x66ccff/liveideabench
[𝐍𝐚𝐭𝐮𝐫𝐞 𝐂𝐨𝐦𝐦𝐮𝐧𝐢𝐜𝐚𝐭𝐢𝐨𝐧𝐬] 🤖💡 LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea...
ise-uiuc/magicoder
[ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct