QwenLM/PolyMath
[NeurIPS 2025 D&B Track] Evaluation Code Repo for Paper "PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts"
This project evaluates how well AI models solve math problems across many languages and difficulty levels. You supply a model's solution attempts in various languages, and the evaluation code outputs a score reflecting the model's accuracy and consistency. It is designed for AI researchers and developers who build or compare large language models (LLMs) and need to rigorously test their mathematical reasoning abilities.
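To make "accuracy and consistency" concrete, here is a minimal sketch of one way such a score could be summarized. It is not the repository's actual metric: the record format, the function names, and the use of the standard deviation across languages as a stand-in for consistency are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import pstdev


def accuracy_by_language(records):
    """Return {language: fraction correct} from (language, is_correct) pairs.

    The (language, is_correct) record shape is hypothetical, not the
    repository's own data format.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for language, is_correct in records:
        totals[language] += 1
        correct[language] += int(is_correct)
    return {lang: correct[lang] / totals[lang] for lang in totals}


def summarize(records):
    """Summarize per-language accuracy, its mean, and its spread.

    "spread" (population standard deviation of the per-language accuracies)
    is an illustrative proxy for cross-language consistency: lower means the
    model performs more evenly across languages.
    """
    per_language = accuracy_by_language(records)
    if not per_language:
        raise ValueError("records must contain at least one entry")
    scores = list(per_language.values())
    return {
        "per_language": per_language,
        "mean_accuracy": sum(scores) / len(scores),
        "spread": pstdev(scores),
    }


# Toy graded attempts: two problems in each of three languages.
records = [
    ("en", True), ("en", True),
    ("zh", True), ("zh", False),
    ("de", False), ("de", True),
]
summary = summarize(records)
print(summary)
```

In this toy input the model is perfect in English and at 50% in Chinese and German, so the mean accuracy is 2/3 and the nonzero spread flags the uneven performance across languages.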
No commits in the last 6 months.
Use this if you are developing or benchmarking large language models and need a standardized, high-quality, multilingual dataset and evaluation framework to assess their mathematical reasoning skills.
Not ideal if you are looking for a tool to teach or learn mathematics, or if you only need to evaluate basic arithmetic skills in a single language.
Stars: 42
Forks: 7
Language: Python
License: not listed
Category:
Last pushed: May 22, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/QwenLM/PolyMath"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000 requests/day.
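The same request can be made from Python using only the standard library. This is a sketch under assumptions: the page does not document the response format, so the code assumes the endpoint returns JSON, and the helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
from urllib.request import Request, urlopen

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def quality_url(owner, repo):
    """Build the endpoint URL for a repository given as owner and name."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner, repo, timeout=10.0):
    """Fetch the record for owner/repo and parse it.

    Assumption: the body is JSON. If the service returns something else,
    json.loads raises a ValueError.
    """
    request = Request(quality_url(owner, repo), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (performs a network request, so it is left commented out):
# data = fetch_quality("QwenLM", "PolyMath")
# print(data)
```

With the keyless limit of 100 requests/day, it is worth caching the result locally rather than calling `fetch_quality` repeatedly.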
Higher-rated alternatives
ExtensityAI/symbolicai
A neurosymbolic perspective on LLMs
TIGER-AI-Lab/MMLU-Pro
The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding...
deep-symbolic-mathematics/LLM-SR
[ICLR 2025 Oral] This is the official repo for the paper "LLM-SR" on Scientific Equation...
microsoft/interwhen
A framework for verifiable reasoning with language models.
zhudotexe/fanoutqa
Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language...