QwenLM/PolyMath

[NeurIPS 2025 D&B Track] Evaluation Code Repo for Paper "PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts"

Score: 31 / 100 (Emerging)

This project evaluates how well AI models solve math problems across many languages and difficulty levels. You supply a model's answers to math problems posed in various languages, and the evaluation code outputs a score reflecting the model's accuracy and consistency. It is designed for AI researchers and developers who build or compare large language models (LLMs) and need to rigorously test their mathematical reasoning abilities.

No commits in the last 6 months.

Use this if you are developing or benchmarking large language models and need a standardized, high-quality, multilingual dataset and evaluation framework to assess their mathematical reasoning skills.

Not ideal if you are looking for a tool to teach or learn mathematics, or if you only need to evaluate basic arithmetic skills in a single language.

Tags: AI model evaluation, Multilingual natural language processing, Mathematical reasoning, Large language model benchmarking, Research & development
Status: No License, Stale 6m, No Package, No Dependents
Maintenance 2 / 25
Adoption 8 / 25
Maturity 7 / 25
Community 14 / 25
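
The four category scores above add up to the overall 31 / 100. The page does not state how the overall score is derived, so the sketch below only assumes it is the plain sum of the categories, each capped at 25. The variable names are illustrative.

```python
# Category scores as listed on this page; each is out of 25.
category_scores = {
    "Maintenance": 2,
    "Adoption": 8,
    "Maturity": 7,
    "Community": 14,
}
MAX_PER_CATEGORY = 25

# Assumption: the overall score is the unweighted sum of the categories.
overall = sum(category_scores.values())
overall_max = MAX_PER_CATEGORY * len(category_scores)

print(f"{overall} / {overall_max}")  # prints "31 / 100"
```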


Stars: 42
Forks: 7
Language: Python
License: None
Last pushed: May 22, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/QwenLM/PolyMath"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
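
The same request can be made from Python with only the standard library. This is a minimal sketch: the endpoint URL is the one shown in the curl command, while the helper names `quality_url` and `fetch_quality` are hypothetical, and the response format is not documented on this page, so the code tries JSON and falls back to raw text.

```python
import json
import urllib.request
from urllib.parse import quote

# Endpoint prefix taken from the curl command on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def quality_url(owner: str, repo: str) -> str:
    """Build the request URL for a given GitHub owner/repo pair."""
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality data without an API key.

    Assumption: the body is JSON. If it does not parse, the raw text
    is returned instead.
    """
    with urllib.request.urlopen(quality_url(owner, repo), timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body


if __name__ == "__main__":
    # Each call counts against the 100 requests/day keyless limit.
    print(fetch_quality("QwenLM", "PolyMath"))
```

Because keyless access is limited to 100 requests/day, it may be worth caching the result locally rather than calling `fetch_quality` repeatedly.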