QwenLM/PolyMath

[NeurIPS 2025 D&B Track] Evaluation Code Repo for Paper "PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts"

Emerging

This project evaluates how well AI models solve math problems across many languages and difficulty levels. You supply a model's answers to math problems posed in various languages, and the evaluation code outputs a score reflecting the model's accuracy and consistency. It is designed for AI researchers and developers who build or compare large language models (LLMs) and need to rigorously test their mathematical reasoning abilities.
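To make the idea of scoring "accuracy and consistency" concrete, here is a minimal sketch in Python. It is not the repository's actual scoring code: the record fields (`lang`, `problem_id`, `correct`), both function names, and the definition of consistency used here (a problem counts as consistent when the model is either right in every language or wrong in every language) are assumptions made for illustration only.

```python
from collections import defaultdict


def per_language_accuracy(results):
    """Return {language: fraction of problems answered correctly}.

    `results` is a list of dicts with the hypothetical keys
    'lang', 'problem_id', and 'correct' (assumed format, not PolyMath's).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["lang"]] += 1
        hits[r["lang"]] += int(r["correct"])
    return {lang: hits[lang] / totals[lang] for lang in totals}


def cross_language_consistency(results):
    """Fraction of problems with the same outcome in every language attempted."""
    outcomes_by_problem = defaultdict(set)
    for r in results:
        outcomes_by_problem[r["problem_id"]].add(bool(r["correct"]))
    if not outcomes_by_problem:
        return 0.0
    consistent = sum(1 for o in outcomes_by_problem.values() if len(o) == 1)
    return consistent / len(outcomes_by_problem)


# Toy data: two problems, each attempted in English and Chinese.
results = [
    {"lang": "en", "problem_id": 1, "correct": True},
    {"lang": "zh", "problem_id": 1, "correct": True},
    {"lang": "en", "problem_id": 2, "correct": True},
    {"lang": "zh", "problem_id": 2, "correct": False},
]

print(per_language_accuracy(results))       # {'en': 1.0, 'zh': 0.5}
print(cross_language_consistency(results))  # 0.5
```

In this toy run the model solves both problems in English but only one in Chinese, so only the first problem counts as consistent. The project's real metrics may weight difficulty levels or define consistency differently.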

No commits in the last 6 months.

Use this if you are developing or benchmarking large language models and need a standardized, high-quality, multilingual dataset and evaluation framework to assess their mathematical reasoning skills.

Not ideal if you are looking for a tool to teach or learn mathematics, or if you only need to evaluate basic arithmetic skills in a single language.

AI model evaluation, Multilingual natural language processing, Mathematical reasoning, Large language model benchmarking, Research & development

No License, Stale 6m, No Package, No Dependents

Maintenance 2 / 25

Adoption 8 / 25

Maturity 7 / 25

Community 14 / 25

Language: Python

License: none

Higher-rated alternatives

ExtensityAI/symbolicai

A neurosymbolic perspective on LLMs

TIGER-AI-Lab/MMLU-Pro

The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding...

deep-symbolic-mathematics/LLM-SR

[ICLR 2025 Oral] This is the official repo for the paper "LLM-SR" on Scientific Equation...

microsoft/interwhen

A framework for verifiable reasoning with language models.

zhudotexe/fanoutqa

Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language...
