Math Reasoning Datasets LLM Tools

Datasets, benchmarks, and training resources specifically for mathematical reasoning tasks in LLMs, including word problems, visual math, problem generation, and mathematical text curation. Does NOT include general math tutoring platforms, creativity evaluation, or non-mathematical reasoning benchmarks.

There are 52 math reasoning dataset tools tracked. One scores above 50 (Established tier). The highest-rated is MMMU-Benchmark/MMMU at 52/100 with 548 stars.

Get the projects as JSON (the request below sets limit=20, so a single call may return fewer than all 52)

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=math-reasoning-datasets&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
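The same request can be sketched in Python using only the standard library. The endpoint and query parameters are copied from the curl command above; the function names `build_url` and `fetch_projects` are hypothetical, the response schema is not documented here and is therefore not assumed, and whether the endpoint accepts a `limit` larger than 20 is unverified.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken verbatim from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit: int = 20) -> str:
    """Build the request URL with the same query parameters as the curl call."""
    params = {
        "domain": "llm-tools",
        "subcategory": "math-reasoning-datasets",
        "limit": limit,  # assumption: larger values may or may not be accepted
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_projects(limit: int = 20, timeout: float = 30.0):
    """Send the GET request and decode the JSON body as-is.

    The payload structure is unknown here, so it is returned without
    any further interpretation.
    """
    with urlopen(build_url(limit), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects()` issues one keyless request, which counts against the 100 requests/day allowance mentioned above.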

Rank | Tool | Score | Tier
1 | MMMU-Benchmark/MMMU | 52 | Established
    This repo contains evaluation code for the paper "MMMU: A Massive...
2 | pat-jj/DeepRetrieval | 47 | Emerging
    [COLM’25] DeepRetrieval: 🔥 Training Search Agent by RLVR with Retrieval Outcome
3 | lupantech/MathVista | 47 | Emerging
    MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts
4 | x66ccff/liveideabench | 46 | Emerging
    [Nature Communications] 🤖💡 LiveIdeaBench: Evaluating LLMs' Scientific...
5 | ise-uiuc/magicoder | 45 | Emerging
    [ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct
6 | sherryzyh/physical_reasoning_toolkit | 43 | Emerging
    A Python toolkit for physical reasoning in LLMs and VLMs. This toolkit...
7 | iiis-ai/AutoMathText-V2 | 42 | Emerging
    AutoMathText-V2: A 2.46 Trillion Token AI-Curated STEM Pretraining Dataset
8 | gsarti/verbalized-rebus | 41 | Emerging
    Materials for "Non Verbis, Sed Rebus: Large Language Models are Weak Solvers...
9 | IAAR-Shanghai/xVerify | 40 | Emerging
    xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
10 | SuperBruceJia/Awesome-LLM-Self-Consistency | 39 | Emerging
    Awesome LLM Self-Consistency: a curated list of Self-consistency in Large...
11 | microsoft/repoclassbench | 39 | Emerging
    [ICML DMLR 2024] Repo that contains code for the paper titled: "Class-Level...
12 | JunyiYe/CreativeMath | 38 | Emerging
    [AAAI 2025] Assessing the Creativity of LLMs in Proposing Novel Solutions to...
13 | rxlqn/awesome-llm-self-reflection | 37 | Emerging
    augmented LLM with self reflection
14 | GAIR-NLP/MathPile | 37 | Emerging
    [NeurIPS D&B 2024] Generative AI for Math: MathPile
15 | intuit-ai-research/DCR-consistency | 36 | Emerging
    DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and...
16 | killthefullmoon/PhyX | 36 | Emerging
    PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
17 | yecchen/MIRAI | 36 | Emerging
    Code and Data for "MIRAI: Evaluating LLM Agents for Event Forecasting"
18 | EngineeringSoftware/codeditor | 35 | Emerging
    Multilingual Code Co-Evolution Using Large Language Models
19 | lt-asset/REPOCOD | 35 | Emerging
    For our ACL25 Paper: Can Language Models Replace Programmers? RepoCod Says...
20 | bigai-nlco/LooGLE | 34 | Emerging
    ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models
21 | thehsansaeed/Questions-for-AI-Model-Testing | 33 | Emerging
    This repository contains a curated set of logical, mathematical, and...
22 | zjunlp/ReCode | 33 | Emerging
    [AAAI 2026] ReCode: Reinforced Code Knowledge Editing for API Updates
23 | artificial-scientist-lab/SciMuse | 32 | Emerging
    Interesting Scientific Idea Generation Using Knowledge Graphs and LLMs:...
24 | DAMO-NLP-SG/M3Exam | 31 | Emerging
    Data and code for paper "M3Exam: A Multilingual, Multimodal, Multilevel...
25 | kg-bnu/SciMKG | 31 | Emerging
    Source code of AAAI 2026 paper "SciMKG: A Multimodal Knowledge Graph for...
26 | blacksnail789521/Time-Series-Reasoning-Survey | 31 | Emerging
    A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models
27 | TianHongZXY/CoRe | 31 | Emerging
    [ACL 2023] Solving Math Word Problems via Cooperative Reasoning induced...
28 | TIGER-AI-Lab/AceCoder | 31 | Emerging
    The official repo for "AceCoder: Acing Coder RL via Automated Test-Case...
29 | yifanzhang-pro/BlueMO | 30 | Emerging
    BlueMO: A Comprehensive Collection of Challenging Mathematical Olympiad...
30 | uni-medical/GMAI-MMBench | 29 | Experimental
    GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards...
31 | yubol-bobo/MT-Consistency | 29 | Experimental
    This repo investigates LLMs' tendency to exhibit acquiescence bias in...
32 | ehsk/OpenQA-eval | 27 | Experimental
    ACL 2023: Evaluating Open-Domain Question Answering in the Era of Large...
33 | CodeEval-Pro/CodeEval-Pro | 27 | Experimental
    [ACL'25 Findings] Official repo for "HumanEval Pro and MBPP Pro: Evaluating...
34 | HarryYancy/SolidGeo | 26 | Experimental
    SolidGeo: Measuring Multimodal Spatial Math Reasoning in Solid Geometry
35 | ai-for-edu/ScratchMath | 25 | Experimental
    Official Repo for Paper "Can MLLMs Read Students' Minds? Unpacking...
36 | cyzhh/MMOS | 23 | Experimental
    Mix of Minimal Optimal Sets (MMOS) of dataset has two advantages for two...
37 | LiXin97/WirelessMathLM | 23 | Experimental
    WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless...
38 | richardcsuwandi/cake | 23 | Experimental
    [NeurIPS 2025] Context-Aware Kernel Evolution (CAKE)
39 | marcusm117/DNA | 22 | Experimental
    [ICLR 2026] Divide and Abstract: Autoformalization via Decomposition and...
40 | jwallat/temporalrobustness | 22 | Experimental
    A Study Into Temporal Robustness of LLMs
41 | carlomarxdk/trilemma-of-truth | 21 | Experimental
    A research project on competing notions of truth in large language models.
42 | I-Halder/Demystifying-LLM-as-a-Judge-Analytically-Tractable-Model-for-Inference-Time-Scaling | 21 | Experimental
    Optimization of inference time sampling of large language models guided by a...
43 | mshin77/mathipy | 21 | Experimental
    mathipy: Multimodal item feature extraction for K-12 math assessment (Python...
44 | MAC-AutoML/SocialOmni | 20 | Experimental
    Benchmarking Audio-Visual Social Interactivity in Omni Models
45 | GSkuza/Generalized-Theory-of-Mathematical-Indefiniteness | 18 | Experimental
    The Generalized Theory of Mathematical Undefiniteness (GTMØ) is an...
46 | maxpeeperkorn/creativity-parameter | 18 | Experimental
    This repository contains the supplementary material / appendix to go with...
47 | robertopassaro/tales-of-2-minds | 15 | Experimental
    Evaluating Creativity in Human and Large Language Model Narratives
48 | conceptmath/conceptmath | 14 | Experimental
    [ACL 2024 Findings] The official repo for "ConceptMath: A Bilingual...
49 | polymathbenchmark/polymathbenchmark.github.io | 13 | Experimental
    A Challenging Multi-Modal Mathematical Reasoning Benchmark
50 | aauss/temporal-answer-qa | 13 | Experimental
    Time to Revisit Exact Match (Findings of EMNLP 2025)
51 | awsaf49/ai-math-olympiad | 11 | Experimental
    AI Mathematical Olympiad - Progress Prize 1 | Solve national-level math...
52 | imperialite/standardize | 11 | Experimental
    This repository contains the code, data, and website assets for the...