Math Reasoning Datasets LLM Tools
Datasets, benchmarks, and training resources specifically for mathematical reasoning tasks in LLMs, including word problems, visual math, problem generation, and mathematical text curation. Does NOT include general math tutoring platforms, creativity evaluation, or non-mathematical reasoning benchmarks.
52 math reasoning dataset tools are tracked. One scores above 50 (the Established tier). The highest-rated is MMMU-Benchmark/MMMU at 52/100 with 548 stars.
Get all 52 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=math-reasoning-datasets&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
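The same request can be made from a script. The sketch below is a minimal Python version of the curl call above, using only the standard library. The endpoint and the `domain`, `subcategory`, and `limit` parameters come from that call; the helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the returned JSON is not documented here, so the code returns the decoded payload without assuming any field names. Whether `limit` can be raised to return all 52 projects in one request is also an assumption to check against the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="llm-tools", subcategory="math-reasoning-datasets", limit=20):
    """Build the request URL with the same query parameters as the curl call."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and decode the body as JSON.

    The response schema is an unknown here, so the decoded object is
    returned as-is for the caller to inspect.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example usage (performs a network request, so it is left commented out):
# payload = fetch_projects(build_url())
# print(json.dumps(payload, indent=2)[:2000])
```

Unauthenticated use counts against the 100 requests/day allowance, so it may be worth caching the payload locally rather than refetching it on every run.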
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | MMMU-Benchmark/MMMU: This repo contains evaluation code for the paper "MMMU: A Massive... | 52 | Established |
| 2 | pat-jj/DeepRetrieval: [COLM’25] DeepRetrieval — 🔥 Training Search Agent by RLVR with Retrieval Outcome | | Emerging |
| 3 | lupantech/MathVista: MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts | | Emerging |
| 4 | x66ccff/liveideabench: [𝐍𝐚𝐭𝐮𝐫𝐞 𝐂𝐨𝐦𝐦𝐮𝐧𝐢𝐜𝐚𝐭𝐢𝐨𝐧𝐬] 🤖💡 LiveIdeaBench: Evaluating LLMs' Scientific... | | Emerging |
| 5 | ise-uiuc/magicoder: [ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct | | Emerging |
| 6 | sherryzyh/physical_reasoning_toolkit: A Python toolkit for physical reasoning in LLMs and VLMs. This toolkit... | | Emerging |
| 7 | iiis-ai/AutoMathText-V2: AutoMathText-V2: A 2.46 Trillion Token AI-Curated STEM Pretraining Dataset | | Emerging |
| 8 | gsarti/verbalized-rebus: Materials for "Non Verbis, Sed Rebus: Large Language Models are Weak Solvers... | | Emerging |
| 9 | IAAR-Shanghai/xVerify: xVerify: Efficient Answer Verifier for Reasoning Model Evaluations | | Emerging |
| 10 | SuperBruceJia/Awesome-LLM-Self-Consistency: Awesome LLM Self-Consistency: a curated list of Self-consistency in Large... | | Emerging |
| 11 | microsoft/repoclassbench: [ICML DMLR 2024] Repo that contains code for the paper titled: "Class-Level... | | Emerging |
| 12 | JunyiYe/CreativeMath: [AAAI 2025] Assessing the Creativity of LLMs in Proposing Novel Solutions to... | | Emerging |
| 13 | rxlqn/awesome-llm-self-reflection: augmented LLM with self reflection | | Emerging |
| 14 | GAIR-NLP/MathPile: [NeurIPS D&B 2024] Generative AI for Math: MathPile | | Emerging |
| 15 | intuit-ai-research/DCR-consistency: DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and... | | Emerging |
| 16 | killthefullmoon/PhyX: PhyX: Does Your Model Have the "Wits" for Physical Reasoning? | | Emerging |
| 17 | yecchen/MIRAI: Code and Data for "MIRAI: Evaluating LLM Agents for Event Forecasting" | | Emerging |
| 18 | EngineeringSoftware/codeditor: Multilingual Code Co-Evolution Using Large Language Models | | Emerging |
| 19 | lt-asset/REPOCOD: For our ACL25 Paper: Can Language Models Replace Programmers? RepoCod Says... | | Emerging |
| 20 | bigai-nlco/LooGLE: ACL 2024 \| LooGLE: Long Context Evaluation for Long-Context Language Models | | Emerging |
| 21 | thehsansaeed/Questions-for-AI-Model-Testing: This repository contains a curated set of logical, mathematical, and... | | Emerging |
| 22 | zjunlp/ReCode: [AAAI 2026] ReCode: Reinforced Code Knowledge Editing for API Updates | | Emerging |
| 23 | artificial-scientist-lab/SciMuse: Interesting Scientific Idea Generation Using Knowledge Graphs and LLMs:... | | Emerging |
| 24 | DAMO-NLP-SG/M3Exam: Data and code for paper "M3Exam: A Multilingual, Multimodal, Multilevel... | | Emerging |
| 25 | kg-bnu/SciMKG: Source code of AAAI 2026 paper "SciMKG: A Multimodal Knowledge Graph for... | | Emerging |
| 26 | blacksnail789521/Time-Series-Reasoning-Survey: A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models | | Emerging |
| 27 | TianHongZXY/CoRe: [ACL 2023] Solving Math Word Problems via Cooperative Reasoning induced... | | Emerging |
| 28 | TIGER-AI-Lab/AceCoder: The official repo for "AceCoder: Acing Coder RL via Automated Test-Case... | | Emerging |
| 29 | yifanzhang-pro/BlueMO: BlueMO: A Comprehensive Collection of Challenging Mathematical Olympiad... | | Emerging |
| 30 | uni-medical/GMAI-MMBench: GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards... | | Experimental |
| 31 | yubol-bobo/MT-Consistency: This repo investigates LLMs' tendency to exhibit acquiescence bias in... | | Experimental |
| 32 | ehsk/OpenQA-eval: ACL 2023: Evaluating Open-Domain Question Answering in the Era of Large... | | Experimental |
| 33 | CodeEval-Pro/CodeEval-Pro: [ACL'25 Findings] Official repo for "HumanEval Pro and MBPP Pro: Evaluating... | | Experimental |
| 34 | HarryYancy/SolidGeo: SolidGeo: Measuring Multimodal Spatial Math Reasoning in Solid Geometry | | Experimental |
| 35 | ai-for-edu/ScratchMath: Official Repo for Paper "Can MLLMs Read Students' Minds? Unpacking... | | Experimental |
| 36 | cyzhh/MMOS: Mix of Minimal Optimal Sets (MMOS) of dataset has two advantages for two... | | Experimental |
| 37 | LiXin97/WirelessMathLM: WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless... | | Experimental |
| 38 | richardcsuwandi/cake: [NeurIPS 2025] Context-Aware Kernel Evolution (CAKE) | | Experimental |
| 39 | marcusm117/DNA: [ICLR 2026] Divide and Abstract: Autoformalization via Decomposition and... | | Experimental |
| 40 | jwallat/temporalrobustness: A Study Into Temporal Robustness of LLMs | | Experimental |
| 41 | carlomarxdk/trilemma-of-truth: A research project on competing notions of truth in large language models. | | Experimental |
| 42 | I-Halder/Demystifying-LLM-as-a-Judge-Analytically-Tractable-Model-for-Inference-Time-Scaling: Optimization of inference time sampling of large language models guided by a... | | Experimental |
| 43 | mshin77/mathipy: mathipy: Multimodal item feature extraction for K-12 math assessment (Python... | | Experimental |
| 44 | MAC-AutoML/SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models | | Experimental |
| 45 | GSkuza/Generalized-Theory-of-Mathematical-Indefiniteness: The Generalized Theory of Mathematical Undefiniteness (GTMØ) is an... | | Experimental |
| 46 | maxpeeperkorn/creativity-parameter: This repository contains the supplementary material / appendix to go with... | | Experimental |
| 47 | robertopassaro/tales-of-2-minds: Evaluating Creativity in Human and Large Language Model Narratives | | Experimental |
| 48 | conceptmath/conceptmath: [ACL 2024 Findings] The official repo for "ConceptMath: A Bilingual... | | Experimental |
| 49 | polymathbenchmark/polymathbenchmark.github.io: A Challenging Multi-Modal Mathematical Reasoning Benchmark | | Experimental |
| 50 | aauss/temporal-answer-qa: Time to Revisit Exact Match (Findings of EMNLP 2025) | | Experimental |
| 51 | awsaf49/ai-math-olympiad: AI Mathematical Olympiad - Progress Prize 1 \| Solve national-level math... | | Experimental |
| 52 | imperialite/standardize: This repository contains the code, data, and website assets for the... | | Experimental |