Math Reasoning Datasets LLM Tools

Datasets, benchmarks, and training resources specifically for mathematical reasoning tasks in LLMs, including word problems, visual math, problem generation, and mathematical text curation. Does NOT include general math tutoring platforms, creativity evaluation, or non-mathematical reasoning benchmarks.

There are 52 math reasoning dataset tools tracked. One scores above 50 (Established tier). The highest-rated is MMMU-Benchmark/MMMU at 52/100 with 548 stars.
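The tier labels in this listing appear to follow score bands. The sketch below encodes thresholds inferred only from the rows on this page (52 is Established, 30 to 47 are Emerging, 29 and below are Experimental); they are assumptions, not a documented scoring rule. The function name `tier_for_score` is hypothetical, and tiers not represented on this page may exist.

```python
def tier_for_score(score: int) -> str:
    """Map a 0-100 quality score to a tier label.

    Thresholds are inferred from the listing, not taken from official docs:
    the page states that scores above 50 are "established", the lowest
    Emerging row scores 30, and the highest Experimental row scores 29.
    """
    if score > 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"
```

For example, `tier_for_score(52)` gives the label shown for MMMU-Benchmark/MMMU. How a score of exactly 50 is treated is not visible in the data, so the boundary here is a guess.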

Get the projects as JSON (the example request below sets limit=20)

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=math-reasoning-datasets&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
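The same request can be issued from Python using only the standard library. This is a minimal sketch: the endpoint and the `domain`, `subcategory`, and `limit` parameters come from the curl command above, while the helper names `build_query_url` and `fetch_projects` are hypothetical. The shape of the JSON response is not documented on this page, so the code returns the decoded payload without assuming any field names. Whether a larger `limit` returns all 52 entries is also not confirmed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_query_url(domain="llm-tools",
                    subcategory="math-reasoning-datasets",
                    limit=20):
    """Build the query URL with the same parameters as the curl example."""
    params = {"domain": domain, "subcategory": subcategory, "limit": limit}
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_projects(url, timeout=10.0):
    """Fetch the URL and decode the body as JSON.

    The response schema is an unknown here, so the decoded object is
    returned as-is for the caller to inspect.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects(build_query_url())` performs the network request; unauthenticated use counts against the 100 requests/day allowance mentioned above.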

#	Tool	Score	Tier	Stars	Language
1	MMMU-Benchmark/MMMU This repo contains evaluation code for the paper "MMMU: A Massive...	52	Established	548	Python
2	pat-jj/DeepRetrieval [COLM’25] DeepRetrieval — 🔥 Training Search Agent by RLVR with Retrieval Outcome	47	Emerging	696	Python
3	lupantech/MathVista MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts	47	Emerging	355	Jupyter Notebook
4	x66ccff/liveideabench [Nature Communications] 🤖💡 LiveIdeaBench: Evaluating LLMs' Scientific...	46	Emerging	23	Jupyter Notebook
5	ise-uiuc/magicoder [ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct	45	Emerging	2,086	Python
6	sherryzyh/physical_reasoning_toolkit A Python toolkit for physical reasoning in LLMs and VLMs. This toolkit...	43	Emerging	3	Python
7	iiis-ai/AutoMathText-V2 AutoMathText-V2: A 2.46 Trillion Token AI-Curated STEM Pretraining Dataset	42	Emerging	6	HTML
8	gsarti/verbalized-rebus Materials for "Non Verbis, Sed Rebus: Large Language Models are Weak Solvers...	41	Emerging	4	Jupyter Notebook
9	IAAR-Shanghai/xVerify xVerify: Efficient Answer Verifier for Reasoning Model Evaluations	40	Emerging	144	Jupyter Notebook
10	SuperBruceJia/Awesome-LLM-Self-Consistency Awesome LLM Self-Consistency: a curated list of Self-consistency in Large...	39	Emerging	120	—
11	microsoft/repoclassbench [ICML DMLR 2024] Repo that contains code for the paper titled: "Class-Level...	39	Emerging	17	Python
12	JunyiYe/CreativeMath [AAAI 2025] Assessing the Creativity of LLMs in Proposing Novel Solutions to...	38	Emerging	13	Jupyter Notebook
13	rxlqn/awesome-llm-self-reflection augmented LLM with self reflection	37	Emerging	139	—
14	GAIR-NLP/MathPile [NeurIPS D&B 2024] Generative AI for Math: MathPile	37	Emerging	419	Python
15	intuit-ai-research/DCR-consistency DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and...	36	Emerging	25	Python
16	killthefullmoon/PhyX PhyX: Does Your Model Have the "Wits" for Physical Reasoning?	36	Emerging	52	Python
17	yecchen/MIRAI Code and Data for "MIRAI: Evaluating LLM Agents for Event Forecasting"	36	Emerging	90	Python
18	EngineeringSoftware/codeditor Multilingual Code Co-Evolution Using Large Language Models	35	Emerging	13	Python
19	lt-asset/REPOCOD For our ACL25 Paper: Can Language Models Replace Programmers? RepoCod Says...	35	Emerging	26	Python
20	bigai-nlco/LooGLE ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models	34	Emerging	195	Python
21	thehsansaeed/Questions-for-AI-Model-Testing This repository contains a curated set of logical, mathematical, and...	33	Emerging	8	—
22	zjunlp/ReCode [AAAI 2026] ReCode: Reinforced Code Knowledge Editing for API Updates	33	Emerging	24	Python
23	artificial-scientist-lab/SciMuse Interesting Scientific Idea Generation Using Knowledge Graphs and LLMs:...	32	Emerging	32	Python
24	DAMO-NLP-SG/M3Exam Data and code for paper "M3Exam: A Multilingual, Multimodal, Multilevel...	31	Emerging	103	Python
25	kg-bnu/SciMKG Source code of AAAI 2026 paper "SciMKG: A Multimodal Knowledge Graph for...	31	Emerging	3	Python
26	blacksnail789521/Time-Series-Reasoning-Survey A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models	31	Emerging	38	—
27	TianHongZXY/CoRe [ACL 2023] Solving Math Word Problems via Cooperative Reasoning induced...	31	Emerging	50	Python
28	TIGER-AI-Lab/AceCoder The official repo for "AceCoder: Acing Coder RL via Automated Test-Case...	31	Emerging	99	Python
29	yifanzhang-pro/BlueMO BlueMO: A Comprehensive Collection of Challenging Mathematical Olympiad...	30	Emerging	5	HTML
30	uni-medical/GMAI-MMBench GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards...	29	Experimental	82	—
31	yubol-bobo/MT-Consistency This repo investigates LLMs' tendency to exhibit acquiescence bias in...	29	Experimental	49	Python
32	ehsk/OpenQA-eval ACL 2023: Evaluating Open-Domain Question Answering in the Era of Large...	27	Experimental	47	Python
33	CodeEval-Pro/CodeEval-Pro [ACL'25 Findings] Official repo for "HumanEval Pro and MBPP Pro: Evaluating...	27	Experimental	37	Python
34	HarryYancy/SolidGeo SolidGeo: Measuring Multimodal Spatial Math Reasoning in Solid Geometry	26	Experimental	9	Python
35	ai-for-edu/ScratchMath Official Repo for Paper "Can MLLMs Read Students' Minds? Unpacking...	25	Experimental	3	Python
36	cyzhh/MMOS Mix of Minimal Optimal Sets (MMOS) of dataset has two advantages for two...	23	Experimental	74	Python
37	LiXin97/WirelessMathLM WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless...	23	Experimental	2	HTML
38	richardcsuwandi/cake [NeurIPS 2025] Context-Aware Kernel Evolution (CAKE)	23	Experimental	21	Python
39	marcusm117/DNA [ICLR 2026] Divide and Abstract: Autoformalization via Decomposition and...	22	Experimental	—	Python
40	jwallat/temporalrobustness A Study Into Temporal Robustness of LLMs	22	Experimental	2	Jupyter Notebook
41	carlomarxdk/trilemma-of-truth A research project on competing notions of truth in large language models.	21	Experimental	—	Python
42	I-Halder/Demystifying-LLM-as-a-Judge-Analytically-Tractable-Model-for-Inference-Time-Scaling Optimization of inference time sampling of large language models guided by a...	21	Experimental	—	Python
43	mshin77/mathipy mathipy: Multimodal item feature extraction for K-12 math assessment (Python...	21	Experimental	—	Python
44	MAC-AutoML/SocialOmni Benchmarking Audio-Visual Social Interactivity in Omni Models	20	Experimental	17	Python
45	GSkuza/Generalized-Theory-of-Mathematical-Indefiniteness The Generalized Theory of Mathematical Undefiniteness (GTMØ) is an...	18	Experimental	1	Python
46	maxpeeperkorn/creativity-parameter This repository contains the supplementary material / appendix to go with...	18	Experimental	2	Jupyter Notebook
47	robertopassaro/tales-of-2-minds Evaluating Creativity in Human and Large Language Model Narratives	15	Experimental	—	Jupyter Notebook
48	conceptmath/conceptmath [ACL 2024 Findings] The official repo for "ConceptMath: A Bilingual...	14	Experimental	24	Python
49	polymathbenchmark/polymathbenchmark.github.io A Challenging Multi-Modal Mathematical Reasoning Benchmark	13	Experimental	—	JavaScript
50	aauss/temporal-answer-qa Time to Revisit Exact Match (Findings of EMNLP 2025)	13	Experimental	—	Python
51	awsaf49/ai-math-olympiad AI Mathematical Olympiad - Progress Prize 1 | Solve national-level math...	11	Experimental	—	Jupyter Notebook
52	imperialite/standardize This repository contains the code, data, and website assets for the...	11	Experimental	—	JavaScript
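For readers who copy the table above as text, the following sketch parses one row into typed fields. It assumes the columns are tab-separated in the order shown in the header (#, Tool, Score, Tier, Stars, Language), that the Tool cell starts with `owner/repo` followed by a space and the description, and that a dash marks a missing value. The `Project` class and `parse_row` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MISSING = "\u2014"  # the dash character the table uses for "no value"


@dataclass
class Project:
    rank: int
    repo: str
    description: str
    score: int
    tier: str
    stars: Optional[int]
    language: Optional[str]


def parse_row(line: str) -> Project:
    """Parse one tab-separated table row into a Project record."""
    rank, tool, score, tier, stars, language = line.rstrip("\n").split("\t")
    # The Tool cell fuses the repository slug and its description.
    repo, _, description = tool.partition(" ")
    return Project(
        rank=int(rank),
        repo=repo,
        description=description,
        score=int(score),
        tier=tier,
        # Star counts may contain thousands separators, e.g. "2,086".
        stars=None if stars == MISSING else int(stars.replace(",", "")),
        language=None if language == MISSING else language,
    )


row = ("5\tise-uiuc/magicoder [ICML'24] Magicoder: Empowering Code "
       "Generation with OSS-Instruct\t45\tEmerging\t2,086\tPython")
project = parse_row(row)
```

Descriptions that end in "..." are truncated in the listing itself, so the parser keeps them as they are.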