Math Reasoning Datasets Transformer Models

There are 44 math reasoning dataset projects tracked. One scores above 70 (Verified tier). The highest-rated is ExtensityAI/symbolicai at 71/100 with 1,677 stars. One of the top 10 is actively maintained.

Get all 44 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=math-reasoning-datasets&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
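
The same request can be made from a script. The following is a minimal Python sketch that uses only the standard library and assumes the endpoint and query parameters shown in the curl command above. The helper names `build_url` and `fetch` are hypothetical, the response schema is not documented on this page, and the `limit` value is copied from the curl command; whether a larger value returns all 44 projects in one call is not stated here.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl command on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="transformers", subcategory="math-reasoning-datasets", limit=20):
    """Build the request URL with the same query parameters as the curl command."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch(url, timeout=30):
    """Send a GET request and decode the body as JSON.

    The structure of the returned JSON is an assumption left to the caller;
    inspect it before relying on specific keys.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example usage (performs a network request and counts against the daily limit):
#   data = fetch(build_url())
#   print(json.dumps(data, indent=2)[:500])
```

Keeping URL construction separate from the network call makes it easy to change `limit` or `subcategory` without touching the request logic.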

#	Model	Score	Tier	Stars	Language
1	ExtensityAI/symbolicai A neurosymbolic perspective on LLMs	71	Verified	1,677	Python
2	TIGER-AI-Lab/MMLU-Pro The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task...	56	Established	347	Python
3	deep-symbolic-mathematics/LLM-SR [ICLR 2025 Oral] This is the official repo for the paper "LLM-SR" on...	49	Emerging	216	Python
4	microsoft/interwhen A framework for verifiable reasoning with language models.	44	Emerging	13	Python
5	zhudotexe/fanoutqa Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering...	44	Emerging	59	Python
6	xlang-ai/Binder [ICLR 2023] Code for the paper "Binding Language Models in Symbolic Languages"	43	Emerging	325	Python
7	HiThink-Research/MME-Finance [MM 2025] A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning	40	Emerging	44	Python
8	yifanzhang-pro/AutoMathText [ACL 2025 Findings] Autonomous Data Selection with Zero-shot Generative...	39	Emerging	90	Python
9	DAMO-NLP-SG/LLM-Multilingual-Knowledge-Boundaries [ACL 2025] Analyzing LLMs' Multilingual Knowledge Boundary Cognition Across...	37	Emerging	18	Jupyter Notebook
10	AlphaPav/mem-kk-logic On Memorization of Large Language Models in Logical Reasoning	37	Emerging	76	Python
11	TIGER-AI-Lab/StructLM Code and data for "StructLM: Towards Building Generalist Models for...	37	Emerging	76	Python
12	princeton-pli/AdaptMI [COLM 2025] Adaptive Skill-based In-context Math Instruction for Small...	37	Emerging	9	Python
13	TIGER-AI-Lab/LongICLBench Code and Data for "Long-context LLMs Struggle with Long In-context Learning"...	36	Emerging	112	Python
14	declare-lab/LLM-PuzzleTest This repository is maintained to release dataset and models for multimodal...	36	Emerging	113	Python
15	TIGER-AI-Lab/MAmmoTH Code and data for "MAmmoTH: Building Math Generalist Models through Hybrid...	36	Emerging	383	Jupyter Notebook
16	SeekingDream/DyCodeEval Official repository of the ICML2025 paper “Dynamic Benchmarking of Reasoning...	36	Emerging	255	Python
17	amazon-science/recode Releasing code for "ReCode: Robustness Evaluation of Code Generation Models"	35	Emerging	58	Python
18	akjindal53244/Arithmo Small and Efficient Mathematical Reasoning LLMs	35	Emerging	73	Python
19	google/curie Code release for "CURIE: Evaluating LLMs On Multitask Scientific Long...	34	Emerging	29	Jupyter Notebook
20	martin-wey/CodeUltraFeedback CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)	33	Emerging	73	Python
21	ryokamoi/llm-self-correction-papers List of papers on Self-Correction of LLMs.	31	Emerging	80	—
22	surrey-nlp/LLM4MT_eval This repository is for our paper "What do large language model need for...	31	Emerging	4	Python
23	QwenLM/PolyMath [NeurIPS 2025 D&B Track] Evaluation Code Repo for Paper "PolyMath:...	31	Emerging	42	Python
24	conditionWang/FLNK Federated Learning with New Knowledge -- explore to incorporate various new...	30	Emerging	86	—
25	reasoning-machines/CoCoGen Language Models of Code are Few-Shot Commonsense Learners (EMNLP 2022)	30	Emerging	86	Python
26	bobxwu/learning-from-rewards-llm-papers A comprehensive collection of learning from rewards in the post-training and...	30	Emerging	64	—
27	neuro-symbolic-ai/explanation_based_ethical_reasoning Code and data for Paper "Enhancing Ethical Explanations of Large Language...	30	Emerging	6	Python
28	gersteinlab/Struc-Bench [NAACL 2024] Struc-Bench: Are Large Language Models Good at Generating...	30	Emerging	55	Python
29	zjunlp/DynamicKnowledgeCircuits [ACL 2025] How Do LLMs Acquire New Knowledge? A Knowledge Circuits...	29	Experimental	47	Jupyter Notebook
30	kaistAI/LangBridge [ACL 2024] LangBridge: Multilingual Reasoning Without Multilingual Supervision	28	Experimental	96	Python
31	WooooDyy/MathCritique Implementation for the research paper "Enhancing LLM Reasoning via Critique...	27	Experimental	55	Python
32	merlerm/In-Context-Symbolic-Regression Official code implementation for the ACL 2024 Student Research Workshop...	27	Experimental	17	Python
33	YangLing0818/SuperCorrect-llm [ICLR 2025] SuperCorrect: Advancing Small LLM Reasoning with Thought...	27	Experimental	87	Python
34	joeljang/continual-knowledge-learning [ICLR 2022] Towards Continual Knowledge Learning of Language Models	27	Experimental	91	Python
35	UCSC-VLAA/vllm-safety-benchmark [ECCV 2024] Official PyTorch Implementation of "How Many Unicorns Are in...	25	Experimental	87	Python
36	MMStar-Benchmark/MMStar [NeurIPS 2024] This repo contains evaluation code for the paper "Are We on...	24	Experimental	204	Python
37	TIGER-AI-Lab/TableCoT The code and data for paper "Large Language Models are few(1)-shot Table...	23	Experimental	48	Python
38	iiis-ai/IterativeQuestionComposing [AAAI 2025] Augmenting Math Word Problems via Iterative Question Composing...	23	Experimental	23	Python
39	Eleanor-H/MUSTARD Code & data for ICLR 2024 spotlight paper: 🍯MUSTARD: Mastering Uniform...	21	Experimental	42	C++
40	yahskapar/LLMs-and-Probabilistic-Reasoning Data and software artifacts for the EMNLP 2024 (Main) paper "What Are the...	20	Experimental	5	Jupyter Notebook
41	yashmahe2020/math-tutor-research Research on Large Language Model capabilities in mathematics tutoring and...	19	Experimental	1	Jupyter Notebook
42	Liz-Atlas/last_frame_whitepaper A Modular Knowledge Transfer System for Large Language Models	17	Experimental	—	—
43	kreasof-ai/self-perturbation-learning Imagine "2 truth and a lie", but formalized as ML training objective	17	Experimental	1	Jupyter Notebook
44	Shengyu-Feng/TSMC4MATH [ICLR2025] Step-by-Step Reasoning for Math Problems via Twisted Sequential...	13	Experimental	5	Python