conceptmath/conceptmath
[ACL 2024 Findings] The official repo for "ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models".
This project helps AI researchers and developers systematically evaluate the mathematical reasoning abilities of large language models (LLMs). You supply a set of math problems and the LLM's responses, and it returns a detailed breakdown of accuracy across individual mathematical concepts, in both English and Chinese. It is ideal for anyone building or comparing LLMs for tasks that require robust mathematical understanding.
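The core output is a per-concept accuracy table. As a rough illustration of that idea only (this is not ConceptMath's actual code; the "concept" and "correct" fields are hypothetical, and the sketch assumes responses have already been graded), the aggregation might look like:

from collections import defaultdict

def concept_accuracy(results):
    # results: list of graded responses, each a dict with
    # hypothetical fields "concept" (str) and "correct" (bool).
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["concept"]] += 1
        hits[r["concept"]] += int(r["correct"])
    # Per-concept accuracy: fraction of correct answers per concept.
    return {c: hits[c] / totals[c] for c in totals}

results = [
    {"concept": "fractions", "correct": True},
    {"concept": "fractions", "correct": False},
    {"concept": "geometry", "correct": True},
]
print(concept_accuracy(results))  # {'fractions': 0.5, 'geometry': 1.0}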
No commits in the last 6 months.
Use this if you need to understand not just whether an LLM gets a math problem right, but which specific mathematical concepts it struggles with.
Not ideal if you're looking for a general-purpose LLM evaluation framework beyond mathematical reasoning, or for a tool to perform everyday calculations.
Stars: 24
Forks: —
Language: Python
License: —
Category: —
Last pushed: May 29, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/conceptmath/conceptmath"
Open to everyone: 100 requests/day with no key needed, or get a free key for 1,000 requests/day.
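For scripted access, a minimal Python sketch using only the standard library; the endpoint URL is taken from the curl command above, but the response schema is not documented here, so this just fetches and prints the parsed JSON (keyed authentication is omitted because the header format isn't specified on this page):

import json
import urllib.request

URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/conceptmath/conceptmath"

# Unkeyed access: limited to 100 requests/day per the note above.
with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)
print(data)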
Higher-rated alternatives
MMMU-Benchmark/MMMU
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal...
pat-jj/DeepRetrieval
[COLM’25] DeepRetrieval — 🔥 Training Search Agent by RLVR with Retrieval Outcome
lupantech/MathVista
MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts
x66ccff/liveideabench
[Nature Communications] 🤖💡 LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea...
ise-uiuc/magicoder
[ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct