conceptmath/conceptmath
[ACL 2024 Findings] The official repo for "ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models".
This project helps AI researchers and developers systematically evaluate the mathematical reasoning abilities of large language models (LLMs). You supply a set of math problems and the LLM's responses, and it returns a detailed breakdown of accuracy across individual mathematical concepts, in both English and Chinese. It is ideal for anyone building or comparing LLMs for tasks that require robust mathematical understanding.
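The core output is a per-concept accuracy table. As a rough illustration of that idea only (this is not ConceptMath's actual code; the "concept" and "correct" fields are hypothetical, and the sketch assumes responses have already been graded), the aggregation might look like:

from collections import defaultdict

def concept_accuracy(results):
    # results: list of graded responses, each a dict with
    # hypothetical fields "concept" (str) and "correct" (bool).
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["concept"]] += 1
        hits[r["concept"]] += int(r["correct"])
    # Per-concept accuracy: fraction of correct answers per concept.
    return {c: hits[c] / totals[c] for c in totals}

results = [
    {"concept": "fractions", "correct": True},
    {"concept": "fractions", "correct": False},
    {"concept": "geometry", "correct": True},
]
print(concept_accuracy(results))  # {'fractions': 0.5, 'geometry': 1.0}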
No commits in the last 6 months.
Use this if you need to understand not just whether an LLM gets a math problem right, but which specific mathematical concepts it struggles with.
Not ideal if you're looking for a general-purpose LLM evaluation framework beyond mathematical reasoning, or for a tool to perform everyday calculations.
Stars: 24
Forks: —
Language: Python
License: —
Category: —
Last pushed: May 29, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/conceptmath/conceptmath"
Open to everyone: 100 requests/day with no key needed, or get a free key for 1,000 requests/day.
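For scripted access, a minimal Python sketch using only the standard library; the endpoint URL is taken from the curl command above, but the response schema is not documented here, so this just fetches and prints the parsed JSON (keyed authentication is omitted because the header format isn't specified on this page):

import json
import urllib.request

URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/conceptmath/conceptmath"

# Unkeyed access: limited to 100 requests/day per the note above.
with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)
print(data)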
Higher-rated alternatives
MMMU-Benchmark/MMMU
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal...
pat-jj/DeepRetrieval
[COLM’25] DeepRetrieval — 🔥 Training Search Agent by RLVR with Retrieval Outcome
lupantech/MathVista
MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts
x66ccff/liveideabench
[Nature Communications] 🤖💡 LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea...
ise-uiuc/magicoder
[ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct