mims-harvard/CUREBench
CUREBench @ NeurIPS 2025: Benchmarking AI reasoning for therapeutic decision-making at scale
This project offers a starter kit for participants in the CURE-Bench biomedical AI competition. It helps researchers and AI practitioners evaluate how well their models perform on complex therapeutic decision-making tasks. You provide medical case data in JSONL format and your model's configuration, and it generates a structured CSV submission file containing your model's predictions and reasoning for evaluation.
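As a rough illustration of that flow, the sketch below reads a JSONL file of cases and writes a CSV submission. The `predict` function, the field names, and the column headers are assumptions for illustration only and may not match the actual starter kit.

```python
import csv
import json

def predict(question):
    # Hypothetical stand-in for your model; returns (answer, reasoning).
    return "Option A", "Selected based on the case's indications and contraindications."

# Read medical cases from a JSONL file (one JSON object per line).
with open("cases.jsonl", encoding="utf-8") as f:
    cases = [json.loads(line) for line in f if line.strip()]

# Write a structured CSV submission with predictions and reasoning.
# Column names here are illustrative, not the official submission format.
with open("submission.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "prediction", "reasoning"])
    for case in cases:
        answer, reasoning = predict(case["question"])
        writer.writerow([case["id"], answer, reasoning])
```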
Use this if you are participating in the CURE-Bench competition and need a straightforward way to generate and submit your AI model's predictions for therapeutic reasoning tasks.
Not ideal if you are looking for a general-purpose AI model evaluation framework outside the specific context and data format of the CURE-Bench competition.
Stars: 129
Forks: 31
Language: Python
License: —
Category:
Last pushed: Dec 06, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/mims-harvard/CUREBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
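The same request from Python, assuming only that the endpoint returns JSON; the response schema is not documented on this page.

```python
import requests

# Fetch quality data for mims-harvard/CUREBench from the public endpoint.
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/mims-harvard/CUREBench"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
print(resp.json())  # the JSON schema isn't documented here, so just print it
```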
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems