mims-harvard/CUREBench
CUREBench @ NeurIPS 2025: Benchmarking AI reasoning for therapeutic decision-making at scale
This project offers a starter kit for participants in the CURE-Bench biomedical AI competition. It helps researchers and AI practitioners evaluate how well their models perform on complex therapeutic decision-making tasks. You provide medical case data in JSONL format and your model's configuration, and it generates a structured CSV submission file containing your model's predictions and reasoning for evaluation.
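As a rough illustration of that flow, the sketch below reads a JSONL file of cases and writes a CSV submission. The `predict` function, the field names, and the column headers are assumptions for illustration only and may not match the actual starter kit.

```python
import csv
import json

def predict(question):
    # Hypothetical stand-in for your model; returns (answer, reasoning).
    return "Option A", "Selected based on the case's indications and contraindications."

# Read medical cases from a JSONL file (one JSON object per line).
with open("cases.jsonl", encoding="utf-8") as f:
    cases = [json.loads(line) for line in f if line.strip()]

# Write a structured CSV submission with predictions and reasoning.
# Column names here are illustrative, not the official submission format.
with open("submission.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "prediction", "reasoning"])
    for case in cases:
        answer, reasoning = predict(case["question"])
        writer.writerow([case["id"], answer, reasoning])
```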
Use this if you are participating in the CURE-Bench competition and need a straightforward way to generate and submit your AI model's predictions for therapeutic reasoning tasks.
Not ideal if you are looking for a general-purpose AI model evaluation framework outside the specific context and data format of the CURE-Bench competition.
Stars: 129
Forks: 31
Language: Python
License: —
Category:
Last pushed: Dec 06, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/mims-harvard/CUREBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
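The same request from Python, assuming only that the endpoint returns JSON; the response schema is not documented on this page.

```python
import requests

# Fetch quality data for mims-harvard/CUREBench from the public endpoint.
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/mims-harvard/CUREBench"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
print(resp.json())  # the JSON schema isn't documented here, so just print it
```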
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems