mlbio-epfl/HeurekaBench
[ICLR 2026] A framework to create benchmarks and evaluate AI co-scientists in experimental, data-driven, real-world scientific research.
HeurekaBench is a framework designed for scientific researchers to create and use benchmarks that evaluate how well AI co-scientists can perform data-driven research tasks. It takes scientific studies and their associated code, processes them using AI, and generates challenging questions and validated answers. Researchers can then use these benchmarks to test and improve their own AI agents in specific domains, such as single-cell biology.
Use this if you are a scientific researcher developing or evaluating AI agents that act as 'co-scientists' and need a robust, domain-specific benchmark to test their ability to analyze experimental data and answer open-ended research questions.
Not ideal if you are looking for a tool to perform scientific data analysis directly, as this framework is focused on evaluating AI agents rather than being a primary analysis tool itself.
Stars: 10
Forks: 1
Language: Python
License: —
Category: —
Last pushed: Feb 16, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/mlbio-epfl/HeurekaBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
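For programmatic use, a minimal Python sketch of the same request is shown below. The response schema is not documented on this page, so the example only pretty-prints whatever JSON the endpoint returns; the requests dependency and the error handling are assumptions, not part of the API description above.

import json
import requests

# Public quality-data endpoint from the curl example above (100 requests/day without a key).
URL = "https://pt-edge.onrender.com/api/v1/quality/agents/mlbio-epfl/HeurekaBench"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # fail loudly on 4xx/5xx responses

# Field names are not documented here, so just pretty-print the payload as returned.
data = resp.json()
print(json.dumps(data, indent=2))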
Higher-rated alternatives
StonyBrookNLP/appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and...
qualifire-dev/rogue
AI Agent Evaluator & Red Team Platform
microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...
future-agi/ai-evaluation
Evaluation Framework for all your AI related Workflows
agentscope-ai/OpenJudge
OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards