OSU-NLP-Group/ScienceAgentBench
[ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
This project helps scientific researchers and data scientists assess how well AI language agents can automate specific tasks within data-driven scientific discovery workflows. Each task is given to an agent as a natural-language description, and the agent is expected to return a self-contained Python program. The benchmark then rigorously evaluates the generated code on correctness, execution results, and cost. It is designed for researchers developing or using AI agents in scientific domains.
Use this if you are a researcher or data scientist who needs to objectively evaluate how well an AI language agent can perform individual coding tasks within scientific research, before relying on it for end-to-end automation.
Not ideal if you are looking for an AI agent to directly perform end-to-end scientific discovery without needing to evaluate its intermediate coding capabilities, or if your tasks do not involve Python programming.
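To make this concrete, here is a purely illustrative sketch of what a single task and the agent's expected output might look like. The field names, file paths, and program below are hypothetical; they are not taken from ScienceAgentBench's actual task files, whose real schema is documented in the repository.

# Hypothetical example only: field names and paths are invented for illustration,
# not taken from ScienceAgentBench's actual task format.
task = {
    "task_description": (
        "Fit a linear regression to the provided gene-expression dataset "
        "and write the R^2 score to output/r2_score.txt."
    ),
    "dataset_path": "data/expression.csv",      # hypothetical input file
    "expected_output": "output/r2_score.txt",   # hypothetical artifact to check
}

# The agent is expected to return a self-contained Python program, e.g.:
agent_program = """
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv("data/expression.csv")
X, y = df.drop(columns=["target"]), df["target"]
model = LinearRegression().fit(X, y)
with open("output/r2_score.txt", "w") as f:
    f.write(str(r2_score(y, model.predict(X))))
"""

# The benchmark then executes the generated program and scores it on output
# correctness, successful execution, and cost.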
Stars: 128
Forks: 17
Language: Python
License: MIT
Last pushed: Mar 05, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/OSU-NLP-Group/ScienceAgentBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
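If you prefer to query the endpoint from Python rather than curl, a minimal sketch along these lines should work. The response schema and the header used to pass an API key are not documented on this page, so the field names and the X-API-Key header below are assumptions.

import requests

URL = ("https://pt-edge.onrender.com/api/v1/quality/agents/"
       "OSU-NLP-Group/ScienceAgentBench")

# Anonymous access is limited to 100 requests/day; a free key raises this to 1,000/day.
# ASSUMPTION: the API key header name is a guess, not documented on this page.
headers = {}  # e.g. {"X-API-Key": "your-key"}

resp = requests.get(URL, headers=headers, timeout=10)
resp.raise_for_status()
data = resp.json()

# ASSUMPTION: field names are guesses based on the stats shown above, not a documented schema.
print(data.get("stars"), data.get("forks"), data.get("last_pushed"))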
Related agents
Tongyi-MAI/MobileWorld
Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments
ml-dev-bench/ml-dev-bench
ML-Dev-Bench is a benchmark for evaluating AI agents against various ML development tasks.
michaelabrt/clarte-benchmark
Paired A/B benchmark suite for Clarté - measures how dependency-graph intelligence affects AI...
zzhiyuann/agent-bench
Benchmarking framework for AI agents — pytest for AI agents. Define tasks in YAML, run against...
MSKazemi/ExaBench-QA
ExaBench-QA is a benchmark and dataset for evaluating role-aware, LLM-based AI agents for...