OSU-NLP-Group/ScienceAgentBench

[ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

Quality score: 52 / 100 (Established)

This project helps scientific researchers and data scientists assess how well AI language agents can automate specific tasks within data-driven scientific discovery workflows. Given a natural-language description of a scientific task, an agent must produce a self-contained Python program; the benchmark then rigorously evaluates the generated code on correctness, execution results, and cost. It is designed for researchers developing or using AI agents in scientific domains.


Use this if you are a researcher or data scientist who needs to objectively evaluate how well an AI language agent can perform individual coding tasks within scientific research, before relying on it for end-to-end automation.

Not ideal if you are looking for an AI agent to directly perform end-to-end scientific discovery without needing to evaluate its intermediate coding capabilities, or if your tasks do not involve Python programming.

scientific-research data-science AI-evaluation computational-science research-automation
No Package No Dependents
Maintenance 10 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 16 / 25


Stars: 128
Forks: 17
Language: Python
License: MIT
Last pushed: Mar 05, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/agents/OSU-NLP-Group/ScienceAgentBench"

Open to everyone: 100 requests/day with no key required. A free API key raises the limit to 1,000 requests/day.
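The same endpoint can be called from Python instead of curl. A minimal sketch, assuming the API returns a JSON body (the response schema is not documented on this page, so no specific fields are assumed):

```python
import json
import urllib.request

# Endpoint base taken from the curl example on this page.
BASE = "https://pt-edge.onrender.com/api/v1/quality/agents"


def quality_url(owner: str, repo: str) -> str:
    """Build the quality-score endpoint URL for a GitHub owner/repo pair."""
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality report; assumes the API responds with JSON."""
    with urllib.request.urlopen(quality_url(owner, repo), timeout=10) as resp:
        return json.load(resp)


# Build the URL for this repository (no network call needed for this step).
print(quality_url("OSU-NLP-Group", "ScienceAgentBench"))
```

For keyed access at the higher rate limit, you would attach the key per the service's instructions; the exact header or query-parameter name is not shown on this page, so it is left out here.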