OSU-NLP-Group/ScienceAgentBench
[ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
This project helps scientific researchers and data scientists assess how well AI language agents can automate specific tasks within data-driven scientific discovery workflows. Each task is given to an agent as a natural-language description, and the agent is expected to return a self-contained Python program. The benchmark then rigorously evaluates the generated code on correctness, execution results, and cost. It is designed for researchers developing or using AI agents in scientific domains.
Use this if you are a researcher or data scientist who needs to objectively evaluate how well an AI language agent can perform individual coding tasks within scientific research, before relying on it for end-to-end automation.
Not ideal if you are looking for an AI agent to directly perform end-to-end scientific discovery without needing to evaluate its intermediate coding capabilities, or if your tasks do not involve Python programming.
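To make this concrete, here is a purely illustrative sketch of what a single task and the agent's expected output might look like. The field names, file paths, and program below are hypothetical; they are not taken from ScienceAgentBench's actual task files, whose real schema is documented in the repository.

# Hypothetical example only: field names and paths are invented for illustration,
# not taken from ScienceAgentBench's actual task format.
task = {
    "task_description": (
        "Fit a linear regression to the provided gene-expression dataset "
        "and write the R^2 score to output/r2_score.txt."
    ),
    "dataset_path": "data/expression.csv",      # hypothetical input file
    "expected_output": "output/r2_score.txt",   # hypothetical artifact to check
}

# The agent is expected to return a self-contained Python program, e.g.:
agent_program = """
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv("data/expression.csv")
X, y = df.drop(columns=["target"]), df["target"]
model = LinearRegression().fit(X, y)
with open("output/r2_score.txt", "w") as f:
    f.write(str(r2_score(y, model.predict(X))))
"""

# The benchmark then executes the generated program and scores it on output
# correctness, successful execution, and cost.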
Stars: 128
Forks: 17
Language: Python
License: MIT
Last pushed: Mar 05, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/OSU-NLP-Group/ScienceAgentBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
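If you prefer to query the endpoint from Python rather than curl, a minimal sketch along these lines should work. The response schema and the header used to pass an API key are not documented on this page, so the field names and the X-API-Key header below are assumptions.

import requests

URL = ("https://pt-edge.onrender.com/api/v1/quality/agents/"
       "OSU-NLP-Group/ScienceAgentBench")

# Anonymous access is limited to 100 requests/day; a free key raises this to 1,000/day.
# ASSUMPTION: the API key header name is a guess, not documented on this page.
headers = {}  # e.g. {"X-API-Key": "your-key"}

resp = requests.get(URL, headers=headers, timeout=10)
resp.raise_for_status()
data = resp.json()

# ASSUMPTION: field names are guesses based on the stats shown above, not a documented schema.
print(data.get("stars"), data.get("forks"), data.get("last_pushed"))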
Related agents
Tongyi-MAI/MobileWorld
Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments
ml-dev-bench/ml-dev-bench
ML-Dev-Bench is a benchmark for evaluating AI agents against various ML development tasks.
michaelabrt/clarte-benchmark
Paired A/B benchmark suite for Clarté - measures how dependency-graph intelligence affects AI...
zzhiyuann/agent-bench
Benchmarking framework for AI agents — pytest for AI agents. Define tasks in YAML, run against...
MSKazemi/ExaBench-QA
ExaBench-QA is a benchmark and dataset for evaluating role-aware, LLM-based AI agents for...