LiqiangJing/DSBench
[ICLR 2025] DSBench: How Far are Data Science Agents from Becoming Data Science Experts?
DSBench helps AI researchers and developers evaluate the performance of data science agents: AI systems designed to perform data analysis and modeling. You supply task instructions (which can include images and tables) together with raw data files, and the framework assesses how well an agent solves the resulting data science challenge, giving a standardized benchmark for comparing agents.
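As a rough illustration of this kind of workflow (not DSBench's actual interface), the loop below iterates over tasks, calls the agent under test, and scores its answers. The directory layout (tasks/<id>/task.json, tasks/<id>/data/), the run_agent and score functions, and the exact-match metric are hypothetical placeholders.

# Hypothetical sketch of a benchmark-style evaluation loop; the task layout,
# agent interface, and scoring metric are illustrative, not DSBench's API.
import json
from pathlib import Path

def run_agent(instructions: str, data_files: list[Path]) -> str:
    """Placeholder for the data science agent under evaluation."""
    raise NotImplementedError("plug in the agent you want to benchmark")

def score(prediction: str, reference: str) -> float:
    """Placeholder metric (exact match); real benchmarks use task-specific metrics."""
    return float(prediction.strip() == reference.strip())

results = []
for task_dir in sorted(Path("tasks").iterdir()):              # one folder per task (assumed layout)
    spec = json.loads((task_dir / "task.json").read_text())    # instructions + reference answer (assumed)
    data_files = list((task_dir / "data").glob("*"))
    prediction = run_agent(spec["instructions"], data_files)
    results.append(score(prediction, spec["answer"]))

print(f"Accuracy: {sum(results) / len(results):.3f}")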
108 stars. No commits in the last 6 months.
Use this if you are developing or evaluating AI-driven data science agents and need a standardized way to measure their capability on realistic data tasks.
Not ideal if you are an end-user looking for a tool to perform data analysis directly, as this is a framework for evaluating other AI systems.
Stars: 108
Forks: 10
Language: Jupyter Notebook
License: —
Category: —
Last pushed: Aug 17, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/LiqiangJing/DSBench"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
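A minimal Python sketch of the same request, assuming the endpoint returns JSON (the response schema is not documented here):

# Fetch the same record from Python; assumes a JSON response.
import requests

url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/LiqiangJing/DSBench"
response = requests.get(url, timeout=30)
response.raise_for_status()   # fail loudly on rate limiting or server errors
data = response.json()        # parse the JSON body
print(data)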
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems