gersteinlab/ML-Bench
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code (https://arxiv.org/abs/2311.09835)
This project evaluates how well large language models (LLMs) and AI agents perform complete machine learning tasks. Given detailed task instructions and the relevant code repositories as input, it assesses the quality of the code generated to solve the problem. Data scientists, machine learning engineers, and researchers can use it to benchmark and compare AI models on real-world ML workflows.
318 stars. No commits in the last 6 months.
Use this if you need to systematically test and compare how different large language models and AI agents handle end-to-end machine learning coding challenges, from understanding requirements to generating executable code.
Not ideal if you are looking for a tool to develop or deploy machine learning models directly, as its primary purpose is evaluation rather than model creation or production.
Stars: 318
Forks: 12
Language: Python
License: MIT
Category:
Last pushed: Jul 31, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/gersteinlab/ML-Bench"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
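For scripted access, the curl command above can be reproduced in Python. This is a minimal sketch: the endpoint path layout is taken from the example, but the JSON response schema is an assumption, so inspect the raw payload before relying on specific field names.

```python
import json
import urllib.request

# Base endpoint, taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def tool_url(owner: str, repo: str) -> str:
    """Build the per-repository endpoint URL (owner/repo path shape from the curl example)."""
    return f"{BASE}/{owner}/{repo}"

def fetch_tool(owner: str, repo: str) -> dict:
    """Fetch and decode the JSON record for one repository.

    The response schema is an assumption; print the raw payload first
    to see which fields are actually available.
    """
    with urllib.request.urlopen(tool_url(owner, repo)) as resp:
        return json.load(resp)

print(tool_url("gersteinlab", "ML-Bench"))
```

Without an API key this shares the 100 requests/day quota, so cache responses locally if you are polling many repositories.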
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)