gersteinlab/ML-Bench

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code (https://arxiv.org/abs/2311.09835)

Score: 38 / 100 (Emerging)

This project helps evaluate how well large language models (LLMs) and AI agents can perform complete machine learning tasks. It takes detailed task instructions and the relevant code repositories as input, then assesses the quality of the code generated to solve each task. Data scientists, machine learning engineers, and researchers can use it to benchmark and understand the capabilities of various AI models on real-world ML workflows.

318 stars. No commits in the last 6 months.

Use this if you need to systematically test and compare how different large language models and AI agents handle end-to-end machine learning coding challenges, from understanding requirements to generating executable code.

Not ideal if you are looking for a tool to develop or deploy machine learning models directly, as its primary purpose is evaluation rather than model creation or production.

Tags: machine-learning-benchmarking, large-language-models, AI-agent-evaluation, code-generation, ML-workflow-automation
Status: Stale (6 months) · Not published as a package · No known dependents
Maintenance 2 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 10 / 25
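
The four subscores sum exactly to the headline score (2 + 10 + 16 + 10 = 38). A minimal sketch of that arithmetic in Python, assuming the overall score is a plain unweighted sum of the four subscores; the scoring service's actual formula is not documented on this page:

# Subscores as shown on this page, each out of 25.
subscores = {"Maintenance": 2, "Adoption": 10, "Maturity": 16, "Community": 10}

# Assumption: the overall score is the unweighted sum of the four
# subscores. This reproduces the 38 / 100 shown above.
total = sum(subscores.values())
print(f"{total} / 100")  # -> 38 / 100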

Stars: 318
Forks: 12
Language: Python
License: MIT
Last pushed: Jul 31, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/gersteinlab/ML-Bench"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
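
For programmatic use, here is a minimal sketch of the same request in Python with the requests library. The URL is taken from the curl command above; no response field names are assumed, since the JSON schema is not documented on this page.

import json

import requests

# Same public endpoint as the curl example; per the note above, no key
# is needed for up to 100 requests/day.
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/gersteinlab/ML-Bench"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors

# Pretty-print whatever the API returns; the schema is not assumed here.
print(json.dumps(resp.json(), indent=2))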