gersteinlab/ML-Bench
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code (https://arxiv.org/abs/2311.09835)
This project evaluates how well large language models (LLMs) and AI agents perform complete machine learning tasks. Given detailed task instructions and the relevant code repositories as input, it assesses the quality of the code generated to solve the problem. Data scientists, machine learning engineers, and researchers can use it to benchmark and compare AI models on real-world ML workflows.
318 stars. No commits in the last 6 months.
Use this if you need to systematically test and compare how different large language models and AI agents handle end-to-end machine learning coding challenges, from understanding requirements to generating executable code.
Not ideal if you are looking for a tool to develop or deploy machine learning models directly, as its primary purpose is evaluation rather than model creation or production.
Stars: 318
Forks: 12
Language: Python
License: MIT
Category:
Last pushed: Jul 31, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/gersteinlab/ML-Bench"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
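For scripted access, the curl command above can be reproduced in Python. This is a minimal sketch: the endpoint path layout is taken from the example, but the JSON response schema is an assumption, so inspect the raw payload before relying on specific field names.

```python
import json
import urllib.request

# Base endpoint, taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def tool_url(owner: str, repo: str) -> str:
    """Build the per-repository endpoint URL (owner/repo path shape from the curl example)."""
    return f"{BASE}/{owner}/{repo}"

def fetch_tool(owner: str, repo: str) -> dict:
    """Fetch and decode the JSON record for one repository.

    The response schema is an assumption; print the raw payload first
    to see which fields are actually available.
    """
    with urllib.request.urlopen(tool_url(owner, repo)) as resp:
        return json.load(resp)

print(tool_url("gersteinlab", "ML-Bench"))
```

Without an API key this shares the 100 requests/day quota, so cache responses locally if you are polling many repositories.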
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)