AgentBench and heurigym

AgentBench (Established)

Maintenance 10/25, Adoption 10/25, Maturity 16/25, Community 19/25

Stars: 3,234
Forks: 241
Downloads: n/a
Commits (30d): 0
Language: Python
License: Apache-2.0

heurigym (Emerging)

Maintenance 10/25, Adoption 8/25, Maturity 16/25, Community 11/25

Stars: 64
Forks: 6
Downloads: n/a
Commits (30d): 0
Language: Python
License: Apache-2.0

No Package, No Dependents

About AgentBench

THUDM/AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

This project helps developers and researchers evaluate how well large language models (LLMs) can act as autonomous 'agents' in various real-world scenarios. It takes an LLM as input and runs it through a standardized set of tasks, like interacting with an operating system, using a database, or shopping online. The output is a performance score, showing how effectively the LLM completes these multi-step, interactive tasks.

LLM evaluation, agentic AI, AI research, model benchmarking, autonomous systems

About heurigym

cornell-zhang/heurigym

Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization (ICLR'26)

This project evaluates how effectively large language models (LLMs) can create and improve heuristics for complex real-world optimization problems. It poses combinatorial optimization problems, such as airline crew pairing or protein sequence design, and measures the quality of the heuristics that different LLMs generate. It is aimed at researchers and practitioners who apply LLMs to hard optimization tasks and want to benchmark and compare LLM approaches.

Combinatorial Optimization, Electronic Design Automation, Computational Biology, Logistics Planning, Compiler Optimization

Related comparisons

AgentBench and bigcodebench

AgentBench and LawBench

AgentBench and LLM-Agent-Benchmark-List

AgentBench and MemoryAgentBench

Scores updated daily from GitHub, PyPI, and npm data.