THUDM/VisualAgentBench
Towards Large Multimodal Models as Visual Foundation Agents
This benchmark helps researchers and AI developers systematically assess how well large multimodal models (LMMs) act as agents in visual environments. You plug in an LMM, run it against a suite of diverse visual tasks (navigating an embodied simulation, operating a graphical user interface, or styling web elements), and it reports performance metrics such as the model's success rate on each task. It is aimed at AI researchers and practitioners who develop or evaluate LMMs for agentic applications.
258 stars. No commits in the last 6 months.
Use this if you need to benchmark the capabilities of large multimodal models to understand and act within various visual environments, from embodied simulations to web interfaces and visual design tasks.
Not ideal if you are looking for a tool to build or deploy LMM-powered applications directly, as this focuses on model evaluation rather than application development.
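To make the evaluation loop concrete, here is a minimal, hypothetical sketch of how a harness like this turns per-task pass/fail outcomes into a success rate. The names evaluate, agent, and task_ids are illustrative stand-ins, not VisualAgentBench's actual API:

from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskResult:
    task_id: str
    success: bool

def evaluate(agent: Callable[[str], bool], task_ids: list[str]) -> float:
    """Run the agent on each task and return the overall success rate.

    `agent` is a hypothetical callable that rolls out one episode in a
    visual environment and reports whether the goal was reached; the
    real harness wraps LMM calls per environment.
    """
    results = [TaskResult(t, agent(t)) for t in task_ids]
    passed = sum(r.success for r in results)
    return passed / len(results) if results else 0.0

VisualAgentBench's actual harness differs per environment (embodied simulation, GUI, web design), but the reported metric reduces to this kind of per-task pass/fail aggregation.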
Stars: 258
Forks: 10
Language: Python
License: Apache-2.0
Category: LLM tools
Last pushed: Apr 24, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/THUDM/VisualAgentBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
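If you would rather consume the endpoint from a script than curl, here is a minimal Python sketch. The response schema is an assumption for illustration; the endpoint is only documented as returning this repository's quality data, so inspect the actual payload before relying on specific field names:

import json
import urllib.request

# Fetch the quality record for THUDM/VisualAgentBench.
# NOTE: assuming the endpoint returns a JSON object; the exact
# field names are not documented here.
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/THUDM/VisualAgentBench"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

# Print whatever top-level fields the API returns.
for key, value in data.items():
    print(f"{key}: {value}")

At 100 requests/day on the keyless tier, a single daily fetch per repository stays well within the limit.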
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems