THUDM/VisualAgentBench
Towards Large Multimodal Models as Visual Foundation Agents
This benchmark helps researchers and AI developers systematically assess how well large multimodal models (LMMs) act as agents in visual environments. You plug in an LMM, run it against a suite of diverse visual tasks (navigating an embodied simulation, operating a graphical user interface, or styling web elements), and it reports performance metrics such as the model's success rate on each task. It is aimed at AI researchers and practitioners who develop or evaluate LMMs for agentic applications.
258 stars. No commits in the last 6 months.
Use this if you need to benchmark the capabilities of large multimodal models to understand and act within various visual environments, from embodied simulations to web interfaces and visual design tasks.
Not ideal if you are looking for a tool to build or deploy LMM-powered applications directly, as this focuses on model evaluation rather than application development.
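To make the evaluation loop concrete, here is a minimal, hypothetical sketch of how a harness like this turns per-task pass/fail outcomes into a success rate. The names evaluate, agent, and task_ids are illustrative stand-ins, not VisualAgentBench's actual API:

from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskResult:
    task_id: str
    success: bool

def evaluate(agent: Callable[[str], bool], task_ids: list[str]) -> float:
    """Run the agent on each task and return the overall success rate.

    `agent` is a hypothetical callable that rolls out one episode in a
    visual environment and reports whether the goal was reached; the
    real harness wraps LMM calls per environment.
    """
    results = [TaskResult(t, agent(t)) for t in task_ids]
    passed = sum(r.success for r in results)
    return passed / len(results) if results else 0.0

VisualAgentBench's actual harness differs per environment (embodied simulation, GUI, web design), but the reported metric reduces to this kind of per-task pass/fail aggregation.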
Stars: 258
Forks: 10
Language: Python
License: Apache-2.0
Category: LLM tools
Last pushed: Apr 24, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/THUDM/VisualAgentBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
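If you would rather consume the endpoint from a script than curl, here is a minimal Python sketch. The response schema is an assumption for illustration; the endpoint is only documented as returning this repository's quality data, so inspect the actual payload before relying on specific field names:

import json
import urllib.request

# Fetch the quality record for THUDM/VisualAgentBench.
# NOTE: assuming the endpoint returns a JSON object; the exact
# field names are not documented here.
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/THUDM/VisualAgentBench"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

# Print whatever top-level fields the API returns.
for key, value in data.items():
    print(f"{key}: {value}")

At 100 requests/day on the keyless tier, a single daily fetch per repository stays well within the limit.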
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems