StonyBrookNLP/appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agents, ACL'24 Best Resource Paper.
AppWorld is a realistic, simulated digital world of everyday apps and the people who use them. It takes an agent's code as input, simulates the resulting interactions inside this world, and evaluates how well the agent completes complex tasks. It is aimed at AI researchers and developers building and testing autonomous agents.
388 stars. Available on PyPI.
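Since the package ships on PyPI, the interaction loop looks roughly like the sketch below. The names used here (AppWorld, load_task_ids, execute, evaluate) follow the project's README quickstart; treat them as assumptions and verify against the current documentation.

from appworld import AppWorld, load_task_ids

# Pick a task from one of the dataset splits ("train", "dev", ...).
task_ids = load_task_ids("train")

with AppWorld(task_id=task_ids[0], experiment_name="quickstart") as world:
    print(world.task.instruction)  # the natural-language task the agent must solve
    # The agent acts by submitting Python code that calls the simulated app APIs;
    # the world executes it and returns the output.
    output = world.execute("apis.api_docs.show_app_descriptions()")
    print(output)
    print(world.evaluate())  # scores how well the agent completed the task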
Use this if you need a high-fidelity, controllable environment to benchmark how well your AI agents can interact with software applications and perform coding-related tasks.
Not ideal if you are looking for a simple dataset for natural language understanding or a ready-to-deploy, end-user application.
Stars: 388
Forks: 59
Language: Python
License: Apache-2.0
Last pushed: Feb 17, 2026
Commits (30d): 0
Dependencies: 35
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/StonyBrookNLP/appworld"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
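If you prefer Python over curl, a minimal sketch using requests is below. It assumes the endpoint returns JSON and that a key, once obtained, is sent as a request header; the header name X-Api-Key is a hypothetical placeholder, so check the API docs for the real one.

import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/agents/StonyBrookNLP/appworld"

# No key is required for up to 100 requests/day. The header below is a
# hypothetical placeholder for keyed access; confirm it in the API docs.
headers = {}  # e.g. {"X-Api-Key": "<your-free-key>"} for 1,000 requests/day

response = requests.get(URL, headers=headers, timeout=10)
response.raise_for_status()
data = response.json()  # assumed JSON payload (stars, forks, license, ...)
print(data)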
Related agents
qualifire-dev/rogue
AI Agent Evaluator & Red Team Platform
microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...
future-agi/ai-evaluation
Evaluation Framework for all your AI related Workflows
agentscope-ai/OpenJudge
OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards
dreadnode/AIRTBench-Code
Code Repository for: AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models