StonyBrookNLP/appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agents, ACL'24 Best Resource Paper.
AppWorld is a realistic, simulated digital world of everyday apps and the people who use them. It takes an agent's code as input, simulates the resulting interactions inside this world, and evaluates how well the agent completes complex tasks. It is aimed at AI researchers and developers building and testing autonomous agents.
388 stars. Available on PyPI.
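Since the package ships on PyPI, the interaction loop looks roughly like the sketch below. The names used here (AppWorld, load_task_ids, execute, evaluate) follow the project's README quickstart; treat them as assumptions and verify against the current documentation.

from appworld import AppWorld, load_task_ids

# Pick a task from one of the dataset splits ("train", "dev", ...).
task_ids = load_task_ids("train")

with AppWorld(task_id=task_ids[0], experiment_name="quickstart") as world:
    print(world.task.instruction)  # the natural-language task the agent must solve
    # The agent acts by submitting Python code that calls the simulated app APIs;
    # the world executes it and returns the output.
    output = world.execute("apis.api_docs.show_app_descriptions()")
    print(output)
    print(world.evaluate())  # scores how well the agent completed the task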
Use this if you need a high-fidelity, controllable environment to benchmark how well your AI agents can interact with software applications and perform coding-related tasks.
Not ideal if you are looking for a simple dataset for natural language understanding or a ready-to-deploy, end-user application.
Stars: 388
Forks: 59
Language: Python
License: Apache-2.0
Last pushed: Feb 17, 2026
Commits (30d): 0
Dependencies: 35
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/StonyBrookNLP/appworld"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
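If you prefer Python over curl, a minimal sketch using requests is below. It assumes the endpoint returns JSON and that a key, once obtained, is sent as a request header; the header name X-Api-Key is a hypothetical placeholder, so check the API docs for the real one.

import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/agents/StonyBrookNLP/appworld"

# No key is required for up to 100 requests/day. The header below is a
# hypothetical placeholder for keyed access; confirm it in the API docs.
headers = {}  # e.g. {"X-Api-Key": "<your-free-key>"} for 1,000 requests/day

response = requests.get(URL, headers=headers, timeout=10)
response.raise_for_status()
data = response.json()  # assumed JSON payload (stars, forks, license, ...)
print(data)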
Related agents
qualifire-dev/rogue
AI Agent Evaluator & Red Team Platform
microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...
future-agi/ai-evaluation
Evaluation Framework for all your AI related Workflows
agentscope-ai/OpenJudge
OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards
dreadnode/AIRTBench-Code
Code Repository for: AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models