yjyddq/RiOSWorld
[NeurIPS 2025] Official repository of RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents
This project helps researchers and developers evaluate the potential risks of multimodal computer-use agents, particularly those designed to interact with a desktop environment. Given a computer-use agent (for example, an AI assistant that controls a mouse and keyboard), it benchmarks the agent's risky behaviors across a variety of scenarios and reports the results. The primary users are AI researchers and developers working on agent safety and trustworthiness.
Use this if you are developing or researching multimodal AI agents and need a standardized way to benchmark their safety and identify risky behaviors.
Not ideal if you are an end-user looking for a pre-built safety tool for AI agents, as this is a research and benchmarking framework.
Stars: 117
Forks: 6
Language: HTML
License: —
Category: —
Last pushed: Dec 02, 2025
Commits (last 30 days): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/yjyddq/RiOSWorld"
Open to everyone: 100 requests per day with no key needed. Get a free key for 1,000 requests per day.
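For programmatic use, a minimal Python sketch of the same request is below. It assumes only that the endpoint returns JSON; the response schema is not documented on this page, so no particular field names are relied on.

import requests

# Fetch quality data for yjyddq/RiOSWorld (no key needed for up to 100 requests/day).
url = "https://pt-edge.onrender.com/api/v1/quality/agents/yjyddq/RiOSWorld"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
data = resp.json()  # assumed to be JSON; inspect the payload to see the available fields
print(data)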
Higher-rated alternatives
StonyBrookNLP/appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and...
qualifire-dev/rogue
AI Agent Evaluator & Red Team Platform
microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...
future-agi/ai-evaluation
Evaluation Framework for all your AI related Workflows
RouteWorks/RouterArena
RouterArena: An open framework for evaluating LLM routers with standardized datasets, metrics,...