SiyuanHuang95/ManipVQA
[IROS24 Oral] ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models
This project helps roboticists and AI researchers improve how robots understand and interact with objects in the real world. By fine-tuning Multimodal Large Language Models (MLLMs) on visual data about object affordances and physical properties, robots can better interpret natural language commands for manipulation tasks. It takes standard image-text data and produces an MLLM enhanced with robotic affordance and physical grounding.
102 stars. No commits in the last 6 months.
Use this if you are developing robotic systems and need your robots to better understand how to interact with objects based on visual cues and natural language instructions.
Not ideal if your primary focus is on general image understanding or natural language processing without a direct application to robotic manipulation.
Stars: 102
Forks: 3
Language: Python
License: —
Category:
Last pushed: Aug 22, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/SiyuanHuang95/ManipVQA"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
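If you prefer to query the endpoint from Python instead of curl, a minimal sketch is shown below. It assumes the endpoint returns JSON; the response field names are not documented on this page, so the example only fetches and prints the raw payload.

# Minimal sketch: fetch repo quality data from the API (response schema assumed, not documented here).
import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/SiyuanHuang95/ManipVQA"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
data = resp.json()  # assumed to be a JSON object containing the stats listed above
print(data)
# Note: a free API key raises the limit from 100 to 1,000 requests/day;
# how the key is passed (header or query parameter) is not shown on this page.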
Higher-rated alternatives
xrsrke/toolformer
Implementation of Toolformer: Language Models Can Teach Themselves to Use Tools
MozerWang/AMPO
[ICLR 2026] Adaptive Social Learning via Mode Policy Optimization for Language Agents
real-stanford/reflect
[CoRL 2023] REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction
nsidn98/LLaMAR
Code for our paper LLaMAR: LM-based Long-Horizon Planner for Multi-Agent Robotics
BatsResearch/planetarium
Dataset and benchmark for assessing LLMs in translating natural language descriptions of...