SiyuanHuang95/ManipVQA
[IROS24 Oral] ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models
This project helps roboticists and AI researchers improve how robots understand and interact with objects in the real world. By fine-tuning Multimodal Large Language Models (MLLMs) on visual data about object affordances and physical properties, robots can better interpret natural language commands for manipulation tasks. It takes standard image-text data and produces an MLLM enhanced with robotic affordance and physical grounding.
102 stars. No commits in the last 6 months.
Use this if you are developing robotic systems and need your robots to better understand how to interact with objects based on visual cues and natural language instructions.
Not ideal if your primary focus is on general image understanding or natural language processing without a direct application to robotic manipulation.
Stars: 102
Forks: 3
Language: Python
License: —
Category:
Last pushed: Aug 22, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/SiyuanHuang95/ManipVQA"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
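If you prefer to query the endpoint from Python instead of curl, a minimal sketch is shown below. It assumes the endpoint returns JSON; the response field names are not documented on this page, so the example only fetches and prints the raw payload.

# Minimal sketch: fetch repo quality data from the API (response schema assumed, not documented here).
import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/SiyuanHuang95/ManipVQA"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
data = resp.json()  # assumed to be a JSON object containing the stats listed above
print(data)
# Note: a free API key raises the limit from 100 to 1,000 requests/day;
# how the key is passed (header or query parameter) is not shown on this page.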
Higher-rated alternatives
xrsrke/toolformer
Implementation of Toolformer: Language Models Can Teach Themselves to Use Tools
MozerWang/AMPO
[ICLR 2026] Adaptive Social Learning via Mode Policy Optimization for Language Agents
real-stanford/reflect
[CoRL 2023] REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction
nsidn98/LLaMAR
Code for our paper LLaMAR: LM-based Long-Horizon Planner for Multi-Agent Robotics
BatsResearch/planetarium
Dataset and benchmark for assessing LLMs in translating natural language descriptions of...