THU-SI/Spatial-MLLM
[NeurIPS 2025] Official implementation of Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
This project is for anyone who needs to understand and reason about spatial relationships in video. Given video input, the model identifies objects, their positions, and how they interact in space, and produces accurate answers to spatial questions. It suits professionals in fields such as surveillance, robotics, or video analysis who need detailed spatial intelligence from visual data.
Use this if you need to accurately extract and reason about spatial information from video recordings to understand complex scene layouts or object interactions.
Not ideal if your primary need is general object recognition or activity detection without a strong emphasis on precise spatial understanding and reasoning.
Stars: 447
Forks: 17
Language: Python
License: MIT
Category:
Last pushed: Feb 05, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/THU-SI/Spatial-MLLM"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
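The same endpoint can be called from Python. A minimal sketch, assuming the endpoint returns a JSON payload (the field names are not documented here, so the example only builds the URL and decodes the response generically):

```python
# Sketch: query the pt-edge quality API for a repository.
# Assumption: the endpoint returns JSON; specific fields are not verified.
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def quality_url(owner: str, repo: str) -> str:
    """Build the API URL for a given owner/repo pair."""
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch and decode the JSON payload (100 requests/day without a key)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(quality_url("THU-SI", "Spatial-MLLM"))
```

With an API key for the higher rate limit, you would typically pass it as a header or query parameter; check the service's documentation for the exact mechanism.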
Higher-rated alternatives
KimMeen/Time-LLM
[ICLR 2024] Official implementation of " 🦙 Time-LLM: Time Series Forecasting by Reprogramming...
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
fixie-ai/ultravox
A fast multimodal LLM for real-time voice