THU-SI/Spatial-MLLM
[NeurIPS 2025] Official implementation of Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
This project is for anyone who needs to understand and reason about spatial relationships in video. Given video input, the model identifies objects, their positions, and how they interact in space, and produces accurate answers to spatial questions. It suits professionals in fields such as surveillance, robotics, or video analysis who need detailed spatial intelligence from visual data.
Use this if you need to accurately extract and reason about spatial information from video recordings to understand complex scene layouts or object interactions.
Not ideal if your primary need is general object recognition or activity detection without a strong emphasis on precise spatial understanding and reasoning.
Stars: 447
Forks: 17
Language: Python
License: MIT
Category:
Last pushed: Feb 05, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/THU-SI/Spatial-MLLM"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
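The same endpoint can be called from Python. A minimal sketch, assuming the endpoint returns a JSON payload (the field names are not documented here, so the example only builds the URL and decodes the response generically):

```python
# Sketch: query the pt-edge quality API for a repository.
# Assumption: the endpoint returns JSON; specific fields are not verified.
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def quality_url(owner: str, repo: str) -> str:
    """Build the API URL for a given owner/repo pair."""
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch and decode the JSON payload (100 requests/day without a key)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(quality_url("THU-SI", "Spatial-MLLM"))
```

With an API key for the higher rate limit, you would typically pass it as a header or query parameter; check the service's documentation for the exact mechanism.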
Higher-rated alternatives
KimMeen/Time-LLM
[ICLR 2024] Official implementation of " 🦙 Time-LLM: Time Series Forecasting by Reprogramming...
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
fixie-ai/ultravox
A fast multimodal LLM for real-time voice