MMSI-Video-Bench: A Holistic Benchmark for Video-based Spatial Intelligence

1. Shanghai AI Laboratory    2. Shanghai Jiao Tong University    3. The Chinese University of Hong Kong    4. Zhejiang University    5. Beihang University
6. Xi'an Jiaotong University    7. University of Hong Kong    8. Fudan University    9. University of California, Los Angeles
*Equal Contribution Project Lead

Diverse Scenarios

25 public datasets and 1 in-house dataset covering indoor, outdoor, tabletop, and movie scenes.

High Data Quality

All data are manually annotated by multiple 3D vision (3DV) experts under strict quality control.

Comprehensive Tasks

Spatial layout reasoning, motion understanding, planning, prediction, and cross-video reasoning.

Challenging Benchmark

Most models perform poorly on MMSI-Video-Bench, leaving a gap of roughly 60% between human and AI performance.

Multifaceted Perspectives

🏠 Indoor Scene Perception Bench

🤖 Robot Bench

📍 Grounding Bench

MMSI-Video-Bench covers spatial layout reasoning, motion understanding, decision-making, and cross-video reasoning, providing a holistic evaluation of video-based spatial intelligence.

Overview

BibTeX

@misc{lin2025mmsivideobenchholisticbenchmarkvideobased,
  title={MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence},
  author={Jingli Lin and Runsen Xu and Shaohao Zhu and Sihan Yang and Peizhou Cao and Yunlong Ran and Miao Hu and Chenming Zhu and Yiman Xie and Yilin Long and Wenbo Hu and Dahua Lin and Tai Wang and Jiangmiao Pang},
  year={2025},
  eprint={2512.10863},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.10863},
}