Built from 25 public datasets and 1 in-house dataset, covering indoor, outdoor, tabletop, and movie scenes.
All samples are manually annotated by multiple 3D vision (3DV) experts under strict quality control.
Tasks cover spatial layout reasoning, motion understanding, planning, prediction, and cross-video reasoning.
Most models perform poorly on MMSI-Video-Bench, highlighting a ~60% human-AI gap.
🏠 Indoor Scene Perception Bench
🤖 Robot Bench
📍 Grounding Bench
| Rank | Model | Avg. (%) | Type |
|---|---|---|---|
@misc{lin2025mmsivideobenchholisticbenchmarkvideobased,
title={MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence},
author={Jingli Lin and Runsen Xu and Shaohao Zhu and Sihan Yang and Peizhou Cao and Yunlong Ran and Miao Hu and Chenming Zhu and Yiman Xie and Yilin Long and Wenbo Hu and Dahua Lin and Tai Wang and Jiangmiao Pang},
year={2025},
eprint={2512.10863},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.10863},
}