OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding


The video illustrates OST-Bench's online setting: for the same question, the agent's answer evolves as it explores the scene, changing from t1 to t2 to t3, reflecting its continuously updated understanding.

Abstract

Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models in offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. Built on an efficient data-collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Further experimental analysis highlights the core challenges that must be addressed to improve online embodied reasoning.

Overview

  • 1. Online Processing: Continuously process and reason over incrementally acquired observations in real time.
  • 2. Spatio-Temporal Integration: Dynamically combine current observations with memory to support spatial reasoning across time (a minimal sketch of this interaction loop follows the list).
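
To make the online setting concrete, here is a minimal Python sketch of the interaction loop: the agent receives observations one step at a time, folds them into memory, and answers questions against that growing memory. All names below (Observation, OnlineAgent, query_mllm) are hypothetical illustrations, not OST-Bench's actual evaluation API.

```python
# Illustrative only: a toy online agent loop, not OST-Bench's real interface.
from dataclasses import dataclass, field


@dataclass
class Observation:
    rgb_frame: bytes  # the agent's current egocentric view
    timestep: int


def query_mllm(question: str, current: Observation, history: list) -> str:
    """Placeholder for a real MLLM call that would receive the question,
    the latest frame, and the remembered frames."""
    return f"answer at t={current.timestep}, using {len(history)} remembered frames"


@dataclass
class OnlineAgent:
    memory: list = field(default_factory=list)

    def observe(self, obs: Observation) -> None:
        # Online processing: frames arrive incrementally and are appended to
        # memory, rather than being provided as one offline batch.
        self.memory.append(obs)

    def answer(self, question: str) -> str:
        # Spatio-temporal integration: the current view is combined with
        # memory, so the same question can receive different answers at
        # t1, t2, t3, ... as exploration continues.
        return query_mllm(question, current=self.memory[-1], history=self.memory[:-1])


agent = OnlineAgent()
for t, frame in enumerate([b"frame0", b"frame1", b"frame2"]):
    agent.observe(Observation(rgb_frame=frame, timestep=t))
    print(agent.answer("Is the sofa on my left?"))
```

Each pass through the loop answers the same question with one more observation in memory, mirroring the evolving t1/t2/t3 answers shown in the video above.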

Leaderboard

Scores are accuracy (%).

| Type | Model | Overall | Agent State | Agent Visible Info | Agent-Object Spatial |
|------|-------|---------|-------------|--------------------|----------------------|
| Proprietary | Claude-3.5-Sonnet | 47.77 | 45.55 | 65.56 | 32.85 |
| Proprietary | Gemini-2.0-Flash | 49.54 | 45.05 | 70.82 | 33.80 |
| Proprietary | Gemini-2.0-Flash (Thinking) | 54.25 | 47.05 | 72.30 | 42.75 |
| Proprietary | GPT-4o | 48.72 | 38.83 | 72.76 | 33.52 |
| Proprietary | GPT-4.1 | 53.40 | 47.23 | 76.46 | 37.65 |
| Open-Source | InternVL-2.5-8B | 38.98 | 41.88 | 52.78 | 29.18 |
| Open-Source | InternVL-2.5-38B | 50.78 | 45.38 | 73.88 | 33.95 |
| Open-Source | InternVL-2.5-78B | 51.08 | 46.45 | 74.02 | 32.93 |
| Open-Source | QwenVL-2.5-7B | 41.16 | 40.43 | 52.56 | 31.53 |
| Open-Source | QwenVL-2.5-32B | 46.86 | 43.75 | 64.90 | 32.18 |
| Open-Source | QwenVL-2.5-72B | 45.62 | 43.48 | 64.46 | 28.87 |
| Open-Source | LLaVA-Video-7B | 39.28 | 33.50 | 58.32 | 28.80 |
| Open-Source | LLaVA-Video-72B | 43.22 | 39.95 | 60.48 | 35.07 |
| Open-Source | LLaVA-Onevision-7B | 40.36 | 31.08 | 55.24 | 33.63 |
| Open-Source | LLaVA-Onevision-72B | 43.44 | 38.88 | 61.60 | 36.23 |
| Baseline | Human-Level | 84.05 | 74.83 | 93.40 | 81.02 |
| Baseline | Chance-Level | 35.73 | 44.28 | 32.42 | 35.72 |
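
The Chance-Level row is the expected score of uninformed guessing. As a hedged sketch, one common way to estimate such a baseline for multiple-choice questions is to average 1/num_options over the question set; the schema below is hypothetical, and OST-Bench's exact protocol (e.g., for non-choice questions) may differ.

```python
# Hypothetical question schema; illustrative chance-baseline estimate only.
def chance_level(questions: list[dict]) -> float:
    """Expected accuracy (%) of uniform random guessing over MCQ questions."""
    return 100.0 * sum(1.0 / q["num_options"] for q in questions) / len(questions)


# A mix of 2-way and 4-way questions yields a chance level between 25% and
# 50%, which is why the chance rows differ across question subsets.
print(chance_level([{"num_options": 2}, {"num_options": 4}, {"num_options": 4}]))  # 33.33...
```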

Analysis

  • 1. Performance Drop During Exploration: We observe a significant decline in model accuracy as the agent explores further and the number of sequential observations grows in the online setting (a sketch of this per-step analysis follows the list).
  • 2. Error Distribution Statistics: Reasoning errors account for over 60% of all errors, making them the primary bottleneck.
  • 3. Spatio-Temporal Reasoning Shortcuts: Models tend to skip retrieving key information, instead taking shortcuts and relying on shallow, unsupported inferences.
  • 4. Cross-View Analysis: Model performance drops significantly when questions demand either complex clue-based spatial reasoning or long-term memory retrieval.
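
As a concrete illustration of point 1, the sketch below shows one way to compute a per-step accuracy curve: group each question by the number of observations the agent had received when it was asked, then measure accuracy within each group. The record fields are hypothetical, not the benchmark's actual schema.

```python
# Illustrative per-step accuracy analysis with a hypothetical record schema.
from collections import defaultdict


def accuracy_by_step(records: list[dict]) -> dict[int, float]:
    """records: [{"num_observations": int, "correct": bool}, ...]"""
    hits: dict[int, int] = defaultdict(int)
    totals: dict[int, int] = defaultdict(int)
    for r in records:
        totals[r["num_observations"]] += 1
        hits[r["num_observations"]] += int(r["correct"])
    # Percent accuracy per observation-count bucket, in increasing order.
    return {n: 100.0 * hits[n] / totals[n] for n in sorted(totals)}


print(accuracy_by_step([
    {"num_observations": 5, "correct": True},
    {"num_observations": 5, "correct": True},
    {"num_observations": 20, "correct": True},
    {"num_observations": 20, "correct": False},
]))  # e.g. {5: 100.0, 20: 50.0} -- a declining curve matches point 1
```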

BibTeX

BibTeX code here.