Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Further experimental analysis highlights the core challenges that must be addressed to improve online embodied reasoning.
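In the online setting, observations arrive incrementally as the agent explores, and questions are interleaved with the observation stream rather than asked after a complete recording. The snippet below is a minimal sketch of what such a multi-turn evaluation loop could look like; all names (`Turn`, `Episode`, `model.chat`, field names) are hypothetical placeholders, not the benchmark's actual API.

```python
# Hypothetical sketch of an online, multi-turn evaluation loop in the spirit
# of OST-Bench. The model receives new observations each turn and must answer
# a question grounded in everything it has seen so far.
from dataclasses import dataclass, field


@dataclass
class Turn:
    frames: list   # new egocentric observations revealed this turn
    question: str  # question about the scene explored so far
    answer: str    # ground-truth answer


@dataclass
class Episode:
    scene_id: str
    turns: list = field(default_factory=list)


def evaluate_episode(model, episode):
    """Feed observations incrementally, keeping the full dialogue history so
    the model can reason over the whole exploration trajectory."""
    history = []  # interleaved (observations, question, prediction) records
    correct = 0
    for turn in episode.turns:
        # model.chat is a stand-in for whatever multi-turn MLLM interface is used
        prediction = model.chat(history, new_frames=turn.frames,
                                question=turn.question)
        correct += int(prediction.strip() == turn.answer)
        history.append((turn.frames, turn.question, prediction))
    return correct / max(len(episode.turns), 1)
```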
| Type | Model | Overall | Agent State | Agent Visible Info | Agent Object Spatial |
|---|---|---|---|---|---|
| Proprietary Models | Claude-3.5-Sonnet | 47.77 | 45.55 | 65.56 | 32.85 |
| Proprietary Models | Gemini-2.0-Flash | 49.54 | 45.05 | 70.82 | 33.80 |
| Proprietary Models | Gemini-2.0-Flash (Thinking) | 54.25 | 47.05 | 72.30 | 42.75 |
| Proprietary Models | GPT-4o | 48.72 | 38.83 | 72.76 | 33.52 |
| Proprietary Models | GPT-4.1 | 53.40 | 47.23 | 76.46 | 37.65 |
| Open-Source Models | InternVL-2.5-8B | 38.98 | 41.88 | 52.78 | 29.18 |
| Open-Source Models | InternVL-2.5-38B | 50.78 | 45.38 | 73.88 | 33.95 |
| Open-Source Models | InternVL-2.5-78B | 51.08 | 46.45 | 74.02 | 32.93 |
| Open-Source Models | QwenVL-2.5-7B | 41.16 | 40.43 | 52.56 | 31.53 |
| Open-Source Models | QwenVL-2.5-32B | 46.86 | 43.75 | 64.90 | 32.18 |
| Open-Source Models | QwenVL-2.5-72B | 45.62 | 43.48 | 64.46 | 28.87 |
| Open-Source Models | LLaVA-Video-7B | 39.28 | 33.50 | 58.32 | 28.80 |
| Open-Source Models | LLaVA-Video-72B | 43.22 | 39.95 | 60.48 | 35.07 |
| Open-Source Models | LLaVA-Onevision-7B | 40.36 | 31.08 | 55.24 | 33.63 |
| Open-Source Models | LLaVA-Onevision-72B | 43.44 | 38.88 | 61.60 | 36.23 |
| Baseline References | Human-Level | 84.05 | 74.83 | 93.40 | 81.02 |
| Baseline References | Chance-Level | 35.73 | 44.28 | 32.42 | 35.72 |
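The Overall column plausibly averages correctness over all questions rather than over the three category scores (the two differ when categories contain different numbers of questions). The sketch below illustrates that kind of aggregation under this assumption; the record fields are illustrative, not the benchmark's actual schema.

```python
# Hedged sketch: turning per-question correctness into per-category scores
# and a per-question Overall score, as one plausible reading of the table above.
from collections import defaultdict


def aggregate(results):
    """results: iterable of dicts like {"category": "Agent State", "correct": True}."""
    per_category = defaultdict(lambda: [0, 0])  # category -> [num_correct, num_total]
    for r in results:
        bucket = per_category[r["category"]]
        bucket[0] += int(r["correct"])
        bucket[1] += 1
    scores = {c: 100.0 * hit / total for c, (hit, total) in per_category.items()}
    # Per-question average: categories with more questions carry more weight
    # than in a plain mean of the category scores.
    total_hit = sum(hit for hit, _ in per_category.values())
    total_n = sum(total for _, total in per_category.values())
    scores["Overall"] = 100.0 * total_hit / total_n
    return scores


if __name__ == "__main__":
    demo = [{"category": "Agent State", "correct": True},
            {"category": "Agent Visible Info", "correct": False},
            {"category": "Agent Object Spatial", "correct": True}]
    print(aggregate(demo))
```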