OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding


The video illustrates OST-Bench's online setting: for the same question, the agent's answer evolves as it explores the scene, changing from t1 to t2 to t3, reflecting its continuously updated understanding.

Abstract

Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models in offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. Built on an efficient data-collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Further experimental analysis highlights the core challenges that must be addressed to improve online embodied reasoning.

Overview

  • 1. Online Processing: Continuously process and reason over incrementally acquired observations in real time.
  • 2. Spatio-Temporal Integration: Dynamically combine current observations with memory to support spatial reasoning across time (a minimal sketch of this interaction loop follows the list).
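
To make the online setting concrete, here is a minimal Python sketch of the interaction loop: the agent receives observations one step at a time, folds them into memory, and answers questions against that growing memory. All names below (Observation, OnlineAgent, query_mllm) are hypothetical illustrations, not OST-Bench's actual evaluation API.

```python
# Illustrative only: a toy online agent loop, not OST-Bench's real interface.
from dataclasses import dataclass, field


@dataclass
class Observation:
    rgb_frame: bytes  # the agent's current egocentric view
    timestep: int


def query_mllm(question: str, current: Observation, history: list) -> str:
    """Placeholder for a real MLLM call that would receive the question,
    the latest frame, and the remembered frames."""
    return f"answer at t={current.timestep}, using {len(history)} remembered frames"


@dataclass
class OnlineAgent:
    memory: list = field(default_factory=list)

    def observe(self, obs: Observation) -> None:
        # Online processing: frames arrive incrementally and are appended to
        # memory, rather than being provided as one offline batch.
        self.memory.append(obs)

    def answer(self, question: str) -> str:
        # Spatio-temporal integration: the current view is combined with
        # memory, so the same question can receive different answers at
        # t1, t2, t3, ... as exploration continues.
        return query_mllm(question, current=self.memory[-1], history=self.memory[:-1])


agent = OnlineAgent()
for t, frame in enumerate([b"frame0", b"frame1", b"frame2"]):
    agent.observe(Observation(rgb_frame=frame, timestep=t))
    print(agent.answer("Is the sofa on my left?"))
```

Each pass through the loop answers the same question with one more observation in memory, mirroring the evolving t1/t2/t3 answers shown in the video above.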

Leaderboard

Scores are accuracy (%).

| Type | Model | Overall | Agent State | Agent Visible Info | Agent-Object Spatial |
|------|-------|---------|-------------|--------------------|----------------------|
| Proprietary | Claude-3.5-Sonnet | 47.77 | 45.55 | 65.56 | 32.85 |
| Proprietary | Gemini-2.0-Flash | 49.54 | 45.05 | 70.82 | 33.80 |
| Proprietary | Gemini-2.0-Flash (Thinking) | 54.25 | 47.05 | 72.30 | 42.75 |
| Proprietary | GPT-4o | 48.72 | 38.83 | 72.76 | 33.52 |
| Proprietary | GPT-4.1 | 53.40 | 47.23 | 76.46 | 37.65 |
| Open-Source | InternVL-2.5-8B | 38.98 | 41.88 | 52.78 | 29.18 |
| Open-Source | InternVL-2.5-38B | 50.78 | 45.38 | 73.88 | 33.95 |
| Open-Source | InternVL-2.5-78B | 51.08 | 46.45 | 74.02 | 32.93 |
| Open-Source | QwenVL-2.5-7B | 41.16 | 40.43 | 52.56 | 31.53 |
| Open-Source | QwenVL-2.5-32B | 46.86 | 43.75 | 64.90 | 32.18 |
| Open-Source | QwenVL-2.5-72B | 45.62 | 43.48 | 64.46 | 28.87 |
| Open-Source | LLaVA-Video-7B | 39.28 | 33.50 | 58.32 | 28.80 |
| Open-Source | LLaVA-Video-72B | 43.22 | 39.95 | 60.48 | 35.07 |
| Open-Source | LLaVA-Onevision-7B | 40.36 | 31.08 | 55.24 | 33.63 |
| Open-Source | LLaVA-Onevision-72B | 43.44 | 38.88 | 61.60 | 36.23 |
| Baseline | Human-Level | 84.05 | 74.83 | 93.40 | 81.02 |
| Baseline | Chance-Level | 35.73 | 44.28 | 32.42 | 35.72 |
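
The Chance-Level row is the expected score of uninformed guessing. As a hedged sketch, one common way to estimate such a baseline for multiple-choice questions is to average 1/num_options over the question set; the schema below is hypothetical, and OST-Bench's exact protocol (e.g., for non-choice questions) may differ.

```python
# Hypothetical question schema; illustrative chance-baseline estimate only.
def chance_level(questions: list[dict]) -> float:
    """Expected accuracy (%) of uniform random guessing over MCQ questions."""
    return 100.0 * sum(1.0 / q["num_options"] for q in questions) / len(questions)


# A mix of 2-way and 4-way questions yields a chance level between 25% and
# 50%, which is why the chance rows differ across question subsets.
print(chance_level([{"num_options": 2}, {"num_options": 4}, {"num_options": 4}]))  # 33.33...
```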

Analysis

  • 1. Performance Drop During Exploration: We observe a significant decline in model accuracy as the agent explores further and the number of sequential observations grows in the online setting (a sketch of this per-step analysis follows the list).
  • 2. Error Distribution Statistics: Reasoning errors account for over 60% of all errors, making them the primary bottleneck.
  • 3. Spatio-Temporal Reasoning Shortcuts: Models tend to skip retrieving key information, instead taking shortcuts and relying on shallow, unsupported inferences.
  • 4. Cross-View Analysis: Model performance drops significantly when questions demand either complex clue-based spatial reasoning or long-term memory retrieval.
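
As a concrete illustration of point 1, the sketch below shows one way to compute a per-step accuracy curve: group each question by the number of observations the agent had received when it was asked, then measure accuracy within each group. The record fields are hypothetical, not the benchmark's actual schema.

```python
# Illustrative per-step accuracy analysis with a hypothetical record schema.
from collections import defaultdict


def accuracy_by_step(records: list[dict]) -> dict[int, float]:
    """records: [{"num_observations": int, "correct": bool}, ...]"""
    hits: dict[int, int] = defaultdict(int)
    totals: dict[int, int] = defaultdict(int)
    for r in records:
        totals[r["num_observations"]] += 1
        hits[r["num_observations"]] += int(r["correct"])
    # Percent accuracy per observation-count bucket, in increasing order.
    return {n: 100.0 * hits[n] / totals[n] for n in sorted(totals)}


print(accuracy_by_step([
    {"num_observations": 5, "correct": True},
    {"num_observations": 5, "correct": True},
    {"num_observations": 20, "correct": True},
    {"num_observations": 20, "correct": False},
]))  # e.g. {5: 100.0, 20: 50.0} -- a declining curve matches point 1
```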

BibTeX

BibTeX code here.