Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces (NYU (Saining Xie's group), Yale University, Stanford (Fei-Fei Li))
Thinking in Space
How Multimodal Large Language Models See, Remember and Recall Spaces
VSI-Bench: We introduce a high-quality benchmark for the evaluation of 3D, video-based, visual-spatial intelligence of MLLMs.
Evaluation: We evaluate VSI-Bench on open- and closed-source MLLMs and find that MLLMs exhibit competitive—though subhuman—visual-spatial intelligence.
Linguistic Analysis: We attribute VSI-Bench performance to spatial intelligence capabilities and show the differences between spatial and linguistic intelligence.
Visual Analysis: We illuminate how MLLMs remember spaces via cognitive maps and show how explicitly remembering spaces improves spatial capabilities.
Authors
Jihan Yang△*, Shusheng Yang△*, Anjali W. Gupta△*, Rilyn Han▲*, Fei-Fei Li★, Saining Xie△
Affiliations
△ New York University, ▲ Yale University, ★ Stanford University; *Equal contribution
Date
Dec. 18th, 2024
Figure 1: Can Multimodal LLMs “think spatially” when presented with a video recording of an environment? Can they build an accurate, implicit “cognitive map” that allows them to answer questions about a space? What are the strengths and limitations of using MLLMs to enhance spatial intelligence? We dig into these questions by setting up video data for MLLMs to watch, building a VQA benchmark to check their recall, and examining what the MLLMs actually remember and understand. We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive - though subhuman - visual-spatial intelligence. To understand the MLLMs’ behavior, we probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models.
VSI-Bench
Benchmark Overview: We develop VSI-Bench, a benchmark to evaluate the visual-spatial intelligence of Multimodal LLMs (MLLMs) using over 5,000 question-answer pairs derived from 288 egocentric videos sourced from the validation sets of public indoor 3D scene reconstruction datasets ScanNet, ScanNet++, and ARKitScenes. VSI-Bench includes eight tasks under three task types: configurational, measurement estimation, and spatiotemporal. See Fig. 2 for an overview of the tasks in VSI-Bench and Fig. 3 for dataset statistics. Iteratively refined for quality, VSI-Bench provides a foundation to study the connection between MLLMs and 3D reconstruction.
Figure 2: Task demonstrations in VSI-Bench. Note: the questions above are slightly simplified for clarity and brevity.
Figure 3: Benchmark statistics. Left: The distribution of tasks across three main categories. Right: Video length statistics.

VSI-Bench Construction: We develop a robust pipeline to construct VSI-Bench that enables high-quality question-answer (QA) pair generation at scale. Starting with data collection and unification, we standardize diverse 3D indoor scene datasets into a unified meta-information format, incorporating object categories, bounding boxes, and video specifications to support dataset-agnostic QA generation. QA pairs are generated using automated annotations from meta-information and task-specific question templates, with route planning tasks annotated manually. To ensure quality, we implement a human-in-the-loop review process, iteratively refining question templates, annotations, and QA generation rules by addressing ambiguities and errors flagged by evaluators.
Figure 4: Benchmark curation pipeline. The pipeline unifies datasets into a standardized format and semantic space for consistent processing. QA pairs are then generated through both human annotation and question templates. To ensure quality, human verification is implemented at all key stages for filtering low-quality videos, annotations, and ambiguous QA pairs.
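To make the template-driven QA generation in the curation pipeline more concrete, here is a minimal sketch in Python. The meta-information schema (`object_counts`, `room_size_m2`) and the template wording are hypothetical illustrations, not the benchmark's actual format.

```python
# Hypothetical sketch of template-based QA generation from unified scene
# meta-information; field names and templates are illustrative only.
import random

# A unified meta-information record for one scene (illustrative schema).
scene_meta = {
    "scene_id": "scannet_scene0011_00",
    "object_counts": {"chair": 4, "table": 1, "sofa": 2},
    "room_size_m2": 23.5,
}

QUESTION_TEMPLATES = {
    "object_count": "How many {category}(s) are in this room?",
    "room_size": "What is the size of this room (in square meters)?",
}

def generate_qa_pairs(meta):
    """Instantiate question templates with ground-truth annotations."""
    qa_pairs = []
    for category, count in meta["object_counts"].items():
        qa_pairs.append({
            "scene_id": meta["scene_id"],
            "task": "object_count",
            "question": QUESTION_TEMPLATES["object_count"].format(category=category),
            "answer": count,  # numerical answer (NA)
        })
    qa_pairs.append({
        "scene_id": meta["scene_id"],
        "task": "room_size",
        "question": QUESTION_TEMPLATES["room_size"],
        "answer": meta["room_size_m2"],  # numerical answer (NA)
    })
    return qa_pairs

print(random.choice(generate_qa_pairs(scene_meta)))
```

In the actual pipeline, route planning questions are written manually, and all generated pairs pass through the human-in-the-loop review described above.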
Evaluation on VSI-Bench
Evaluation Setup:
We benchmark 15 video-supporting MLLMs from diverse model families. For proprietary models, we consider Gemini-1.5 and GPT-4o. For open-source models, we evaluate models from InternVL2, VILA, LongVILA, LongVA, LLaVA-OneVision, and LLaVA-NeXT-Video. All evaluations are conducted in zero-shot settings with default prompts and greedy decoding for reproducibility. Tasks are evaluated using either Multiple-Choice Answer (MCA) accuracy or our proposed Mean Relative Accuracy (MRA) for Numerical Answer (NA) tasks.
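As a point of reference, below is a minimal sketch of how the Mean Relative Accuracy metric for numerical-answer tasks could be computed. The threshold set {0.50, 0.55, ..., 0.95} and the rule that a prediction counts as correct at threshold θ when its relative error is below 1 - θ are assumptions for illustration.

```python
# Sketch of Mean Relative Accuracy (MRA) for numerical-answer (NA) tasks.
# Assumed rule: a prediction is correct at confidence threshold theta when
# its relative error |y_hat - y| / y is below 1 - theta, averaged over
# theta in {0.50, 0.55, ..., 0.95}.
def mean_relative_accuracy(y_hat: float, y: float) -> float:
    thresholds = [0.50 + 0.05 * i for i in range(10)]  # 0.50 ... 0.95
    hits = [abs(y_hat - y) / y < 1.0 - t for t in thresholds]
    return sum(hits) / len(thresholds)

# Example: a room-size estimate of 20.0 m^2 against a ground truth of 23.5 m^2
# has ~14.9% relative error, so it passes 8 of the 10 thresholds.
print(mean_relative_accuracy(20.0, 23.5))  # 0.8
```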
Baselines include random selection and frequency-based answer selection to identify performance gains due to distribution biases. Additionally, human performance is assessed on a randomly sampled subset of 400 questions (VSI-Bench tiny), with metrics compared to Gemini-1.5 Pro.
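A minimal sketch of the frequency-based baseline mentioned above: always answer each task with its most common ground-truth option. The data layout is a hypothetical illustration.

```python
# Illustrative frequency-based baseline for multiple-choice tasks: predict
# the most frequent ground-truth answer per task, exposing distribution bias.
from collections import Counter

def frequency_baseline(qa_pairs):
    """Return a constant per-task prediction chosen from answer frequencies."""
    per_task = {}
    for qa in qa_pairs:
        per_task.setdefault(qa["task"], Counter())[qa["answer"]] += 1
    return {task: counts.most_common(1)[0][0] for task, counts in per_task.items()}

# Toy example with hypothetical multiple-choice answers.
qa_pairs = [
    {"task": "relative_direction", "answer": "left"},
    {"task": "relative_direction", "answer": "left"},
    {"task": "relative_direction", "answer": "right"},
]
print(frequency_baseline(qa_pairs))  # {'relative_direction': 'left'}
```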
Main Results:

Human evaluators achieve an average accuracy of 79%, outperforming the best model by 33%, with near-perfect performance (94%-100%) on configuration and spatiotemporal tasks. However, the gap narrows on measurement tasks that require precise estimation, where MLLMs demonstrate relative strength in quantitative tasks. Among proprietary models, Gemini-1.5 Pro stands out, significantly exceeding chance baselines and approaching human performance in tasks like absolute distance and room size estimation, despite being trained only on 2D digital data. Top-performing open-source models, such as LLaVA-NeXT-Video-72B and LLaVA-OneVision-72B, achieve competitive results, trailing Gemini-1.5 Pro by just 4%-5%. However, most open-source models (7/12) fall below chance baselines, revealing notable deficiencies in visual-spatial intelligence.
Table 1: Evaluation on VSI-Bench. Left: Dark gray indicates the best result among all models and light gray indicates the best result among open-source models. † indicates results on the VSI-Bench (tiny) set. Right: Results including the top-3 open-source models.

Blind Evaluation:
We compare MLLMs’ performance against “Chance Level (frequency)” and “Vision Disabled” (blind) results, averaged across six top models (three open-source and three closed-source). The consistent improvements in “Enabled-Disabled” and the general degradation in “Disabled-Chance” highlight the importance of video input for VSI-Bench, as blind models perform worse than chance. However, MLLMs struggle to surpass chance level on tasks such as absolute distance estimation, route planning, and relative direction, reflecting the inherent difficulty of these tasks. Interestingly, “Vision Disabled” models significantly outperform chance on object size tasks, likely due to the integration of common-sense knowledge from language model training.
Figure 5: Performance comparisons between Vision Enabled (w/ video), Vision Disabled (w/o video), and Chance Level (Freq.).
How MLLMs Think in Space Linguistically
To better understand when and why models succeed or fail and to elucidate the facets of visual-spatial intelligence they possess, we examine how MLLMs think in space linguistically.

Case Studies:
In the success example, the model demonstrates advanced video understanding with accurate timestamped descriptions and a correct step-by-step reasoning process. The use of a global coordinate system suggests that MLLMs may construct implicit world models by integrating spatial context and reasoning. In the error case, the model fails in egocentric-allocentric transformation, incorrectly interpreting a video sequence due to reliance on the egocentric view, leading to a flawed spatial inference.
Figure 6: Examples of how an MLLM thinks, as seen in its self-explanations.

Error Analysis:
Analysis of errors from the best-performing MLLM on VSI-Bench (tiny) identifies four main error types: visual perception, linguistic intelligence, relational reasoning, and egocentric-allocentric transformation. Figure 7 reveals that 71% of errors stem from spatial reasoning, particularly in understanding distance, size, and direction. This indicates that spatial reasoning remains the key bottleneck for improving MLLM performance on VSI-Bench.
Figure 7: Human-conducted analysis of errors by type.
Finding 1: Spatial reasoning is the primary bottleneck for MLLM performance on VSI-Bench.
Limits of CoT Methods in Visuospatial Tasks:
We investigate three prompting techniques—Zero-Shot Chain-of-Thought (CoT), Self-Consistency with CoT, and Tree-of-Thoughts (ToT)—to improve MLLM reasoning on VSI-Bench. Surprisingly, all three methods led to performance degradation (see Fig. 8), with Zero-Shot CoT and ToT reducing average performance by 4%, and Self-Consistency falling 1.1% below the baseline. While the appearance order and absolute distance estimation tasks saw slight improvements due to reduced linguistic errors, the room size and object size tasks suffered a large 8% to 21% decrease, showing that encouraging a model to think more can be not just unreliable but downright harmful.
Figure 8: Relative improvements of CoT, Self-Consistency, and Tree-of-Thought compared to the baseline.

Meanwhile, as shown in Tab. 2, Zero-Shot CoT achieves a 1.6% improvement on the general video understanding benchmark VideoMME.
Case                         Performance
Gemini-1.5 Pro (w/o CoT)     77.2
Gemini-1.5 Pro (w/ CoT)      79.8
Table 2: Gemini-1.5 Pro CoT performance on a 500-question subset of VideoMME.
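For context, the sketch below shows the kind of change Zero-Shot CoT makes relative to the default prompt: a single appended reasoning instruction. The exact prompt wording used in the evaluation is not given here, so these strings are illustrative.

```python
# Illustrative default prompt versus a Zero-Shot CoT variant; the exact
# wording used in the benchmark evaluation is hypothetical.
def build_prompt(question: str, use_cot: bool = False) -> str:
    prompt = (
        "These are frames from a video of a room. Answer the question about "
        f"the space shown in the video.\nQuestion: {question}\n"
    )
    if use_cot:
        # Zero-Shot CoT: ask the model to reason step by step before answering.
        prompt += "Let's think step by step.\n"
    prompt += "Answer:"
    return prompt

print(build_prompt("How far apart are the sofa and the table (in meters)?", use_cot=True))
```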
Finding 2: Linguistic prompting techniques, although effective in language reasoning and general visual tasks, are primarily harmful for spatial reasoning.
How MLLMs Think in Space Visually
Since humans subconsciously build mental representations of space when reasoning spatially, we explore how MLLMs remember spaces.

Probing via Cognitive Maps:
We evaluate MLLMs’ ability to create cognitive maps, a framework for spatial representation, by prompting Gemini-1.5 Pro to predict object center positions within a 10 × 10 grid based on video input. Accuracy is measured by comparing predicted object distances with ground truth maps, considering deviations within one grid unit as correct. The model achieves 64% accuracy in positioning close objects, which demonstrates its strong local spatial awareness. However, the model struggles with larger distances, reflecting the challenge of forming global spatial representations from discrete video frames.
Figure 9: Left: Visualizations of cognitive maps from the MLLM and GT. Right: Locality of the MLLM’s predicted cognitive maps.
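To make the probing metric concrete, here is a minimal sketch of scoring a predicted cognitive map against ground truth on a 10 × 10 grid, where a pair of objects counts as correct if the predicted pairwise distance deviates from the ground-truth distance by at most one grid unit. The pairing scheme and data layout are assumptions.

```python
# Sketch of scoring a predicted cognitive map against the ground-truth map:
# an object pair counts as correct when its predicted pairwise distance is
# within one grid unit of the ground-truth distance. Data layout is assumed.
from itertools import combinations
from math import dist

def cognitive_map_accuracy(pred_map, gt_map, tolerance=1.0):
    """pred_map / gt_map: dicts mapping object name -> (x, y) grid coordinates."""
    shared = [obj for obj in gt_map if obj in pred_map]
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0
    correct = sum(
        abs(dist(pred_map[a], pred_map[b]) - dist(gt_map[a], gt_map[b])) <= tolerance
        for a, b in pairs
    )
    return correct / len(pairs)

# Toy example on a 10 x 10 grid: nearby objects are placed consistently,
# while the far-away lamp drifts, echoing the locality effect in Figure 9.
pred = {"sofa": (2, 3), "table": (5, 3), "lamp": (9, 8)}
gt   = {"sofa": (2, 2), "table": (4, 2), "lamp": (7, 9)}
print(cognitive_map_accuracy(pred, gt))  # 2 of 3 pairs within tolerance
```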
Finding 3: When remembering spaces, an MLLM forms a series of local world models in its mind from a given video, rather than a unified global model.
Better Distance Reasoning via Cognitive Maps:
We explore whether cognitive maps could enhance MLLMs’ spatial reasoning by prompting Gemini-1.5 Pro to generate a map from video input and use it to answer relative distance questions. Results show a 10% accuracy improvement with the model’s own map and a 20%-32% gain using ground truth maps, highlighting the value of accurate mental imagery for enforcing global scene topology. This suggests cognitive mapping as a promising approach to improve MLLMs’ visual-spatial reasoning; a sketch of the two-stage prompting appears after Table 3.

(a) Cognitive map prompting.
Case                  Rel. Dist Acc.
w/o Cog. map          46.0
w/ Cog. map           56.0
w/ Cog. map (GT)      66.0

(b) Cognitive map canvas size.
Cog. Map Src.    Size       Rel. Dist Acc.
MLLM             10 × 10    56.0
MLLM             20 × 20    54.0
GT               10 × 10    66.0
GT               20 × 20    78.0

Table 3: Relative distance task with cognitive maps.
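A rough sketch of the two-stage prompting behind Table 3: first ask the model to externalize a cognitive map of the scene, then feed that map back as context for the relative-distance question. The prompts and the `query_mllm` helper are hypothetical placeholders for whichever MLLM API is used (e.g., Gemini-1.5 Pro).

```python
# Illustrative two-stage "cognitive map" prompting pipeline; query_mllm is a
# hypothetical stand-in for a real MLLM API call.
def query_mllm(video_frames, prompt: str) -> str:
    # Placeholder: replace with an actual multimodal model call.
    return "<model response>"

def answer_with_cognitive_map(video_frames, question: str) -> str:
    # Stage 1: ask the model to externalize a cognitive map of the scene.
    map_prompt = (
        "Place the center of each visible object on a 10 x 10 grid and "
        "return the coordinates as lines of the form 'object: (x, y)'."
    )
    cognitive_map = query_mllm(video_frames, map_prompt)

    # Stage 2: answer the relative-distance question with the map as context.
    qa_prompt = (
        f"Here is a cognitive map of the scene:\n{cognitive_map}\n"
        f"Using this map and the video, answer: {question}"
    )
    return query_mllm(video_frames, qa_prompt)

print(answer_with_cognitive_map([], "Which object is closer to the sofa, the table or the lamp?"))
```

Table 3 indicates the payoff grows with map quality: the model's own map helps, while ground-truth maps, especially on the larger 20 × 20 canvas, give the biggest gains.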
Conclusion
We study how models see, remember, and recall spaces by building VSI-Bench and investigating the performance and behavior of MLLMs on it. Our analysis of how MLLMs think in space linguistically and visually identifies existing strengths (e.g., prominent perceptual, temporal, and linguistic abilities) and bottlenecks for visual-spatial intelligence (e.g., egocentric-allocentric transformation and relational reasoning). While prevailing linguistic prompting methods fail to improve spatial reasoning, building explicit cognitive maps does enhance the spatial distance reasoning of MLLMs.
BibTeX
@article{yang2024think,
  title   = {{Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces}},
  author  = {Yang, Jihan and Yang, Shusheng and Gupta, Anjali W. and Han, Rilyn and Fei-Fei, Li and Xie, Saining},
  year    = {2024},
  journal = {arXiv preprint arXiv:2412.14171},
}