Feb 17, 2026 | arXiv:2602.15513

Improving MLLMs in Embodied Exploration and Question Answering with Human-Inspired Memory Modeling

AI Summary

This paper introduces a non-parametric memory framework for embodied agents built on MLLMs. It explicitly separates episodic and semantic memory to address long-horizon tasks under limited context. Episodic memory follows a retrieval-first, reasoning-assisted approach: experiences are recalled by semantic similarity and then verified with visual reasoning. Semantic memory is constructed via program-style rule extraction to support cross-environment generalization. Experiments on embodied question answering and exploration benchmarks (A-EQA and GOAT-Bench) demonstrate state-of-the-art performance, with improvements in LLM-Match, LLM-Match × SPL, success rate, and SPL.

Abstract

Deploying Multimodal Large Language Models (MLLMs) as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM-Match × SPL on A-EQA, as well as +7.7% success rate and +6.8% SPL on GOAT-Bench. Analyses reveal that our episodic memory primarily improves exploration efficiency, while semantic memory strengthens the complex reasoning of embodied agents.
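To make the retrieval-first, reasoning-assisted idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the `Episode` type, the `recall` function, the `top_k` parameter, and the `verify` callback are hypothetical names chosen for illustration, the embeddings are plain lists of floats, and the verifier is a stand-in for whatever visual reasoning step an MLLM would perform.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Episode:
    """One stored experience (hypothetical structure, not from the paper)."""
    embedding: Sequence[float]  # assumed semantic embedding of the observation
    observation: str            # stand-in for the stored visual observation


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(
    query: Sequence[float],
    memory: List[Episode],
    verify: Callable[[Episode], bool],
    top_k: int = 3,
) -> Optional[Episode]:
    """Retrieve candidates by similarity, then keep the first one that verifies."""
    # Step 1 (retrieval-first): rank stored episodes by semantic similarity.
    ranked = sorted(memory, key=lambda ep: cosine(query, ep.embedding), reverse=True)
    # Step 2 (reasoning-assisted): accept a candidate only if the verifier agrees.
    for episode in ranked[:top_k]:
        if verify(episode):
            return episode
    return None  # nothing verified; the agent would fall back to exploring


memory = [
    Episode([1.0, 0.0], "sofa in living room"),
    Episode([0.9, 0.1], "sofa in bedroom"),
    Episode([0.0, 1.0], "sink in kitchen"),
]
# Hypothetical verifier: pretend visual reasoning confirms only bedroom scenes.
match = recall([1.0, 0.0], memory, verify=lambda ep: "bedroom" in ep.observation)
```

In this toy run the most similar episode is rejected by the verifier and the second-ranked one is returned, which illustrates why a verification step after retrieval can matter when similar-looking memories are stale or refer to a different place.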

Multimodal Models, Robotics & Embodied AI, Tool Use & Agents
