The paper introduces LLaFEA, a novel approach that improves fine-grained spatio-temporal reasoning in LMMs by fusing frame-based video with event-camera data. LLaFEA uses cross-attention to integrate spatial features from frames with temporal features from events, applies self-attention to capture global spatio-temporal associations, and embeds textual position/duration tokens into the fused visual space. Experiments on a newly constructed real-world frame-event dataset with coordinate instructions demonstrate that LLaFEA improves spatio-temporal coordinate alignment, enabling LMMs to better interpret scenes at specific positions and times.
Event cameras can fill the temporal sparsity gap left by frame-based video, enabling LMMs to perform more precise spatio-temporal reasoning.
Large multimodal models (LMMs) excel at scene understanding but struggle with fine-grained spatio-temporal reasoning due to weak alignment between linguistic and visual representations. Existing methods map textual positions and durations into a visual space encoded from frame-based videos, but this space suffers from temporal sparsity that limits language-vision temporal coordination. To address this issue, we introduce LLaFEA (Large Language and Frame-Event Assistant), which leverages event cameras for temporally dense perception and frame-event fusion. Our approach employs a cross-attention mechanism to integrate complementary spatial and temporal features, followed by self-attention matching for global spatio-temporal associations. We further embed textual position and duration tokens into the fused visual space to enhance fine-grained alignment. This unified framework ensures robust spatio-temporal coordinate alignment, enabling LMMs to interpret scenes at any position and any time. In addition, we construct a dataset of real-world frame-event pairs with coordinate instructions and conduct extensive experiments to validate the effectiveness of the proposed method.
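The fusion pipeline the abstract describes (cross-attention between the two streams, then self-attention over the fused tokens) can be sketched roughly as below. This is a minimal NumPy illustration of the general pattern, not the paper's implementation: the token counts, feature dimension, and residual wiring are assumptions for the sake of the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: each query token aggregates values
    # from the key/value sequence.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16  # assumed feature dimension
frame_tokens = rng.standard_normal((8, d))   # spatially rich, temporally sparse frame features
event_tokens = rng.standard_normal((32, d))  # temporally dense event-stream features

# Cross-attention: each stream queries the complementary one
# (frames pull in temporal detail, events pull in spatial detail).
frames_fused = frame_tokens + attention(frame_tokens, event_tokens, event_tokens)
events_fused = event_tokens + attention(event_tokens, frame_tokens, frame_tokens)

# Self-attention over the concatenated sequence for global
# spatio-temporal associations.
fused = np.concatenate([frames_fused, events_fused], axis=0)
fused = fused + attention(fused, fused, fused)

print(fused.shape)  # (40, 16)
```

In the full model, textual position and duration tokens would then be projected into this fused space so that language queries can index specific coordinates and times.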