The paper introduces LLaFEA, a novel approach that improves fine-grained spatio-temporal reasoning in LMMs by fusing frame-based video with event-camera data. LLaFEA uses cross-attention to integrate spatial features from frames with temporal features from events, applies self-attention to capture global spatio-temporal associations, and embeds textual position/duration tokens into the fused visual space. Experiments on a newly constructed real-world frame-event dataset with coordinate instructions demonstrate that LLaFEA improves spatio-temporal coordinate alignment, enabling LMMs to better interpret scenes at specific positions and times.
Event cameras can fill the temporal sparsity gap left by frame-based video, enabling LMMs to perform more precise spatio-temporal reasoning.
Large multimodal models (LMMs) excel at scene understanding but struggle with fine-grained spatio-temporal reasoning due to weak alignment between linguistic and visual representations. Existing methods map textual positions and durations into a visual space encoded from frame-based videos, but this space suffers from temporal sparsity that limits language-vision temporal coordination. To address this issue, we introduce LLaFEA (Large Language and Frame-Event Assistant), which leverages event cameras for temporally dense perception and frame-event fusion. Our approach employs a cross-attention mechanism to integrate complementary spatial and temporal features, followed by self-attention matching for global spatio-temporal associations. We further embed textual position and duration tokens into the fused visual space to enhance fine-grained alignment. This unified framework ensures robust spatio-temporal coordinate alignment, enabling LMMs to interpret scenes at any position and any time. In addition, we construct a dataset of real-world frame-event pairs with coordinate instructions and conduct extensive experiments to validate the effectiveness of the proposed method.
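The fusion pipeline the abstract describes (cross-attention between the two streams, then self-attention over the fused tokens) can be sketched roughly as below. This is a minimal NumPy illustration of the general pattern, not the paper's implementation: the token counts, feature dimension, and residual wiring are assumptions for the sake of the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: each query token aggregates values
    # from the key/value sequence.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16  # assumed feature dimension
frame_tokens = rng.standard_normal((8, d))   # spatially rich, temporally sparse frame features
event_tokens = rng.standard_normal((32, d))  # temporally dense event-stream features

# Cross-attention: each stream queries the complementary one
# (frames pull in temporal detail, events pull in spatial detail).
frames_fused = frame_tokens + attention(frame_tokens, event_tokens, event_tokens)
events_fused = event_tokens + attention(event_tokens, frame_tokens, frame_tokens)

# Self-attention over the concatenated sequence for global
# spatio-temporal associations.
fused = np.concatenate([frames_fused, events_fused], axis=0)
fused = fused + attention(fused, fused, fused)

print(fused.shape)  # (40, 16)
```

In the full model, textual position and duration tokens would then be projected into this fused space so that language queries can index specific coordinates and times.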