Feb 25, 2026 · arXiv:2602.21819

SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance

Minghan Yang, Ke Li, Honggang Zhang, Kaiyue Pang, Yizhe Song

AI Summary

The paper introduces SemVideo, an fMRI-to-video reconstruction framework that uses hierarchical semantic information to address two weaknesses of existing methods: inconsistent visual representations of salient objects across frames and poor temporal coherence. A hierarchical guidance module, SemMiner, extracts static anchor descriptions, motion-oriented narratives, and holistic summaries from the video stimulus. SemVideo aligns fMRI signals with CLIP-style embeddings derived from SemMiner, reconstructs motion with a tripartite attention fusion architecture, and reports state-of-the-art performance on the CC2017 and HCP datasets.

Key Contribution

SemVideo guides fMRI-to-video reconstruction with three levels of semantic cues (static anchor descriptions, motion-oriented narratives, and holistic summaries), which the authors report improves both semantic alignment and temporal consistency over prior methods.

Abstract

Reconstructing dynamic visual experiences from brain activity provides a compelling avenue for exploring the neural mechanisms of human visual perception. While recent progress in fMRI-based image reconstruction has been notable, extending this success to video reconstruction remains a significant challenge. Current fMRI-to-video reconstruction approaches consistently encounter two major shortcomings: (i) inconsistent visual representations of salient objects across frames, leading to appearance mismatches; (ii) poor temporal coherence, resulting in motion misalignment or abrupt frame transitions. To address these limitations, we introduce SemVideo, a novel fMRI-to-video reconstruction framework guided by hierarchical semantic information. At the core of SemVideo is SemMiner, a hierarchical guidance module that constructs three levels of semantic cues from the original video stimulus: static anchor descriptions, motion-oriented narratives, and holistic summaries. Leveraging this semantic guidance, SemVideo comprises three key components: a Semantic Alignment Decoder that aligns fMRI signals with CLIP-style embeddings derived from SemMiner, a Motion Adaptation Decoder that reconstructs dynamic motion patterns using a novel tripartite attention fusion architecture, and a Conditional Video Render that leverages hierarchical semantic guidance for video reconstruction. Experiments conducted on the CC2017 and HCP datasets demonstrate that SemVideo achieves superior performance in both semantic alignment and temporal consistency, setting a new state-of-the-art in fMRI-to-video reconstruction.
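The abstract does not state how the Semantic Alignment Decoder is trained. As a rough illustration only, the sketch below shows one common way to align predicted embeddings with CLIP-style targets: a contrastive (InfoNCE) objective over cosine similarities. The function names, the temperature value, the embedding size, and the random vectors standing in for fMRI-derived and SemMiner-derived embeddings are all assumptions made for this example, not details from the paper.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def info_nce(pred, target, temperature=0.07):
    """Mean cross-entropy of matching each predicted embedding to its paired target.

    pred[i] is assumed to be paired with target[i]; all other targets in the
    batch act as negatives. The temperature is a hypothetical default.
    """
    n = len(pred)
    total = 0.0
    for i in range(n):
        logits = [cosine(pred[i], target[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / n

random.seed(0)
dim = 64  # hypothetical embedding size, chosen small for the demo

# Stand-ins for CLIP-style target embeddings of four video clips.
targets = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
# Stand-ins for embeddings predicted from fMRI by an untrained decoder.
random_pred = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]

loss_random = info_nce(random_pred, targets)
loss_aligned = info_nce(targets, targets)  # perfect alignment as a reference point
```

Under this objective, predictions that match their paired targets yield a loss near zero, while unrelated predictions yield a loss on the order of the log of the batch size, so a decoder trained this way is pushed toward the semantic embedding of the clip the subject actually watched.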

Computer Vision · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
