Minghan Yang

$$
\begin{split}
\mathcal{L}_{\text{SoftCLIP}} &= -\sum_{j=1}^{B}\Biggl[\frac{\exp\Bigl(\frac{Z(C_{L})\cdot Z(C_{L}^{j})}{\tau}\Bigr)}{\sum_{k=1}^{B}\exp\Bigl(\frac{Z(C_{L})\cdot Z(C_{L}^{k})}{\tau}\Bigr)}\\
&\qquad\times\log\Bigl(\frac{\exp\Bigl(\frac{f_{\text{SAD}}^{\textit{MLP}}(X^{\prime})\cdot Z(C_{L}^{j})}{\tau}\Bigr)}{\sum_{k=1}^{B}\exp\Bigl(\frac{f_{\text{SAD}}^{\textit{MLP}}(X^{\prime})\cdot Z(C_{L}^{k})}{\tau}\Bigr)}\Bigr)\Biggr],
\end{split}
\tag{5}
$$

$$
\mathcal{L}_{\text{refine}} = \mathbb{E}_{t\sim[1,T]}\left\|f_{\text{SAD}}^{\textit{Refine}}\left(Z_{L}^{t},\,t,\,f_{\text{SAD}}^{\textit{MLP}}(X^{\prime})\right)-Z(C_{L})\right\|^{2},
\tag{6}
$$

$$
\mathcal{L}_{\text{SAD}} = \lambda_{\text{refine}}\,\mathcal{L}_{\text{refine}}+\lambda_{\text{SoftCLIP}}\,\mathcal{L}_{\text{SoftCLIP}}+\mathcal{L}_{\text{MSE}},
\tag{7}
$$

where $B$ denotes the batch size, $j$ indexes the $j$-th sample in the batch, and $\tau$ is a temperature hyperparameter. $T$ represents the total number of denoising timesteps, and $Z_{L}^{t}$ is the perturbed semantic embedding of $C_{L}$ at timestep $t$. $\lambda_{\text{refine}}$ and $\lambda_{\text{SoftCLIP}}$ are weighting coefficients that balance the contributions of the corresponding loss components.

Figure 3: Overview of the SemVideo inference pipeline. fMRI signals are first decoded into $\hat{Z}(C_{\text{L}})$ by the SAD. $\hat{Z}(C_{\text{motion}})$ conditions the MAD to refine frame embeddings $\hat{E}(x)$, which are passed through a VAE decoder, generating a blurry video. $\hat{E}(x)$ and $\hat{Z}(C_{\text{anchor}})$ guide the SD model to generate anchor frame, combined with the blurry video and $\hat{Z}(C_{\text{holi}})$, is fed into a T
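
As a rough illustration of Eqs. (5) and (7), the sketch below computes a SoftCLIP-style loss with NumPy. It is not the authors' implementation: the function names (`soft_clip_loss`, `sad_total_loss`), the default temperature, and the choice to average the per-sample loss over the batch are all assumptions made here for illustration. The refinement and MSE terms are treated as already-computed scalars, and the embeddings are used as given, with no normalization.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def soft_clip_loss(pred, target, tau=0.1):
    """SoftCLIP-style loss in the spirit of Eq. (5).

    pred:   (B, D) array standing in for f_SAD^MLP(X') per sample.
    target: (B, D) array standing in for Z(C_L) per sample.
    tau:    temperature (the default 0.1 is an assumed value).
    """
    # Soft labels: softmax over target-to-target similarities (first factor).
    soft_labels = np.exp(log_softmax(target @ target.T / tau, axis=1))
    # Log-probabilities of prediction-to-target similarities (second factor).
    log_probs = log_softmax(pred @ target.T / tau, axis=1)
    # Eq. (5) sums over j for one sample; averaging over samples is an
    # assumption of this sketch.
    per_sample = -(soft_labels * log_probs).sum(axis=1)
    return per_sample.mean()


def sad_total_loss(l_refine, l_softclip, l_mse, lam_refine, lam_softclip):
    """Weighted combination from Eq. (7); the weights are supplied by the caller."""
    return lam_refine * l_refine + lam_softclip * l_softclip + l_mse


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(4, 8))  # toy stand-in for Z(C_L)
    pred = rng.normal(size=(4, 8))    # toy stand-in for f_SAD^MLP(X')
    print(soft_clip_loss(pred, target, tau=1.0))
```

Because the soft labels come from the target similarities themselves, the loss is a cross-entropy against a soft distribution rather than a one-hot label, so it is smallest when the prediction-to-target similarity distribution matches the target-to-target one.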

Research focus

Computer Vision (1), Multimodal Models (1)

Frequent co-authors

Honggang Zhang (1), Kaiyue Pang (1), Yizhe Song (1)

Papers (1)

Feb 25, 2026·also BUPT

SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance

SemVideo reconstructs videos from brain activity using hierarchical semantic guidance, with the aim of producing more coherent and accurate reconstructions than earlier approaches.

Minghan Yang, Honggang Zhang, Kaiyue Pang +1

Computer Vision, Multimodal Models
