This paper introduces DFAlign, a novel framework for Open-Vocabulary Temporal Action Detection (OV-TAD) that uses diffusion-based denoising to generate foreground knowledge and improve action-video alignment. DFAlign employs a Semantic-Unify Conditioning (SUC) module to unify action semantics for denoising, a Background-Suppress Denoising (BSD) module to extract foreground knowledge by removing background redundancy, and a Foreground-Prompt Alignment (FPA) module to inject this knowledge into text representations. Experiments on OV-TAD benchmarks demonstrate state-of-the-art performance, highlighting the effectiveness of foreground knowledge prompting for cross-modal alignment.
Diffusion models can bridge the semantic gap between abstract action labels and complex video content, leading to state-of-the-art performance in open-vocabulary temporal action detection.
Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action semantics and video representations is critical for accurate detection. However, existing methods struggle to mitigate the semantic imbalance between concise, abstract action labels and rich, complex video content, which inevitably introduces semantic noise and misleads cross-modal alignment. To address this challenge, we propose DFAlign, the first framework that leverages diffusion-based denoising to generate foreground knowledge that guides action-video alignment. Following a 'conditioning, denoising, and aligning' pipeline, we first introduce the Semantic-Unify Conditioning (SUC) module, which unifies action-shared and action-specific semantics as conditions for diffusion denoising. Then, the Background-Suppress Denoising (BSD) module generates foreground knowledge by progressively removing background redundancy from videos through the denoising process. This foreground knowledge serves as an effective intermediate semantic anchor between video and text representations, mitigating the semantic gap and enhancing the discriminability of action-relevant segments. Furthermore, we introduce the Foreground-Prompt Alignment (FPA) module, which injects the extracted foreground knowledge as prompt tokens into text representations, guiding the model's attention towards action-relevant segments and enabling precise cross-modal alignment. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two OV-TAD benchmarks. The code is available at https://anonymous.4open.science/r/Code-2114/.
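The idea of injecting foreground knowledge as prompt tokens can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the mean pooling, and the 3-dimensional features below are all assumptions made for illustration, and a real system would pass the prompted tokens through a text encoder rather than averaging them. The sketch only shows how adding foreground tokens to a label's tokens can shift the text embedding toward action-relevant video segments.

```python
import math

def mean_pool(tokens):
    """Average a list of equal-length vectors into a single vector."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def prompted_text_embedding(label_tokens, foreground_tokens):
    """Hypothetical stand-in for prompt injection: prepend the foreground
    tokens to the label tokens, then mean-pool (an assumed simplification)."""
    return mean_pool(foreground_tokens + label_tokens)

def score_segments(segment_features, text_embedding):
    """Score each video segment by cosine similarity to the text embedding."""
    return [cosine(seg, text_embedding) for seg in segment_features]

# Toy features (made up): one label token, one foreground token,
# one action-relevant segment and one background segment.
label_tokens = [[1.0, 0.0, 0.0]]
foreground_tokens = [[0.0, 1.0, 0.0]]
segments = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

text = prompted_text_embedding(label_tokens, foreground_tokens)
scores = score_segments(segments, text)
baseline = score_segments(segments, mean_pool(label_tokens))
```

In this toy setup the action-relevant segment scores higher with the foreground prompt than with the label alone, while the background segment stays at zero similarity in both cases.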