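The paper investigates whether video diffusion models retain information about physical plausibility during denoising. By probing intermediate representations of a pretrained Diffusion Transformer (DiT), the authors find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels, and that this separability is not fully explained by visual quality or generator identity. They then introduce "progressive trajectory selection," an inference-time strategy in which a lightweight physics verifier trained on frozen DiT features scores parallel denoising trajectories at a few intermediate checkpoints and prunes low-scoring candidates early. On PhyGenBench, the method improves physical consistency and achieves results comparable to Best-of-K sampling with substantially fewer denoising steps.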
Surprisingly, video diffusion models contain recoverable physics-related cues in their intermediate denoising representations, and these cues can guide inference toward more physically plausible videos with fewer denoising steps than Best-of-K sampling.
Do video diffusion models encode signals predictive of physical plausibility? We probe intermediate denoising representations of a pretrained Diffusion Transformer (DiT) and find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels. This separability cannot be fully attributed to visual quality or generator identity, suggesting recoverable physics-related cues in frozen DiT features. Leveraging this observation, we introduce progressive trajectory selection, an inference-time strategy that scores parallel denoising trajectories at a few intermediate checkpoints using a lightweight physics verifier trained on frozen features, and prunes low-scoring candidates early. Extensive experiments on PhyGenBench demonstrate that our method improves physical consistency while reducing inference cost, achieving comparable results to Best-of-K sampling with substantially fewer denoising steps.
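The pruning loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `progressive_trajectory_selection`, the `keep_fraction` parameter, the checkpoint positions, and the toy `denoise_step` and `score_fn` below are all assumptions made for illustration. In particular, the real verifier scores frozen mid-layer DiT features, whereas here a simple function of the state stands in for it.

```python
from typing import Callable, Iterable, List, Set, Tuple


def progressive_trajectory_selection(
    init_states: Iterable[float],
    denoise_step: Callable[[float, int], float],
    score_fn: Callable[[float], float],
    num_steps: int,
    checkpoints: Set[int],
    keep_fraction: float = 0.5,
) -> Tuple[float, int]:
    """Denoise several candidates in parallel and prune low scorers early.

    Returns the best surviving state and the total number of
    denoise_step calls spent across all candidates.
    """
    alive: List[float] = list(init_states)
    steps_used = 0
    for t in range(num_steps):
        # Advance every surviving trajectory by one denoising step.
        alive = [denoise_step(x, t) for x in alive]
        steps_used += len(alive)
        # At a checkpoint, keep only the top-scoring fraction.
        if t in checkpoints and len(alive) > 1:
            ranked = sorted(alive, key=score_fn, reverse=True)
            keep = max(1, int(len(ranked) * keep_fraction))
            alive = ranked[:keep]
    return max(alive, key=score_fn), steps_used


# Toy stand-ins (hypothetical): a "latent" is a single float,
# denoising shrinks it, and the verifier prefers values near zero.
def toy_denoise(x: float, t: int) -> float:
    return x * 0.9


def toy_score(x: float) -> float:
    return -abs(x)


best, steps_used = progressive_trajectory_selection(
    init_states=[float(i) for i in range(1, 9)],  # K = 8 candidates
    denoise_step=toy_denoise,
    score_fn=toy_score,
    num_steps=10,
    checkpoints={2, 5},
)
best_of_k_steps = 8 * 10  # cost of running all K candidates to the end
```

In this toy setup, pruning at two checkpoints spends 44 denoising-step calls, compared with 80 for running all 8 candidates to completion and selecting afterwards. The actual savings and quality trade-off depend on the checkpoint schedule and on how well early verifier scores predict final plausibility.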