Mar 16, 2026 | arXiv:2603.15132

WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation

AI Summary

Waypoint Diffusion Transformers (WiT) address trajectory conflicts in pixel-space diffusion models by introducing intermediate semantic waypoints projected from pre-trained vision models. WiT factorizes the continuous vector field into prior-to-waypoint and waypoint-to-pixel segments, disentangling the generation trajectories. A lightweight generator infers the waypoints dynamically during denoising, and they condition the diffusion transformer via Just-Pixel AdaLN. On ImageNet 256x256, WiT outperforms strong pixel-space baselines and accelerates JiT training convergence by 2.2x.
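To make the conditioning step concrete, here is a minimal sketch of generic adaptive LayerNorm (AdaLN), the mechanism family that Just-Pixel AdaLN is named after. The summary does not specify how Just-Pixel AdaLN differs from standard AdaLN, so this shows only the standard form under stated assumptions: the function name `ada_layer_norm`, the projection matrices `w_scale` and `w_shift`, and all shapes are hypothetical and chosen for illustration.

```python
import numpy as np

def ada_layer_norm(x, cond, w_scale, w_shift, eps=1e-6):
    """Generic AdaLN sketch (not the paper's exact Just-Pixel AdaLN).

    x:       (tokens, dim) token features
    cond:    (cond_dim,) conditioning vector, standing in for a waypoint embedding
    w_scale: (cond_dim, dim) hypothetical projection producing the scale
    w_shift: (cond_dim, dim) hypothetical projection producing the shift
    """
    # Plain LayerNorm over the feature axis, without learned affine parameters.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)

    # The conditioning vector decides the per-feature scale and shift.
    scale = cond @ w_scale
    shift = cond @ w_shift
    return x_hat * (1.0 + scale) + shift

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
cond = rng.normal(size=(3,))

# With zero projections the layer reduces to plain LayerNorm.
out_identity = ada_layer_norm(x, cond, np.zeros((3, 8)), np.zeros((3, 8)))

# With non-zero projections the same tokens are modulated by the condition.
out_modulated = ada_layer_norm(x, cond, rng.normal(size=(3, 8)), rng.normal(size=(3, 8)))
```

Parameterizing the scale as `1.0 + scale` means zero-valued projections leave the normalized features unchanged, which is a common initialization choice for this kind of modulation.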

Key Contribution

By inserting semantic waypoints into pixel-space diffusion, WiT converges faster and improves image quality without resorting to information-lossy latent representations.

Abstract

While recent Flow Matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the lack of semantic continuity in the pixel manifold severely intertwines optimal transport paths. This induces severe trajectory conflicts near intersections, yielding sub-optimal solutions. Rather than bypassing this issue via information-lossy latent representations, we directly untangle the pixel-space trajectories by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field via intermediate semantic waypoints projected from pre-trained vision models. It effectively disentangles the generation trajectories by breaking the optimal transport into prior-to-waypoint and waypoint-to-pixel segments. Specifically, during the iterative denoising process, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state. They then continuously condition the primary diffusion transformer via the Just-Pixel AdaLN mechanism, steering the evolution towards the next state, ultimately yielding the final RGB pixels. Evaluated on ImageNet 256x256, WiT beats strong pixel-space baselines, accelerating JiT training convergence by 2.2x. Code will be publicly released at https://github.com/hainuo-wang/WiT.git.
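The sampling procedure described in the abstract (infer a waypoint from the current noisy state, condition the main model on it, step towards the next state) can be sketched as a toy loop. Everything concrete here is an assumption made for illustration and not taken from the paper: the Euler solver, the convention that noise sits at t = 0 and data at t = 1, the names `infer_waypoint`, `velocity`, and `DECODER`, and the stub models, which stand in for the lightweight generator and the diffusion transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
PIXEL_DIM, WAYPOINT_DIM = 6, 2

# Hypothetical fixed map from waypoint space to pixel space (toy only).
DECODER = rng.normal(size=(WAYPOINT_DIM, PIXEL_DIM))
# Hypothetical "correct" semantic waypoint for the sample being generated.
TRUE_WAYPOINT = np.array([0.5, -1.0])

def infer_waypoint(x_t, t):
    """Stub for the lightweight waypoint generator.

    A real generator would predict the waypoint from (x_t, t); this stub
    ignores its inputs and returns a fixed vector.
    """
    return TRUE_WAYPOINT

def velocity(x_t, t, waypoint):
    """Stub for the waypoint-conditioned main model.

    Assumes a linear path x_t = (1 - t) * x_0 + t * x_1, so the velocity
    that reaches the estimated endpoint is (x_1 - x_t) / (1 - t).
    """
    x1_estimate = waypoint @ DECODER
    return (x1_estimate - x_t) / (1.0 - t)

def sample(num_steps=10):
    x = rng.normal(size=PIXEL_DIM)  # draw from the noise prior at t = 0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                       # t stays below 1, so 1 - t > 0
        w = infer_waypoint(x, t)         # waypoint re-inferred at every step
        x = x + dt * velocity(x, t, w)   # explicit Euler update
    return x

x_final = sample()
```

Because the stub generator always returns the same waypoint, the loop lands on `TRUE_WAYPOINT @ DECODER`. The point of the sketch is the control flow: the waypoint is recomputed from the current state at every step rather than fixed once before sampling.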

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
