This paper introduces CLEAR, a framework for end-to-end autonomous driving that integrates ultra-fast generative planning with deep semantic reasoning to overcome the latency of traditional diffusion models. By employing a single-step conditional drift in a VAE latent space and fine-tuning a language model on driving QA pairs to guide planning, CLEAR balances maneuver diversity and expert precision. The framework achieves state-of-the-art performance on the NAVSIM v1 benchmark, demonstrating that efficient, high-fidelity multi-modal planning is feasible without relying on dense geometric annotations or iterative sampling.
With a PDMS of 93.7 on the NAVSIM v1 benchmark, CLEAR shows that multi-modal planning for autonomous driving can be efficient without iterative sampling.
End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Meanwhile, we fully fine-tune Qwen 3.5 0.8B on driving QA pairs to extract scene-aware hidden states. These states guide both an Adaptive Scheduler, which selects the conditioning coefficient $\alpha$ and sample count $N$ from a discrete set of predefined schemes, and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark, CLEAR achieves a state-of-the-art PDMS of 93.7. Our results demonstrate that high-fidelity, multi-modal planning can be executed efficiently without dense geometric annotations or iterative sampling.
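The control flow described in the abstract (pick a scheme, generate candidates in one step, score them) can be illustrated with a small toy sketch. It is not the authors' implementation. The abstract gives neither the drift equation nor the scheme values, so the linear drift form, the `SCHEMES` table, the function names, and the random weights below are all assumptions made for illustration. The sketch also assumes the hidden states and latents share one dimension, and it omits the VAE decoder that would map a latent to a trajectory.

```python
import math
import random

# Hypothetical (alpha, N) schemes; the abstract does not list the actual values.
SCHEMES = [(0.9, 4), (0.6, 8), (0.3, 16)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_scheme(hidden, w_sched):
    """Adaptive Scheduler stand-in: linear head over the mean-pooled hidden state."""
    dim = len(hidden[0])
    pooled = [sum(tok[i] for tok in hidden) / len(hidden) for i in range(dim)]
    logits = [dot(pooled, w) for w in w_sched]  # one weight vector per scheme
    return SCHEMES[logits.index(max(logits))]

def single_step_drift(z_noise, z_cond, alpha):
    """Move a noise latent toward the conditioned latent in one step.

    Assumed linear form: alpha = 1 collapses onto z_cond (precision),
    alpha = 0 keeps the raw noise sample (diversity).
    """
    return [zn + alpha * (zc - zn) for zn, zc in zip(z_noise, z_cond)]

def score_candidate(cand, hidden, w_out):
    """Cross-attention stand-in: candidate is the query, hidden states are keys and values."""
    scale = math.sqrt(len(cand))
    logits = [dot(cand, tok) / scale for tok in hidden]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    context = [sum(w * tok[i] for w, tok in zip(weights, hidden)) / total
               for i in range(len(cand))]
    return dot([c + x for c, x in zip(cand, context)], w_out)

def plan(hidden, z_cond, w_sched, w_out, rng):
    """Select a scheme, draw N one-step candidates, and return the best-scoring latent."""
    alpha, n = select_scheme(hidden, w_sched)
    candidates = [
        single_step_drift([rng.gauss(0.0, 1.0) for _ in z_cond], z_cond, alpha)
        for _ in range(n)
    ]
    best = max(candidates, key=lambda c: score_candidate(c, hidden, w_out))
    return best, alpha, n
```

In this sketch every candidate costs a single drift evaluation, so latency grows with the sample count $N$ rather than with a number of denoising steps. That is the trade-off the scheduler is described as managing through its choice of $\alpha$ and $N$.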