The paper introduces ETIA, a novel approach for generating coherent surround-view images from text prompts using a recurrent attention-based encoder-decoder architecture combined with a text-to-image diffusion model. ETIA employs a ViewNet Unet2d architecture with dual cross-attention mechanisms to align text embeddings with image latents and integrate previously generated images, ensuring both prompt adherence and sequence continuity. Experiments on nuScenes demonstrate that ETIA achieves state-of-the-art performance in image quality (FVD 99, FID 12.6) and annotation accuracy (PQ 67.4, mIoU 80.1, mAP 65.4) compared to existing methods.
Autonomous driving scene generation gets a boost: ETIA's dual cross-attention diffusion model nails both high-fidelity image synthesis and accurate semantic annotation, outperforming existing methods on nuScenes.
Generating high-fidelity surround-view images from text prompts is a complex task that requires balancing contextual coherence with computational efficiency. The proposed work introduces a novel methodology that combines a recurrent attention-based encoder-decoder architecture with a text-to-image diffusion model to produce coherent, continuous surround-view images. The approach uses a custom text encoder to convert input text prompts into contextual embeddings, which are then processed by the proposed ViewNet Unet2d architecture within the decoder. This architecture employs dual cross-attention mechanisms: one aligns text embeddings with the corresponding noise image latents, while the other integrates previously generated image latents to ensure continuity across the sequence. Each generated image therefore adheres to its specific prompt while remaining coherent with preceding images. In addition, an annotation decoder is introduced that generates semantic segmentation maps, instance segmentation masks, and object detection annotations; it processes latent image maps using a shared feature extraction backbone with dedicated heads for each annotation task. Experimental results on the nuScenes validation set demonstrate the model's effectiveness in producing high-quality, contextually aligned surround-view images: it achieves an FVD of 99 and an FID of 12.6, outperforming existing methods such as Panacea+ and DriveDreamer-2. The approach also improves segmentation and detection accuracy, achieving a PQ of 67.4, an mIoU of 80.1, and an mAP of 65.4, surpassing methods such as OpenSeeD and D2Det. An ablation study highlights the contributions of key architectural components: integrating positional encoding, self-attention, and concurrent attention significantly enhances generation quality, yielding the reported FVD of 99 and FID of 12.6.
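The dual cross-attention idea described above can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: the class name `DualCrossAttentionBlock`, the dimensions, and the pre-norm residual layout are illustrative assumptions; only the two-conditioning-stream structure (text embeddings plus previous-view latents) follows the abstract.

```python
import torch
import torch.nn as nn

class DualCrossAttentionBlock(nn.Module):
    """Hypothetical sketch of ETIA-style dual cross-attention.

    One cross-attention aligns text embeddings with the noisy image latents;
    a second attends over the previously generated view's latents so each new
    view stays continuous with the sequence. All sizes are illustrative.
    """

    def __init__(self, dim: int, text_dim: int, heads: int = 8):
        super().__init__()
        # Cross-attention 1: queries are image latents, keys/values are text embeddings.
        self.text_attn = nn.MultiheadAttention(
            dim, heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )
        # Cross-attention 2: keys/values come from the previous view's latents.
        self.prev_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, latents, text_emb, prev_latents):
        # Align the current noisy latents with the text prompt embeddings.
        latents = latents + self.text_attn(self.norm1(latents), text_emb, text_emb)[0]
        # Condition on the previous view's latents for cross-view continuity.
        latents = latents + self.prev_attn(self.norm2(latents), prev_latents, prev_latents)[0]
        return latents
```

In this sketch each conditioning signal gets its own residual cross-attention step, so prompt adherence and sequence continuity are enforced by separate, independently learnable attention maps.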
Experimental results demonstrate the effectiveness of the proposed work in producing high-quality, contextually aligned surround-view images with comprehensive annotations, pushing the boundaries of text-to-image synthesis and scene understanding.
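The annotation decoder's shared-backbone, multi-head layout can likewise be sketched in a few lines. The layer shapes, channel counts, and class numbers below are illustrative assumptions, not details from the paper; only the structure (one shared feature extractor over latent image maps feeding dedicated semantic, instance, and detection heads) mirrors the abstract.

```python
import torch
import torch.nn as nn

class AnnotationDecoder(nn.Module):
    """Hypothetical sketch of the annotation decoder: a shared convolutional
    backbone over latent image maps feeds three task-specific heads.
    Channel counts and class numbers are placeholders."""

    def __init__(self, latent_ch: int = 4, feat_ch: int = 64, num_classes: int = 10):
        super().__init__()
        # Shared feature extraction backbone over the latent image maps.
        self.backbone = nn.Sequential(
            nn.Conv2d(latent_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        self.semantic_head = nn.Conv2d(feat_ch, num_classes, 1)    # per-pixel class logits
        self.instance_head = nn.Conv2d(feat_ch, 1, 1)              # instance mask logits
        self.detect_head = nn.Conv2d(feat_ch, 4 + num_classes, 1)  # box offsets + class scores

    def forward(self, latents):
        feats = self.backbone(latents)
        return {
            "semantic": self.semantic_head(feats),
            "instance": self.instance_head(feats),
            "detection": self.detect_head(feats),
        }
```

Sharing the backbone lets the three annotation tasks reuse one feature computation while each head specializes, which is the usual motivation for this design.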