The paper introduces Recipe Diffusion, a framework for generating coherent visual recipe instructions by enforcing cross-frame consistency and region-aware noise application. The authors modify attention layers to share key-value pairs across frames, promoting global consistency, and apply noise differentially based on object regions to preserve object identity while allowing contextual variation. Experiments show that the framework generates recipe instruction sequences with better coherence and object fidelity than independent frame generation.
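The key-value sharing idea can be illustrated with a minimal sketch. This is not the authors' implementation — it is a toy NumPy version (with hypothetical frame and feature dimensions) showing the core move: every frame's queries attend to a key/value bank pooled from all frames, so appearance features propagate across the sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(q_frames, k_frames, v_frames):
    """Toy cross-frame attention: each frame's queries attend to the
    concatenated keys/values of ALL frames, sharing visual features
    across the sequence instead of per-frame self-attention."""
    # Pool keys and values from every frame into one shared bank.
    k_shared = np.concatenate(k_frames, axis=0)   # (F*T, d)
    v_shared = np.concatenate(v_frames, axis=0)   # (F*T, d)
    d = k_shared.shape[-1]
    outputs = []
    for q in q_frames:                            # each q: (T, d)
        scores = q @ k_shared.T / np.sqrt(d)      # (T, F*T)
        outputs.append(softmax(scores) @ v_shared)
    return outputs
```

In the actual framework this substitution happens inside the attention layers of a pre-trained diffusion model; the sketch only shows why pooled keys/values yield globally consistent outputs.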
Achieve coherent, step-by-step visual recipe generation without training by intelligently sharing information across frames and applying noise selectively.
This paper presents a cross-frame attention and region-aware diffusion method for generating coherent, step-by-step visual instructions for cooking recipes. Our approach combines two complementary mechanisms: (1) cross-frame key-value sharing in attention layers to maintain global consistency across sequential frames, and (2) region-aware noise application, which preserves object identity while allowing contextual changes. Unlike conventional models that generate each image independently, our training-free framework leverages pre-trained detection and segmentation models to create region masks and modifies the attention mechanism to share visual features across frames. By integrating differential noise application with cross-frame attention consistency, our system generates recipe instruction sequences that maintain both global coherence and local object identity throughout each step.
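The region-aware noise mechanism can be sketched as follows. This is a simplified, hypothetical version (NumPy, single noise step, made-up `preserve` parameter) of the underlying idea: noise is applied at full strength outside the object mask and at reduced strength inside it, so the object's identity stays stable while the surrounding context is free to change.

```python
import numpy as np

def region_aware_noise(image, mask, t_strength=1.0, preserve=0.2, seed=0):
    """Add noise at full strength outside the object mask and at
    reduced strength inside it, so object regions keep their identity
    while the background/context can vary.

    image: (H, W, C) float array; mask: (H, W), nonzero inside object.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(image.shape)
    # Per-pixel noise scale: `preserve` inside the object, full outside.
    scale = np.where(mask[..., None] > 0, preserve, t_strength)
    return image + scale * noise
```

In the full framework the masks come from pre-trained detection and segmentation models, and the scaling is applied inside the diffusion sampling loop rather than as a single additive step.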