Search papers, labs, and topics across Lattice.
3
0
6
1
Coordinated typographic attacks across modalities can more than double the success rate of misleading audio-visual MLLMs compared to single-modality attacks.
VLMs, despite excelling at semantic tasks, are surprisingly brittle when faced with basic geometric transformations like rotations and scaling.
Diffusion Transformers get a 3x speed boost without sacrificing image quality, thanks to a clever trick of dynamically adjusting patch sizes during denoising.