This paper constructs the Visual In-Context Benchmark (VIBE) to evaluate how well visual in-context learning models adapt across diverse imaging domains and tasks. By stress-testing six models on 14 datasets and 12 tasks (106 dataset-task combinations in total) in a one-shot setting, the authors reveal significant limitations and systematic failure modes in current approaches to visual in-context learning. The findings suggest that evaluation needs to extend beyond the tasks and image domains seen in pre-training to measure real adaptation.
Visual in-context learning models struggle to adapt, showing critical limitations across 106 dataset-task combinations.
Visual in-context learning has been proposed as a pathway towards dynamic models that generate predictions based on a provided context and can thereby adapt to new vision tasks at test time. Yet the evaluation of these models' adaptation capabilities has been limited to narrow setups that mainly mirror tasks or image domains from pre-training, for which real adaptation is not required. We address this gap by constructing a broad Visual In-Context BEnchmark (VIBE) with a focus on diverse imaging domains and a wide range of tasks. This gives a much clearer picture of the adaptive capabilities of visual in-context models when faced with new image and task distributions. We stress-test six models on $14$ datasets and $12$ tasks (in total, $106$ dataset-task combinations) and compare them under a unified, reproducible evaluation protocol in a one-shot setting. Our evaluation uncovers key insights into the state of visual in-context learning, including limitations, systematic failure modes, and promising directions. To foster broader evaluation, we will openly release our VIBE toolkit.