KIT | Jun 9, 2026 | arXiv:2606.10905

Beyond Model Size: Probing the Gaps in Visual in-Context Learning by Training a Tiny Model

Sunil Khatri, Steven Landgraf, Markus Ulrich, Simon Reiß

AI Summary

This paper investigates a tiny visual in-context learning (VICL) model with only 1 million parameters, trained on 70,000 images, and contrasts its performance with that of much larger VICL models. The authors report that, despite the large disparity in model size, the tiny model performs competitively across several adaptive settings, which points to deficiencies in how adaptive capabilities of vision models are currently evaluated. The study argues that the benchmarking of adaptive learning needs to be re-evaluated, and suggests that model size and data scaling may not be the sole determinants of success in VICL.

Key Contribution

A tiny VICL model challenges the assumption that bigger is always better, revealing critical gaps in how we evaluate adaptive capabilities in vision tasks.

Abstract

Visual in-Context Learning (VICL) aims to make progress towards adaptive vision models that can, based on a few examples, adapt to a new task at test time. Following the history of in-context learning in natural language processing research, where large, parameter-heavy models are in use, one pathway that current VICL methods take is to treat model and data scaling as key ingredients. Yet it is not clear whether these ingredients are the key for in-context learning to take shape in vision models. To stress-test such large models, we challenge them with an extreme counterexample: we train a tiny visual in-context model with merely $1$ million parameters on a modest $70,000$ images. We compare the results of this severely capacity-capped tiny model to $7,000\times$ larger VICL models in three adaptive settings: (1) on image data with small distribution shifts, (2) on unseen task encodings, and (3) on a completely new task, i.e., the setting VICL envisions. Given the chasm in training resources between the tiny and large models, our experiments expose shortcomings in how adaptive capabilities are measured, with respect to how tasks are encoded, which tasks were used in pre-training, and the choice of metrics. These gaps in current VICL benchmarking underscore a need for innovation in the evaluation of adaptive capabilities.

Computer Vision | Multimodal Models | Scaling Laws & Emergent Abilities

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
