This paper presents a systematic evaluation of how robust vision foundation models are to common image perturbations such as JPEG compression and brightness adjustments. The authors define three robustness metrics, evaluate six industry-scale models (OpenAI, Meta) against nine perturbation categories, and find the models generally non-robust. They further show that these perturbations degrade downstream task performance and propose a fine-tuning method that improves robustness without compromising utility.
Vision foundation models are surprisingly brittle: common image edits can drastically alter their embeddings and tank downstream performance.
A vision foundation model outputs an embedding vector for an image, which can be affected by common editing operations (e.g., JPEG compression, brightness, contrast adjustments). These common perturbations alter embedding vectors and may impact the performance of downstream tasks using these embeddings. In this work, we present the first systematic study on foundation models' robustness to such perturbations. We propose three robustness metrics and formulate five desired mathematical properties for these metrics, analyzing which properties they satisfy or violate. Using these metrics, we evaluate six industry-scale foundation models (OpenAI, Meta) across nine common perturbation categories, finding them generally non-robust. We also show that common perturbations degrade downstream application performance (e.g., classification accuracy) and that robustness values can predict performance impacts. Finally, we propose a fine-tuning approach to improve robustness without sacrificing utility.
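To make the idea of embedding drift concrete, here is a minimal sketch of comparing an image's embedding before and after a brightness change. The abstract does not define the paper's three robustness metrics, so this uses plain cosine similarity as an assumed, illustrative measure; it is not one of the authors' metrics. The names `toy_embed`, `adjust_brightness`, and `embedding_stability` are hypothetical, and `toy_embed` is a random projection standing in for a real foundation model.

```python
import math
import random


def toy_embed(pixels, dim=8, seed=0):
    """Hypothetical stand-in for a foundation model (assumption, not a real
    encoder): a fixed random linear projection followed by tanh."""
    rng = random.Random(seed)  # same seed, so the "model" weights are fixed
    weights = [[rng.gauss(0.0, 1.0) for _ in pixels] for _ in range(dim)]
    return [
        math.tanh(sum(w * p for w, p in zip(row, pixels)) / len(pixels))
        for row in weights
    ]


def adjust_brightness(pixels, factor):
    """Scale pixel intensities (assumed to lie in [0, 1]) and clip to range."""
    return [min(1.0, max(0.0, p * factor)) for p in pixels]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def embedding_stability(pixels, perturb, embed=toy_embed):
    """Similarity between the embeddings of an image and its perturbed copy.
    Values near 1.0 mean the perturbation barely moved the embedding."""
    return cosine_similarity(embed(pixels), embed(perturb(pixels)))


if __name__ == "__main__":
    rng = random.Random(1)
    image = [rng.random() for _ in range(64)]  # flattened toy grayscale image
    for factor in (1.0, 1.2, 1.5, 2.0):
        score = embedding_stability(image, lambda p: adjust_brightness(p, factor))
        print(f"brightness x{factor}: cosine similarity = {score:.4f}")
```

With a real model, `toy_embed` would be replaced by the model's image encoder and the pixel list by a decoded image. Averaging the score over many images and perturbation strengths would give a rough per-category robustness number, though the paper's own metrics may be defined differently.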