Mar 5, 2026 | arXiv:2603.04894

Differentially Private Multimodal In-Context Learning

Ivoline C. Ngong, Zarreen Reza, Joseph P. Near

AI Summary

This paper introduces Differentially Private Multimodal Task Vectors (DP-MTV), a framework for many-shot multimodal in-context learning with formal $(\varepsilon, \delta)$-differential privacy. DP-MTV aggregates hundreds of demonstrations into compact task vectors in activation space, applies per-layer clipping to bound sensitivity, and adds calibrated noise once, so a single noise addition supports unlimited inference queries. Experiments on eight benchmarks and three vision-language models show that DP-MTV retains most of the in-context learning gain under privacy constraints: at $\varepsilon=1.0$ it reaches 50% on VizWiz, compared to 55% non-private and 35% zero-shot.

Key Contribution

DP-MTV enables privacy-preserving many-shot multimodal in-context learning by distilling hundreds of demonstrations into compact, differentially private task vectors.

Abstract

Vision-language models are increasingly applied to sensitive domains such as medical imaging and personal photographs, yet existing differentially private methods for in-context learning are limited to few-shot, text-only settings because privacy cost scales with the number of tokens processed. We present Differentially Private Multimodal Task Vectors (DP-MTV), the first framework enabling many-shot multimodal in-context learning with formal $(\varepsilon, \delta)$-differential privacy by aggregating hundreds of demonstrations into compact task vectors in activation space. DP-MTV partitions private data into disjoint chunks, applies per-layer clipping to bound sensitivity, and adds calibrated noise to the aggregate, requiring only a single noise addition that enables unlimited inference queries. We evaluate on eight benchmarks across three VLM architectures, supporting deployment with or without auxiliary data. At $\varepsilon=1.0$, DP-MTV achieves 50% on VizWiz compared to 55% non-private and 35% zero-shot, preserving most of the gain from in-context learning under meaningful privacy constraints.
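The clip-aggregate-noise pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`clip_rows`, `sum_sensitivity`, `noisy_task_vector`), the array layout `(num_chunks, num_layers, dim)`, the use of Gaussian noise, and the replace-one sensitivity bound are all assumptions made for illustration. How each chunk's activations are extracted from a VLM, and how the noise scale is calibrated to a target $(\varepsilon, \delta)$, are not shown.

```python
import numpy as np


def clip_rows(x, clip_norm):
    """Scale each vector along the last axis so its L2 norm is at most clip_norm."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return x * scale


def sum_sensitivity(num_layers, clip_norm):
    """L2 sensitivity of the per-layer sum of clipped chunk vectors.

    Assumption: each private example belongs to exactly one chunk, and
    replacing it can move that chunk's clipped vector anywhere inside a ball
    of radius clip_norm in every layer (difference at most 2 * clip_norm per layer).
    """
    return 2.0 * clip_norm * np.sqrt(num_layers)


def noisy_task_vector(chunk_vectors, clip_norm, noise_std, rng):
    """Clip per layer, sum over chunks, add Gaussian noise once, then average.

    chunk_vectors: array of shape (num_chunks, num_layers, dim), one
    activation-space vector per disjoint chunk and layer (hypothetical layout).
    noise_std: standard deviation of the noise added to the sum; choosing it
    for a given (epsilon, delta) is outside this sketch.
    """
    clipped = clip_rows(chunk_vectors, clip_norm)
    total = clipped.sum(axis=0)
    noisy_total = total + rng.normal(0.0, noise_std, size=total.shape)
    return noisy_total / chunk_vectors.shape[0]


# Usage with synthetic data standing in for real chunk activations.
rng = np.random.default_rng(0)
chunks = rng.normal(size=(32, 4, 64))
noise_std = 1.5 * sum_sensitivity(num_layers=4, clip_norm=1.0)  # 1.5 is an arbitrary multiplier
task_vector = noisy_task_vector(chunks, clip_norm=1.0, noise_std=noise_std, rng=rng)
print(task_vector.shape)
```

Because the noise is added once to the aggregate, any later use of `task_vector` at inference time is post-processing of an already private quantity, which is consistent with the abstract's claim of unlimited inference queries.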

Computer Vision, Constitutional AI & AI Ethics, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 41

Year: 2026

Venue: N/A
