The paper introduces ControlMLLM++, a test-time adaptation framework for frozen MLLMs that uses learnable visual prompts to enable fine-grained region-based visual reasoning without retraining. It leverages cross-modal attention maps to optimize a latent visual token modifier during inference, guiding model attention towards specific regions based on a task-specific energy function. By incorporating an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias), ControlMLLM++ achieves strong out-of-domain generalization and interpretability across diverse visual prompt types.
Steer frozen MLLMs to reason about specific image regions at test time, without any training, by optimizing visual prompts that guide cross-modal attention.
We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. Leveraging the insight that cross-modal attention maps intrinsically encode semantic correspondences between textual tokens and visual regions, ControlMLLM++ optimizes a latent visual token modifier during inference via a task-specific energy function to steer model attention towards user-specified areas. To enhance optimization stability and mitigate language prompt biases, ControlMLLM++ incorporates an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias). Supporting diverse visual prompt types including bounding boxes, masks, scribbles, and points, our method demonstrates strong out-of-domain generalization and interpretability. The code is available at https://github.com/mrwu-mac/ControlMLLM.
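To make the core idea concrete, here is a minimal toy sketch of test-time optimization of a latent visual token modifier. It is not the authors' implementation: the single attention step, the function names (`region_mass`, `optimize_visual_prompt`), the squared energy E = (1 - mass)^2, and the plain gradient-descent loop are all assumptions chosen for illustration, and Optim++ and PromptDebias are not modeled. The "model" (visual tokens and text query) stays frozen; only the additive latent is updated so that attention mass moves into the user-specified region.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def region_mass(visual_tokens, query, prompt, mask):
    """Attention of one text query over (visual tokens + latent prompt).

    Returns the attention vector and the attention mass inside the mask.
    This single dot-product attention is a stand-in for an MLLM's
    cross-modal attention map (an assumption, not the paper's model).
    """
    d = query.shape[0]
    logits = (visual_tokens + prompt) @ query / np.sqrt(d)
    attn = softmax(logits)
    return attn, float(attn @ mask)

def optimize_visual_prompt(visual_tokens, query, mask, steps=200, lr=0.5):
    """Gradient descent on a latent modifier; the tokens and query stay frozen.

    Hypothetical energy: E = (1 - mass)^2, where mass is the attention
    falling inside the region given by the binary mask.
    """
    d = query.shape[0]
    prompt = np.zeros_like(visual_tokens)  # the learnable latent modifier
    history = []
    for _ in range(steps):
        attn, mass = region_mass(visual_tokens, query, prompt, mask)
        history.append(mass)
        # Chain rule: dE/dmass = -2(1 - mass); dmass/dlogit_j = attn_j (mask_j - mass)
        dE_dlogits = -2.0 * (1.0 - mass) * attn * (mask - mass)
        # dlogit_j/dprompt_j = query / sqrt(d)
        grad = np.outer(dE_dlogits, query) / np.sqrt(d)
        prompt -= lr * grad
    history.append(region_mass(visual_tokens, query, prompt, mask)[1])
    return prompt, history

# Toy usage: 16 visual tokens of width 8; the "region" covers tokens 5..8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
text_query = rng.normal(size=8)
region = np.zeros(16)
region[5:9] = 1.0

latent, masses = optimize_visual_prompt(tokens, text_query, region)
print(f"attention mass in region: {masses[0]:.3f} -> {masses[-1]:.3f}")
```

In this sketch a bounding box, mask, scribble, or point would each reduce to a different binary `mask` over visual tokens, which is one plausible reading of how a single energy function can serve several visual prompt types.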