NJU | Mar 20, 2026 | arXiv:2604.13074

PersonaVLM: Long-Term Personalized Multimodal LLMs

Chang Nie, Chaoyou Fu, Yifan Zhang, Haihua Yang, Caifeng Shan

AI Summary

PersonaVLM is introduced as a framework to enable long-term personalization in MLLMs by incorporating memory, reasoning, and response alignment capabilities. The framework extracts and summarizes multimodal memories, retrieves relevant memories for multi-turn reasoning, and infers user personality to align outputs. Evaluated on a new benchmark, Persona-MME, PersonaVLM demonstrates significant improvements over baselines and even outperforms GPT-4o in long-term personalization tasks.

Key Contribution

Forget static, single-turn personalization – PersonaVLM unlocks long-term, evolving user alignment in MLLMs, even surpassing GPT-4o.

Abstract

Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited. Prior approaches enable only static, single-turn personalization through input augmentation or output alignment, and thus fail to capture users' evolving preferences and personality over time (see Fig. 1). In this paper, we introduce PersonaVLM, an innovative personalized multimodal agent framework designed for long-term personalization. It transforms a general-purpose MLLM into a personalized assistant by integrating three key capabilities: (a) Remembering: It proactively extracts and summarizes chronological multimodal memories from interactions, consolidating them into a personalized database. (b) Reasoning: It conducts multi-turn reasoning by retrieving and integrating relevant memories from the database. (c) Response Alignment: It infers the user's evolving personality throughout long-term interactions to ensure outputs remain aligned with their unique characteristics. For evaluation, we establish Persona-MME, a comprehensive benchmark comprising over 2,000 curated interaction cases, designed to assess long-term MLLM personalization across seven key aspects and 14 fine-grained tasks. Extensive experiments validate our method's effectiveness, improving the baseline by 22.4% (Persona-MME) and 9.8% (PERSONAMEM) under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively. Project page: https://PersonaVLM.github.io.
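
The three capabilities named in the abstract (remembering, reasoning over retrieved memories, and response alignment) can be illustrated with a toy sketch. This is not the authors' implementation: the class and method names (`PersonalMemoryStore`, `remember`, `retrieve`, `update_trait`), the word-overlap retrieval, and the moving-average personality update are all assumptions made for illustration, standing in for whatever summarization, retrieval, and personality-inference components PersonaVLM actually uses.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


def tokenize(text: str) -> list[str]:
    """Lowercase and split text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


@dataclass
class Memory:
    turn: int      # chronological position of the interaction
    summary: str   # short text summary (hypothetical format)


@dataclass
class PersonalMemoryStore:
    """Hypothetical per-user store; not the paper's database design."""
    memories: list[Memory] = field(default_factory=list)
    traits: dict[str, float] = field(default_factory=dict)

    def remember(self, turn: int, summary: str) -> None:
        # "Remembering": append a summarized memory in chronological order.
        self.memories.append(Memory(turn, summary))

    def retrieve(self, query: str, k: int = 2) -> list[Memory]:
        # "Reasoning" support: rank memories by word overlap with the query.
        # Assumption: lexical overlap stands in for a real retriever.
        q = Counter(tokenize(query))
        scored = []
        for m in self.memories:
            d = Counter(tokenize(m.summary))
            overlap = sum(min(q[t], d[t]) for t in q)
            if overlap > 0:
                scored.append((overlap, m.turn, m))
        # Highest overlap first; ties broken in favor of more recent turns.
        scored.sort(key=lambda item: (-item[0], -item[1]))
        return [m for _, _, m in scored[:k]]

    def update_trait(self, name: str, observation: float, rate: float = 0.5) -> None:
        # "Response alignment" support: track an evolving trait score with
        # an exponential moving average (an assumed update rule).
        old = self.traits.get(name, observation)
        self.traits[name] = (1 - rate) * old + rate * observation


store = PersonalMemoryStore()
store.remember(1, "User adopted a corgi named Mochi")
store.remember(2, "User prefers vegetarian recipes")
store.remember(3, "User moved Mochi to a new apartment")
hits = store.retrieve("What is the name of my corgi Mochi?")
print([m.turn for m in hits])
```

In a full system, the retrieved memories and the current trait estimates would be passed to the MLLM as context for the next response; that step is omitted here because the abstract does not specify how it is done.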

Multimodal Models | RLHF & Preference Learning | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
