May 1, 2026
arXiv:2605.08129

Towards Customized Multimodal Role-Play

Chao Tang, Jianzong Wu, Qingyu Shi, Ye Tian, Aixi Zhang, Haozhe Jiang, Jiangning Zhang, Yunhai Tong

AI Summary

The paper introduces Customized Multimodal Role-Play (CMRP), a new task of jointly customizing an AI character's persona, dialogue style, and visual identity while keeping outputs consistent across modalities. To support the task, the authors construct the RoleScape-20 dataset (20 characters) and develop UniCharacter, a two-stage training framework consisting of Unified Supervised Finetuning (Unified-SFT) and character-specific group relative policy optimization (Character-GRPO). On RoleScape-20, UniCharacter, customized from only 10 images plus corresponding interaction examples, substantially outperforms prior approaches at generating text and images that stay coherent with the target character's persona, style, and visual identity.

Key Contribution

With just 10 images and corresponding interaction examples, a unified model can be fine-tuned to embody a specific character with a consistent persona, dialogue style, and visual identity across both text and images.

Abstract

Unified multimodal understanding and generation models enable richer human-AI interaction. Yet jointly customizing a character's persona, dialogue style, and visual identity while maintaining output consistency across modalities remains largely unexplored. To mitigate this gap, we introduce a new task, Customized Multimodal Role-Play (CMRP). We construct the RoleScape-20 dataset comprising 20 characters, including training and evaluation data that cover persona, stylistic descriptions, visual/expressive cues, and text-image interactions. Building on a unified model, we devise UniCharacter, a two-stage training framework containing Unified Supervised Finetuning (Unified-SFT) and character-specific group relative policy optimization (Character-GRPO). Given only 10 images plus corresponding interaction examples, the model acquires the target character and exhibits coherent persona, style, and visual identity in both generated text and images. This process takes about 100 GPU hours. Experiments on the RoleScape-20 dataset show that the proposed method substantially outperforms prior approaches. Ablation studies further validate the effectiveness of our cross-modal consistency design and few-shot customization strategy. We argue that CMRP, coupled with unified modeling, provides a basis for next-generation characterful and immersive interactive agents.
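The abstract names group relative policy optimization as the second training stage but does not spell out its details. As a rough illustration only, the sketch below shows the group-relative advantage step used in generic GRPO: several responses are sampled for the same prompt, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` term, the choice of population standard deviation, and the example rewards are all assumptions made for illustration; none of this is taken from the paper's Character-GRPO.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Hypothetical helper for illustration; not the paper's implementation.
    Each advantage is (reward - group mean) / (group std + eps), so
    responses are scored relative to their siblings rather than against
    a learned value function.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four responses sampled from the same prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean get positive advantages and are reinforced; those below the mean are discouraged. When all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.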

Data Curation & Synthetic Data, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
