Google Research | Max Planck | VIA Research Center | Apr 22, 2026 | arXiv:2604.20705

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr, Federico Tombari, Bernt Schiele

AI Summary

The paper introduces SSL-R1, a self-supervised reinforcement learning framework for post-training MLLMs that derives verifiable rewards directly from images by reformulating standard SSL tasks into visual puzzles. This approach eliminates the need for language-centric priors and manual annotations, enabling vision-centric reward design at scale. Experiments demonstrate that post-training MLLMs with SSL-R1 significantly improves performance on multimodal understanding and reasoning benchmarks.

Key Contribution

Ditch the language priors: SSL-R1 unlocks verifiable rewards for MLLM reinforcement learning directly from images, using self-supervision to solve visual puzzles.

Abstract

Reinforcement learning (RL) with verifiable rewards (RLVR) has demonstrated great potential for enhancing the reasoning abilities of multimodal large language models (MLLMs). However, the reliance on language-centric priors and expensive manual annotations limits MLLMs' intrinsic visual understanding and hinders scalable reward design. In this work, we introduce SSL-R1, a generic self-supervised RL framework that derives verifiable rewards directly from images. To this end, we revisit self-supervised learning (SSL) in visual domains and reformulate widely used SSL tasks into a set of verifiable visual puzzles for RL post-training, requiring neither human nor external model supervision. Training MLLMs on these tasks substantially improves their performance on multimodal understanding and reasoning benchmarks, highlighting the potential of leveraging vision-centric self-supervised tasks for MLLM post-training. We believe this work will provide useful insights for devising effective self-supervised verifiable rewards that enable RL at scale. Project page: https://github.com/Jiahao000/SSL-R1.
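To make the idea of a reward derived directly from an image concrete, here is a minimal sketch. The abstract does not name the specific SSL tasks, so this example assumes rotation prediction, a well-known self-supervised pretext task, as a stand-in. The function names, the prompt wording, and the `<answer>` tag format are all hypothetical and not taken from the paper.

```python
import random
import re

ANGLES = (0, 90, 180, 270)  # assumed answer space for the puzzle


def rotate90(image):
    """Rotate a 2-D grid (list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]


def make_rotation_puzzle(image, rng):
    """Build one puzzle from an unlabeled image.

    The label is known only because we applied the transformation
    ourselves, so no human or external model annotation is needed.
    """
    angle = rng.choice(ANGLES)
    rotated = image
    for _ in range(angle // 90):
        rotated = rotate90(rotated)
    prompt = (
        "The image was rotated counter-clockwise by 0, 90, 180, or 270 "
        "degrees. Give the angle inside <answer></answer>."
    )
    return {"image": rotated, "prompt": prompt, "label": angle}


def rotation_reward(response, label):
    """Verifiable reward: 1.0 if the parsed answer equals the label, else 0.0."""
    match = re.search(r"<answer>\s*(\d+)\s*</answer>", response)
    if match is None:
        return 0.0  # unparseable responses earn no reward
    return 1.0 if int(match.group(1)) == label else 0.0


# Usage: a toy 2x2 "image" stands in for real pixel data.
puzzle = make_rotation_puzzle([[1, 2], [3, 4]], random.Random(0))
reward = rotation_reward(f"<answer>{puzzle['label']}</answer>", puzzle["label"])
```

Because the reward is an exact match against a label produced by the transformation, it can be checked programmatically, which is the property an RLVR-style training loop relies on.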

Computer Vision | Multimodal Models | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
