This paper introduces Visual On-Policy Self-Distillation (Visual-OPSD), a teacher-student method in which both roles share the same weights and differ only in context: the teacher sees privileged visual thoughts, while the student sees only the question. The authors show that generating visual thoughts in unified multimodal models costs roughly an order of magnitude more at inference while yielding limited direct accuracy benefit. Across nine benchmarks, Visual-OPSD improves on its generative teacher by +3.40 percentage points with a 14.3x speedup (10.0s vs. 142.8s per sample), and it outperforms same-scale vision-language models by +63.83 percentage points on VSP. Control experiments indicate that the gains come from reasoning encoded in the generation pathway rather than from the rendered pixels.
Visual-OPSD distills visual-thought reasoning into a text-only student that runs 14.3x faster than its generative teacher while improving accuracy by 3.40 percentage points across nine benchmarks.
Unified multimodal models (UMMs) interleave generated "visual thoughts" (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. We find this cost yields limited direct benefit. On ThinkMorph, removing or noising VTs barely changes accuracy across nine benchmarks. Once rendered, attention concentrates on the VT regardless of content. Yet a KL diagnostic shows that conditioning on a privileged VT trace shifts the model's completion distribution. This suggests the generation pathway encodes useful reasoning beyond the rendered pixels. Motivated by this gap, we propose Visual On-Policy Self-Distillation (Visual-OPSD). Teacher and student share identical weights but differ in context: the teacher sees privileged VTs while the student sees only the question. Token-level JSD distillation on on-policy student trajectories transfers the teacher's reasoning to a text-only student. Across nine benchmarks, Visual-OPSD improves over its generative teacher by $+3.40$pp with $14.3\times$ speedup (10.0s vs. 142.8s per sample) and outperforms same-scale VLMs by $+63.83$pp on VSP. A Gaussian-noise control ($+0.40$pp vs. $+10.28$pp for real VTs) and $58.4\%$ closure of the KL gap confirm that gains come from the semantic content of the generation pathway.
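To make the token-level JSD objective concrete, the following is a minimal sketch and not the authors' implementation. It assumes a symmetric, equally weighted Jensen-Shannon divergence averaged over the positions of a single student-sampled trajectory; the abstract does not specify the weighting, the reduction, or whether the teacher pass is gradient-stopped. The function names (`softmax`, `kl`, `jsd`, `token_level_jsd`) and the toy logits are hypothetical. In an actual training loop, both sets of logits would come from the same weights scoring the same student-sampled tokens, once with the privileged VT in context (teacher) and once with the question only (student).

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Symmetric Jensen-Shannon divergence with equal mixture weights (an assumption)."""
    mix = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def token_level_jsd(teacher_logits, student_logits):
    """Mean per-token JSD along one trajectory sampled from the student.

    teacher_logits[t]: logits at position t with the privileged VT in context.
    student_logits[t]: logits at position t with the question only.
    """
    per_token = [
        jsd(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy example: two positions, vocabulary of three tokens (made-up numbers).
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.0, 1.0, 0.0], [0.1, 0.2, 3.0]]
loss = token_level_jsd(teacher, student)
```

Because the second position has identical logits, only the first position contributes to `loss`. The JSD is bounded above by $\ln 2$ per token, which keeps the signal well scaled even where teacher and student disagree sharply; that is a general property of the divergence, not a claim about why the authors chose it.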