This paper introduces ViCuR, a framework for multimodal on-policy distillation that replaces answer-side teacher privilege with visually grounded cues. Because the cues are drawn from the same visual input the student sees at inference, the student can recover the evidence behind the teacher's supervision. A lightweight cue recovery module uses sink-token cross-attention to aggregate task-relevant visual evidence without altering the inference interface. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance.
ViCuR redefines teacher privilege in multimodal reasoning: in on-policy distillation, visually grounded cues outperform answer-side privilege.
On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or rationales. However, such answer-side privilege creates a train-test mismatch: the teacher's supervision may depend on signals unavailable to the student, encouraging shortcut imitation rather than visually grounded reasoning. We propose ViCuR, a visually grounded privileged-teacher distillation framework that replaces answer-side privilege with visual cues (query-related evidence in the input). Because these cues are derived from the same visual input available at inference, their evidence is recoverable by the student. To support this, ViCuR introduces a lightweight cue recovery module that uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR consistently improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance. It also extends naturally to stronger-teacher OPD, surpassing OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale. These results show that, in multimodal on-policy distillation, the design of teacher privilege is as important as teacher strength.
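The abstract does not give equations for the cue recovery module, so the following is only a minimal sketch of the general idea of sink tokens pooling visual evidence through cross-attention. It assumes generic single-head scaled dot-product attention with no learned projections. The function name `sink_cross_attention`, the toy vectors, and the single-sink setup are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sink_cross_attention(sink_queries, visual_keys, visual_values):
    """Let each sink query attend over all visual tokens.

    Assumption: plain scaled dot-product attention, one head, no projections.
    Returns (pooled, weights): one pooled vector and one attention
    distribution over visual tokens per sink query.
    """
    scale = 1.0 / math.sqrt(len(visual_keys[0]))
    value_dim = len(visual_values[0])
    pooled, weights = [], []
    for q in sink_queries:
        # Similarity between this sink query and every visual token key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in visual_keys]
        attn = softmax(scores)
        # Attention-weighted sum of visual token values.
        out = [
            sum(a * v[j] for a, v in zip(attn, visual_values))
            for j in range(value_dim)
        ]
        pooled.append(out)
        weights.append(attn)
    return pooled, weights


# Toy input (hypothetical): three visual tokens, one sink query aligned with token 0.
visual_keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
visual_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
sink_queries = [[1.0, 0.0]]

pooled, weights = sink_cross_attention(sink_queries, visual_keys, visual_values)
print(weights[0])  # attention over the three visual tokens
print(pooled[0])   # pooled representation held by the sink token
```

In this toy case the sink query puts most of its attention on the visual token whose key it matches, so the pooled vector is dominated by that token's value. In a real model the queries, keys, and values would be learned projections of hidden states computed during prefill.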