Apr 6, 2026 · arXiv:2604.04379

Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

Weijiang Yu, Jilin Ma, Ziyu Liu, Guijian Tang, Huibin Tan, Nong Xiao

AI Summary

This paper introduces Reinforce to Learn, Elect to Reason (RLER), a dual-paradigm approach for video reasoning that decouples evidence generation from answer selection. During training, RLER uses group-relative reinforcement learning with novel rewards (frame-sensitive, think-transparency, anti-repetition) to encourage structured and verifiable reasoning traces. At inference, RLER employs a train-free orchestrator to generate diverse reasoning candidates, score them based on evidence consistency, and perform an evidence-weighted election, achieving state-of-the-art results across eight benchmarks with a 6.3% average improvement over base models.
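The "group-relative" part of the training signal can be illustrated with a minimal sketch. It assumes a GRPO-style normalization, in which each sampled response's reward is compared against the mean and standard deviation of its own group; the page does not give RLER's exact formula. The function names, the equal reward weights, and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def combined_reward(frame, transparency, anti_repetition, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward terms named in the summary.

    The equal weights are an assumption, not the paper's setting.
    """
    w_f, w_t, w_r = weights
    return w_f * frame + w_t * transparency + w_r * anti_repetition


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same question.

    Assumes GRPO-style standardization: (reward - group mean) / group std.
    eps avoids division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to one question, each scored on the three terms.
group = [
    combined_reward(0.2, 0.5, 0.3),
    combined_reward(0.6, 0.8, 0.6),
    combined_reward(0.9, 0.9, 0.9),
]
advantages = group_relative_advantages(group)
```

Under this assumption, responses that score above their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.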

Key Contribution

Explicitly teaching models to generate and leverage verifiable evidence during both training and inference unlocks state-of-the-art video reasoning performance, even with a small ensemble of candidates.

Abstract

Video reasoning has advanced with large multimodal models (LMMs), yet their inference is often a single pass that returns an answer without verifying whether the reasoning is evidence-aligned. We introduce Reinforce to Learn, Elect to Reason (RLER), a dual paradigm that decouples learning to produce evidence from obtaining a reliable answer. In RLER-Training, we optimize the policy with group-relative reinforcement learning (RL) and three novel task-driven rewards: the Frame-sensitive reward grounds reasoning on explicit key frames, the Think-transparency reward shapes readable and parsable reasoning traces, and the Anti-repetition reward boosts information density. These signals teach the model to emit structured, machine-checkable evidence and strengthen its reasoning capabilities. In RLER-Inference, we apply a train-free orchestrator that generates a small set of diverse candidates, parses their answers and cited frames, scores them by evidence consistency, confidence, transparency, and non-redundancy, and then performs a robust evidence-weighted election. This closes the loop between producing and using evidence, improving reliability and interpretability without enlarging the model. We comprehensively evaluate RLER against various open-source and RL-based LMMs on 8 representative benchmarks. RLER achieves state-of-the-art results across all benchmarks and delivers an average improvement of 6.3% over base models, while using on average 3.1 candidates per question, indicating a favorable balance between compute and quality. The results support a simple thesis: making evidence explicit during learning and electing by evidence during inference is a robust path to trustworthy video reasoning.
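The inference-side election described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract names the four scoring criteria but not how they are combined, so the linear weighting, the weight values, the `Candidate` fields, and the function names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Candidate:
    """One parsed reasoning candidate; all scores are assumed to lie in [0, 1]."""
    answer: str
    evidence_consistency: float
    confidence: float
    transparency: float
    non_redundancy: float


def candidate_score(c, weights=(0.4, 0.2, 0.2, 0.2)):
    """Combine the four criteria linearly (hypothetical weights)."""
    w_e, w_c, w_t, w_r = weights
    return (w_e * c.evidence_consistency + w_c * c.confidence
            + w_t * c.transparency + w_r * c.non_redundancy)


def elect(candidates):
    """Return the answer with the largest total score across its candidates."""
    totals = defaultdict(float)
    for c in candidates:
        totals[c.answer] += candidate_score(c)
    return max(totals, key=totals.get)


# Two weakly supported votes for "B" lose to one well-supported vote for "A".
candidates = [
    Candidate("A", 0.9, 0.8, 0.9, 0.9),
    Candidate("B", 0.3, 0.5, 0.4, 0.4),
    Candidate("B", 0.3, 0.5, 0.4, 0.4),
]
winner = elect(candidates)
```

The usage example shows how such a scheme differs from plain majority voting: an answer backed by consistent evidence can outweigh a more frequent answer whose candidates score poorly.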

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
