Monash | Taobao | Apr 6, 2026 | arXiv:2604.04415

Structured Causal Video Reasoning via Multi-Objective Alignment

Zinuo Li, Yongxin Guo, Jun Liu, Jiawei Zhan, Xi Jiang, Chengjie Wang, Mohammed Bennamoun, Farid Boussaid, Feng Zheng, Qiuhong Ke

AI Summary

This paper introduces Structured Event Facts, a compact representation of salient video events and their causal relationships, to improve video reasoning in Video-LLMs. The authors train Factum-4B, a 4B-parameter model, using a four-stage pipeline: facts alignment, format warm-start, thinking warm-start, and reinforcement learning. To address competing objectives during RL (structural completeness and causal fidelity vs. reasoning length), they formulate the optimization as a Multi-Objective Reinforcement Learning (MORL) problem, achieving stronger performance on video understanding tasks that require fine-grained temporal inference.

Key Contribution

Video-LLMs can achieve more reliable reasoning by first constructing a compact, structured representation of salient events and their causal relationships.
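To make the idea of a compact event-and-causality representation concrete, here is a minimal Python sketch. The page does not give the actual schema of Structured Event Facts, so every class, field, and method name below (`EventFact`, `StructuredEventFacts`, `add_cause`, and so on) is a hypothetical illustration, not the authors' format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EventFact:
    """One salient event. All field names are illustrative assumptions."""
    event_id: str
    start_s: float               # start time in seconds
    end_s: float                 # end time in seconds
    entities: tuple[str, ...]    # entities involved in the event
    action: str                  # short description of what happens


@dataclass
class StructuredEventFacts:
    """A set of events plus directed (cause, effect) links between them."""
    events: dict[str, EventFact] = field(default_factory=dict)
    causes: list[tuple[str, str]] = field(default_factory=list)

    def add_event(self, event: EventFact) -> None:
        self.events[event.event_id] = event

    def add_cause(self, cause_id: str, effect_id: str) -> None:
        # Unknown ids raise KeyError; a cause that starts after its
        # effect is rejected as temporally inconsistent.
        cause, effect = self.events[cause_id], self.events[effect_id]
        if cause.start_s > effect.start_s:
            raise ValueError("a cause cannot start after its effect")
        self.causes.append((cause_id, effect_id))

    def causes_of(self, effect_id: str) -> list[str]:
        # Ids of all events recorded as direct causes of effect_id.
        return [c for c, e in self.causes if e == effect_id]


# Hypothetical example: a cup is knocked over, then falls.
facts = StructuredEventFacts()
facts.add_event(EventFact("e1", 2.0, 4.5, ("person", "cup"), "knocks over"))
facts.add_event(EventFact("e2", 4.5, 6.0, ("cup",), "falls off table"))
facts.add_cause("e1", "e2")
```

A structure like this keeps the evidence short and checkable: each causal link can be verified against timestamps before any free-form reasoning is generated.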

Abstract

Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather than relying solely on immediate deductive reasoning. In contrast, existing Video-LLMs largely depend on unstructured video reasoning, where critical visual evidence is embedded in verbose textual descriptions and temporal causality is often weakly modeled. This leads to inefficient reasoning and fragile causal inference. To bridge this cognitive gap, we propose constructing a compact representation of salient events and their causal relationships, which we name Structured Event Facts, prior to the reasoning stage. This structured prior serves as an explicit constraint to promote concise and causally grounded reasoning, while also making intermediate evidence easier to verify. To effectively train models on such structured facts, we introduce CausalFact-60K and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. During the RL stage, we find that this framework introduces competing objectives, as structural completeness and causal fidelity must be balanced against reasoning length, making it difficult to optimize. We address this challenge by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto frontier to balance these trade-offs. As a result, we introduce Factum-4B, which yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference.
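The Pareto frontier mentioned in the abstract can be illustrated with a generic dominance filter. This sketch is not the paper's MORL algorithm; it only shows what "non-dominated" means when each response is scored on several objectives. The three-component reward vector (structural completeness, causal fidelity, negated reasoning length) and the numbers are assumptions chosen for illustration.

```python
def dominates(a, b):
    """True if vector a is at least as good as b on every objective
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Hypothetical per-response scores:
# (structural completeness, causal fidelity, -reasoning length in tokens)
rollouts = [
    (0.9, 0.8, -400),
    (0.9, 0.8, -250),   # same quality, shorter: dominates the first
    (0.6, 0.9, -120),
    (0.5, 0.7, -300),   # dominated by the third
]
front = pareto_front(rollouts)
```

Here the second and third rollouts survive: one is more structurally complete, the other is more causally faithful and shorter, and neither is better on all three objectives at once. That kind of trade-off is what a single scalar reward tends to hide.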

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
