This paper introduces the Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes a scene into independent factors, each with its own latent action. Learning a separate latent action per factor addresses the limitations of monolithic inverse and forward dynamics models in complex multi-entity environments. Experiments on both simulation and real-world datasets show that FLAM outperforms monolithic approaches in prediction accuracy and representation quality and facilitates downstream policy learning.
Factored world models can disentangle the dynamics of multiple interacting entities, leading to more controllable video generation and improved policy learning.
Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next-step factor value. This factorized structure enables more accurate modeling of complex multi-entity dynamics and improves video generation quality in action-free video settings compared to monolithic models. Based on experiments on both simulation and real-world multi-entity datasets, we find that FLAM outperforms prior work in prediction accuracy and representation quality, and facilitates downstream policy learning, demonstrating the benefits of factorized latent action models.
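The factored structure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how factors are extracted, what form the inverse and forward dynamics models take, or whether parameters are shared across factors. The sizes `K`, `D`, `A`, the untrained random linear maps `W_inv` and `W_fwd`, and all function names below are hypothetical, chosen only to show that each factor infers its own latent action and predicts its own next-step value.

```python
import math
import random

random.seed(0)

# Hypothetical sizes: number of factors, factor dimension, latent-action dimension.
K, D, A = 3, 4, 2

def rand_matrix(rows, cols):
    """Small random weight matrix (rows x cols); stands in for learned parameters."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def vecmat(x, W):
    """Multiply a row vector x (len = rows of W) by matrix W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Assumption: one set of weights shared by every factor.
W_inv = rand_matrix(2 * D, A)   # inverse dynamics: (z_t, z_next) -> latent action
W_fwd = rand_matrix(D + A, D)   # forward dynamics: (z_t, action) -> change in z

def infer_latent_action(z_t, z_next):
    """Inverse dynamics for a single factor: uses only that factor's own transition."""
    return [math.tanh(v) for v in vecmat(z_t + z_next, W_inv)]

def predict_next_factor(z_t, action):
    """Forward dynamics for a single factor: its own state plus its own latent action."""
    delta = vecmat(z_t + action, W_fwd)
    return [z + d for z, d in zip(z_t, delta)]

def factored_step(factors_t, factors_next):
    """Apply the inverse and forward models to each factor independently."""
    actions = [infer_latent_action(z, zn) for z, zn in zip(factors_t, factors_next)]
    preds = [predict_next_factor(z, a) for z, a in zip(factors_t, actions)]
    return actions, preds

# Toy data: K factor vectors at time t and at time t+1.
factors_t = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(K)]
factors_next = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(K)]

actions, preds = factored_step(factors_t, factors_next)
```

In this sketch, changing the next-step value of one factor changes only that factor's latent action, which is the independence property the factored design relies on. A monolithic model would instead concatenate all factors and infer a single latent action for the whole scene.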