This paper introduces Hierarchical Causal Latent State Machines (HCLSM), a world model architecture designed to overcome the limitations of flat latent representations in video prediction by combining object-centric decomposition, hierarchical temporal dynamics, and causal structure learning. HCLSM uses slot attention with spatial broadcast decoding for object segmentation, a three-level temporal engine (state space models, sparse transformers, and compressed transformers), and graph neural networks to model object interactions. Trained on the PushT benchmark, HCLSM achieves state-of-the-art next-state prediction error (0.008 MSE) with emergent object decomposition, accelerated by a custom Triton kernel for SSM scans.
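The core mechanism named above, slot attention, makes a fixed set of slot vectors compete for input features via a softmax over the slot axis. The sketch below is an illustrative NumPy summary of that competition step only, with random matrices standing in for learned parameters and the GRU/MLP slot update simplified to a weighted mean; it is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, n_slots=4, n_iters=3, d=16, seed=0):
    """Simplified slot attention: slots compete for input locations via a
    softmax over the SLOT axis, then aggregate features by weighted mean."""
    rng = np.random.default_rng(seed)
    n, d_in = inputs.shape
    # Random projections stand in for learned parameters (illustration only).
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d_in, d)) / np.sqrt(d_in)
    Wv = rng.standard_normal((d_in, d)) / np.sqrt(d_in)
    slots = rng.standard_normal((n_slots, d))
    k, v = inputs @ Wk, inputs @ Wv
    for _ in range(n_iters):
        q = slots @ Wq
        logits = q @ k.T / np.sqrt(d)                   # (n_slots, n)
        attn = softmax(logits, axis=0)                  # slots compete per location
        attn = attn / attn.sum(axis=1, keepdims=True)   # weighted-mean weights
        slots = attn @ v                                # simplified update (no GRU)
    return slots, attn

feats = np.random.default_rng(1).standard_normal((64, 8))  # 64 locations, 8-dim
slots, attn = slot_attention(feats)
print(slots.shape, attn.shape)  # (4, 16) (4, 64)
```

The softmax over slots (rather than over locations, as in standard attention) is what induces the competitive segmentation; spatial broadcast decoding then reconstructs the frame from each slot separately.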
World models can achieve state-of-the-art video prediction and emergent object decomposition by combining object-centric slots, hierarchical temporal dynamics, and learned causal interaction graphs.
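The "learned causal interaction graphs" amount to message passing between object slots weighted by a soft adjacency. A minimal NumPy sketch of one interaction round, assuming a dot-product adjacency and random stand-in weights (not the paper's GNN):

```python
import numpy as np

def interaction_step(slots, adj, W_msg, W_upd):
    """One message-passing round: each slot aggregates messages from the
    slots it is connected to (adj is row-normalized), then updates."""
    msgs = adj @ (slots @ W_msg)           # aggregate neighbor messages
    return np.tanh(slots @ W_upd + msgs)   # simplified nonlinear update

rng = np.random.default_rng(0)
n_slots, d = 4, 16
slots = rng.standard_normal((n_slots, d))
# Soft adjacency from pairwise dot products, row-normalized — a stand-in
# for a learned causal interaction graph.
logits = slots @ slots.T
adj = np.exp(logits - logits.max(axis=1, keepdims=True))
adj /= adj.sum(axis=1, keepdims=True)
W_msg = rng.standard_normal((d, d)) / np.sqrt(d)
W_upd = rng.standard_normal((d, d)) / np.sqrt(d)
out = interaction_step(slots, adj, W_msg, W_upd)
print(out.shape)  # (4, 16)
```

In a trained model the adjacency would be produced by a learned edge scorer and could be sparsified or thresholded to expose the causal structure between objects.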
World models that predict future states from video remain limited by flat latent representations that entangle objects, ignore causal structure, and collapse temporal dynamics into a single scale. We present HCLSM, a world model architecture built on three interconnected principles: object-centric decomposition via slot attention with spatial broadcast decoding; hierarchical temporal dynamics through a three-level engine that combines selective state space models for continuous physics, sparse transformers for discrete events, and compressed transformers for abstract goals; and causal structure learning through graph neural network interaction patterns. HCLSM introduces a two-stage training protocol in which spatial reconstruction forces slot specialization before dynamics prediction begins. We train a 68M-parameter model on the PushT robotic manipulation benchmark from the Open X-Embodiment dataset, achieving 0.008 MSE next-state prediction loss with emergent spatial decomposition (SBD loss: 0.0075) and learned event boundaries. A custom Triton kernel for the SSM scan delivers a 38x speedup over a sequential PyTorch implementation. The full system spans 8,478 lines of Python across 51 modules with 171 unit tests. Code: https://github.com/rightnow-ai/hclsm
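The SSM scan that the custom Triton kernel accelerates is, at its core, the linear recurrence h_t = a_t·h_{t-1} + b_t. It becomes parallelizable because the recurrence admits an associative operator, (a1, b1) ∘ (a2, b2) = (a1·a2, a2·b1 + b2), so a GPU scan can compute it in O(log T) steps instead of T sequential ones. The NumPy sketch below shows the sequential reference and a doubling-step parallel-style scan; the variable names and scalar (per-channel) recurrence are illustrative assumptions, not the kernel itself.

```python
import numpy as np

def linear_scan_sequential(a, b):
    """Reference recurrence: h_t = a_t * h_{t-1} + b_t, with h_0 = b_0."""
    h = np.empty_like(b)
    h[0] = b[0]
    for t in range(1, len(b)):
        h[t] = a[t] * h[t - 1] + b[t]
    return h

def linear_scan_parallel(a, b):
    """Same recurrence via the associative operator
    (a1, b1) o (a2, b2) = (a1*a2, a2*b1 + b2), computed in log2(T)
    doubling steps — the structure a GPU scan kernel parallelizes."""
    a, b = a.copy(), b.copy()
    T, shift = len(b), 1
    while shift < T:
        # Elements with t < shift combine with the identity (a=1, b=0).
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        a, b = a * a_prev, a * b_prev + b
        shift *= 2
    return b

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, 16)
b = rng.standard_normal(16)
assert np.allclose(linear_scan_sequential(a, b), linear_scan_parallel(a, b))
```

The Triton kernel presumably fuses this scan over the batch and state dimensions in one launch, which is where the reported 38x speedup over the Python-level sequential loop comes from.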