This paper introduces a QA generation pipeline that aims to improve how Vision Language Models (VLMs) understand 4D scenes, a task complicated by the entanglement of object and camera motion. The pipeline casts tracking both in the traditional way and in a novel fixed reference system, dubbed True-Motion Tracking. With it, the authors generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench. Training existing models on this dataset yields performance improvements on an external benchmark, which the authors take as validation of the approach.
VLMs trained on the new 4DP-QA dataset improve on an external benchmark, which suggests that disentangling object and camera motion matters for 4D scene understanding.
Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about a 4D scene, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection onto 2D images. Second, existing datasets fail to disentangle object and camera motion. To address these challenges, we present a QA generation pipeline that focuses on motion-related scene understanding. We pay particular attention to the entanglement of camera and object motion by casting tracking both in the traditional way and in a novel, fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench. Training existing models on our dataset yields performance improvements on an external benchmark, validating the effectiveness of our method.
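The difference between tracking relative to the camera and tracking in a fixed reference system can be illustrated with a small sketch. This is not the authors' pipeline, whose details the abstract does not give. It only assumes that a camera-to-world pose (rotation and translation) is known for each frame, and all names (`camera_to_world`, `displacement`, the toy tracks) are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def camera_to_world(point_cam: Vec3, rotation: Mat3, translation: Vec3) -> Vec3:
    """Map a point from camera coordinates to the fixed world frame: R @ p + t."""
    return tuple(
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def displacement(track: List[Vec3]) -> Vec3:
    """Net displacement between the first and last position of a track."""
    return tuple(track[-1][i] - track[0][i] for i in range(3))


# Toy scenario (assumed for illustration): a static object seen by a camera
# that translates one unit along +x per frame without rotating.
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
camera_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]

# In camera coordinates the object appears to drift along -x.
track_cam = [(2.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 0.0, 5.0)]

# Re-express every observation in the fixed world frame.
track_world = [
    camera_to_world(p, IDENTITY, t) for p, t in zip(track_cam, camera_positions)
]

apparent_motion = displacement(track_cam)   # motion entangled with the camera
true_motion = displacement(track_world)     # motion in the fixed frame
```

In this toy case the camera-relative track reports a displacement along -x even though the object never moves, while the fixed-frame track reports zero displacement. That gap is the kind of entanglement of camera and object motion the abstract describes.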