The paper identifies a critical failure in Video-LLMs, directional motion blindness: despite progress in temporal video understanding, most models perform near chance at discerning basic motion directions (left, right, up, down). The authors trace the issue to a "direction binding gap", in which motion information is present in the model's representations but is not correctly linked to the verbal answer options. To address this, they introduce MoDirect, a dataset family for motion direction instruction tuning and evaluation, and DeltaDirect, a projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. Instruction tuning with DeltaDirect raises motion direction accuracy on MoDirect-SynBench from 25.9% to 85.4%.
Video-LLMs can ace complex video understanding yet still fail to tell whether something is moving left or right, revealing a surprising blind spot in their perceptual abilities.
Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, right, up, or down, most Video-LLMs perform near chance, with above-chance cases largely attributable to prediction biases rather than genuine direction understanding. We call this failure directional motion blindness. We localize the failure by tracing motion direction information through the Video-LLM pipeline. Motion direction remains linearly accessible from the vision encoder, projector, and LLM hidden states, but the readout fails to bind this signal to the correct verbal answer option, revealing a direction binding gap. Although synthetic motion direction instruction tuning reduces this gap on the source domain, motion direction concept vector analysis shows that visual complexity weakens the signal magnitude and limits out-of-domain generalization. We introduce MoDirect, a dataset family for motion direction instruction tuning and evaluation, and DeltaDirect, a diagnosis-driven, projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. On MoDirect-SynBench, instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4%. On MoDirect-RealBench, DeltaDirect improves real-world motion direction accuracy by 21.9 points over the vanilla baseline without real-world tuning data, while preserving standard video-understanding performance. Code: https://github.com/KHU-VLL/DeltaDirect
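To make the idea of predicting normalized 2-D motion vectors from adjacent-frame feature deltas concrete, the following is a minimal sketch, not the authors' implementation. It assumes one feature vector per frame, a hypothetical linear head `weight` of shape 2 x D, and a cosine-distance loss; the abstract does not specify the head architecture or the loss form, so all function names and these choices are illustrative assumptions.

```python
import math


def feature_deltas(frames):
    """Return the T-1 differences between adjacent per-frame feature vectors."""
    return [[b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(frames, frames[1:])]


def normalize(vec, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def predict_motion(delta, weight):
    """Apply a hypothetical 2 x D linear head to one feature delta."""
    return [sum(w * d for w, d in zip(row, delta)) for row in weight]


def delta_direction_loss(frames, motions, weight):
    """Mean (1 - cosine similarity) between predicted and target directions.

    frames:  T feature vectors of dimension D
    motions: T-1 ground-truth 2-D image-plane motion vectors (dx, dy)
    weight:  2 x D linear head (assumed here; not taken from the paper)
    """
    deltas = feature_deltas(frames)
    total = 0.0
    for delta, motion in zip(deltas, motions):
        pred = normalize(predict_motion(delta, weight))
        target = normalize(motion)  # only the direction matters, not speed
        total += 1.0 - sum(p * t for p, t in zip(pred, target))
    return total / len(deltas)


# Toy example: the first feature dimension tracks horizontal position.
frames = [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [2.0, 0.0, 1.0]]
motions = [[3.0, 0.0], [3.0, 0.0]]           # object moves right
aligned_head = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
flipped_head = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

loss_aligned = delta_direction_loss(frames, motions, aligned_head)
loss_flipped = delta_direction_loss(frames, motions, flipped_head)
```

In this toy setup the aligned head yields a loss near 0 and the sign-flipped head a loss near 2, which illustrates why an objective on signed feature deltas penalizes confusing left with right. In a real Video-LLM the deltas would be taken over projector outputs and the head trained jointly with instruction tuning, details that this sketch leaves out.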