Video-LLMs can achieve more reliable reasoning by first constructing a compact, structured representation of salient events and their causal relationships.
By injecting LLM-derived contextual cues into skeleton representations, SkeletonContext achieves state-of-the-art zero-shot action recognition, even distinguishing visually similar actions without explicit object interactions.
Forget tedious pose annotations: this text-to-video approach generates realistic acrobatic human motions by cascading a text-to-skeleton model with a pose-conditioned diffusion model.