This paper introduces a zero-shot reason-then-retrieve pipeline for reason-aware composed video retrieval (CoVR-R), in which Qwen3.5-27B infers the target video from a reference video and a fine-grained edit instruction. The model generates a structured description and a dense embedding for each gallery video, and the system fuses dense retrieval with a TF-IDF branch over the generated texts. The current best submission reaches 80.81 R@1 on the validation split and 89.73 R@1 on the blind test split.
Ranking the correct target video first for nearly 90% of blind-test queries (89.73 R@1), given only a reference video and a nuanced edit instruction, sets a strong reference point for composed video retrieval systems.
CoVR-R studies reason-aware composed video retrieval: given a reference video and an edit instruction, the system must retrieve the target video that satisfies the edit. The main difficulty is that the target is not described directly; it must be inferred from fine-grained changes in object identity, action order, final state, hand interaction, and scene transition. We build a zero-shot reason-then-retrieve pipeline around Qwen3.5-27B. For each gallery video, the model generates a retrieval-oriented structured description and a dense embedding by pooling generated-token hidden states with token-dependent weights. For each query, the model first performs edit reasoning over the reference video and instruction, then generates a target-video description whose hidden states serve as the query embedding. We complement dense retrieval with a TF-IDF branch over the generated texts and fuse the two rankings with split-specific weights. On validation, the current best submission reaches 80.81 at R@1, 94.86 at R@5, 97.11 at R@10, and 98.59 at R@50. On the blind test split, it reaches 89.73 at R@1, 95.79 at R@5, 96.63 at R@10, and 97.98 at R@50.
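The dense and TF-IDF branches described above can be illustrated with a minimal, standard-library-only sketch. Several details here are assumptions rather than the authors' implementation: the abstract does not say how the token-dependent weights are computed (they are passed in as plain numbers below), how the two rankings are fused (a weighted reciprocal-rank fusion with a hypothetical `dense_weight` stands in for the split-specific weights), or which TF-IDF variant is used (a smoothed IDF is assumed). The hidden states and descriptions are toy values, not model outputs.

```python
import math
from collections import Counter


def weighted_pool(hidden_states, weights):
    """Weighted mean of per-token vectors, L2-normalised (weights are assumed inputs)."""
    total = sum(weights)
    dim = len(hidden_states[0])
    pooled = [
        sum(w * h[d] for h, w in zip(hidden_states, weights)) / total
        for d in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


def fit_idf(texts):
    """Smoothed inverse document frequency over whitespace tokens (assumed variant)."""
    docs = [set(t.lower().split()) for t in texts]
    df = Counter(tok for d in docs for tok in d)
    n = len(docs)
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector; tokens unseen in the gallery are dropped."""
    tf = Counter(text.lower().split())
    return {tok: c * idf[tok] for tok, c in tf.items() if tok in idf}


def sparse_cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def ranking(scores):
    """Gallery indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def fuse(dense_scores, sparse_scores, dense_weight, k=60):
    """Weighted reciprocal-rank fusion (an assumed stand-in for the paper's fusion)."""
    fused = [0.0] * len(dense_scores)
    branches = ((dense_weight, dense_scores), (1.0 - dense_weight, sparse_scores))
    for weight, scores in branches:
        for position, idx in enumerate(ranking(scores), start=1):
            fused[idx] += weight / (k + position)
    return ranking(fused)


# Toy gallery: generated descriptions plus 2-d "hidden states" for two tokens each.
gallery_texts = [
    "a hand pours water into a red cup",
    "a hand pours water into a blue cup then stirs",
    "a person closes a laptop on a desk",
]
gallery_hidden = [
    [[1.0, 0.0], [0.8, 0.2]],
    [[0.6, 0.8], [0.5, 0.9]],
    [[0.0, 1.0], [0.1, 0.9]],
]
# Toy query: the generated target-video description and its hidden states.
query_text = "a hand pours water into a blue cup"
query_hidden = [[0.6, 0.7], [0.5, 0.8]]
token_weights = [1.0, 2.0]  # hypothetical token-dependent weights

gallery_emb = [weighted_pool(h, token_weights) for h in gallery_hidden]
query_emb = weighted_pool(query_hidden, token_weights)
dense_scores = [sum(q * g for q, g in zip(query_emb, emb)) for emb in gallery_emb]

idf = fit_idf(gallery_texts)
query_vec = tfidf(query_text, idf)
sparse_scores = [sparse_cosine(query_vec, tfidf(t, idf)) for t in gallery_texts]

fused = fuse(dense_scores, sparse_scores, dense_weight=0.7)
print("fused ranking (gallery indices):", fused)
```

In this toy setup the two branches disagree on the runner-up, and `dense_weight` decides which branch wins that tie-break. Tuning such a weight separately per split would be one plausible reading of "split-specific weights".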