This paper introduces Visual Representation-Guided Video-LLM Reasoning, a training-free framework for composed video retrieval. Frozen DINOv3 models first generate a compact set of visually relevant candidates, and large vision-language models then assess whether each candidate complies with the modification instruction. The approach reports a Recall@1 of 48.78 and a Recall@5 of 51.48 on the test set, with no additional training.
Reaching 48.78 Recall@1 on the challenge test set without any training suggests that frozen visual encoders combined with vision-language model reasoning can handle composed video queries.
Recent advances in large vision-language models have expanded video retrieval from simple text-based search to more flexible scenarios, where users may specify the desired result through both visual examples and textual instructions. In the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge, a system must retrieve a target video given a reference video and a modification instruction. To address this task, we develop Visual Representation-Guided Video-LLM Reasoning for Training-Free Composed Video Retrieval. Our framework first uses frozen DINOv3 models to obtain a compact set of visually relevant candidates, and then applies large vision-language models to evaluate whether each candidate satisfies the modification instruction. A final reasoning-based refinement is then performed on the top candidates to improve the first-ranked prediction. Without training, our system achieves 48.78 Recall@1 and 51.48 Recall@5 on the test set. Future work may further improve retrieval accuracy through stronger video-LLMs and closer integration between visual representations and language reasoning.
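The two-stage structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each video has already been reduced to a single embedding vector (for example by pooling frozen DINOv3 features), and it stands in for the vision-language model with a hypothetical callable, `score_fn`, that returns a higher number when a candidate better satisfies the modification instruction. The function names `shortlist`, `rerank`, and `retrieve` are illustrative only, and the final reasoning-based refinement of the top candidates is omitted.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shortlist(ref_emb: Sequence[float],
              gallery_embs: Sequence[Sequence[float]],
              k: int) -> List[int]:
    """Stage 1: keep the k gallery indices most visually similar to the reference."""
    sims = [cosine(ref_emb, g) for g in gallery_embs]
    order = sorted(range(len(sims)), key=lambda i: -sims[i])
    return order[:k]


def rerank(candidates: Sequence[int],
           score_fn: Callable[[int], float]) -> List[int]:
    """Stage 2: reorder candidates by an instruction-compliance score.

    score_fn is a hypothetical placeholder for a vision-language model call.
    Python's sort is stable, so ties keep their visual-similarity order.
    """
    return sorted(candidates, key=lambda i: -score_fn(i))


def retrieve(ref_emb: Sequence[float],
             gallery_embs: Sequence[Sequence[float]],
             score_fn: Callable[[int], float],
             k: int = 3) -> List[int]:
    """Run the visual shortlist, then rerank only the shortlisted candidates."""
    return rerank(shortlist(ref_emb, gallery_embs, k), score_fn)


# Toy usage with made-up 2-D embeddings and made-up compliance scores.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
reference = [1.0, 0.0]
toy_scores = {0: 0.1, 1: 0.8, 3: 0.5}
ranking = retrieve(reference, gallery, lambda i: toy_scores[i], k=3)
```

In this sketch the expensive scorer is called only on the shortlisted candidates, which is the reason for running a cheap visual filter first: index 2 is never scored because stage 1 drops it.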