Given a long video $V=\{f_{t}\}_{t=1}^{L}$ with $L$ frames and a question $Q$ with multiple-choice options $O=\{o_{i}\}$, our goal is to return an answer $A$ supported by explicit, detailed video evidence $E$.

3.2 Overall Framework

We present VideoHV-Agent, a framework that consists of three stages, as shown in Fig. 2: context summarization, two-step reasoning, and evidence integration. The two-step reasoning stage reformulates long-video question answering as a hypothesis–verification process, which can be iterated through a self-refinement loop. The pseudocode is presented in Alg. 1. Specifically, VideoHV-Agent comprises four cooperative agents. In the hypothesis generation step, a Thinker agent observes the summarized video description and proposes testable hypotheses $H$ for the candidate answers, while a Judge agent evaluates their quality and derives a concise clue $\kappa$ that specifies what needs to be verified. In the verification step, a Verifier agent grounds this clue in the video, collecting visual evidence $E$ to test it; once the clue is verified (represented by the verification status $S$), an Answer agent combines the summary and the gathered evidence to produce the final answer $A$. This framework enables interpretable, evidence-based reasoning over long video content.

3.2.1 Context Summarization

To address redundancy and complex temporal structure in long videos, we follow prior work [31, 44] by first converting each frame into a textual description via captioning, yielding frame-level captions $P_{v}$, and then deriving a compact, query-conditioned summary $P_{s}$ from these captions. Although the summarization step depends on the question and therefore cannot be performed fully offline, it is computationally lightweight compared to frame-level captioning, which requires repeated visual encoding.
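The summarization step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `summarize_context` and the `llm` callable are hypothetical stand-ins, and the prompt wording is invented for the example. The point it shows is that once per-frame captions exist, deriving the query-conditioned summary is a single text-only call.

```python
from typing import Callable, List

def summarize_context(frame_captions: List[str], question: str,
                      llm: Callable[[str], str]) -> str:
    """Build a query-conditioned summary P_s from frame-level captions P_v.

    Per-frame captioning (visual encoding) is the expensive step and is done
    once; this function only issues a single text-to-text call over the
    already-extracted captions, so it stays cheap even though it depends on
    the question and cannot be fully precomputed offline.
    """
    # Index each caption by frame so the summary can keep temporal anchors.
    joined = "\n".join(f"[frame {t}] {c}" for t, c in enumerate(frame_captions))
    prompt = (
        "Summarize the video below, keeping only content relevant to the question.\n"
        f"Question: {question}\n"
        f"Frame captions:\n{joined}\n"
        "Summary:"
    )
    return llm(prompt)

# Usage with a stand-in LLM (a real system would call an actual model here).
captions = ["a person opens a fridge", "the person pours milk into a glass"]
summary = summarize_context(
    captions, "What does the person drink?",
    llm=lambda p: "The person pours milk from the fridge into a glass.",
)
```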
In previous methods, frame captions and summaries are simply concatenated into a single long context [31], which the model consumes for both local and global reasoning, incurring a time cost linear in the number of frames. In contrast, we decouple their roles: frame-level captions are used only for clip grounding, while the concise summary is used for global reasoning in the other stages. This design preserves detailed information where necessary while keeping the overall context compact and efficient to process.

3.2.2 Two-step Reasoning

Given the summary, VideoHV-Agent can quickly narrow down plausible answers, but the summary alone is not sufficient to reliably resolve the question. Directly reasoning over all frame-level captions is also impractical: it is time-consuming, and captions mainly describe salient content, often missing the fine-grained relations or events needed for precise answering. Therefore, VideoHV-Agent adopts a two-step hypothesis–verification process. First, it reasons about what information might be missing from the summary and formulates hypotheses that explicitly imagine the potentially unseen context. Then, it performs a verification step that checks whether detailed visual evidence satisfies these hypotheses, enabling accurate and efficient long-video reasoning. Here we outline the two-step reasoning pipeline; full methodological details are provided in Sec. 3.3.

Hypothesis Generation. Given the summarized video context, the Thinker agent rewrites each answer candidate $o_{i}$ into a testable hypothesis $h_{i}$ that specifies what must be true in the video for $o_{i}$ to hold. Directly verifying all hypotheses one by one would ignore the logical relations among them. To address this, we introduce a Judge agent that evaluates the set of hypotheses and induces a discriminative clue $\kappa$, which condenses the key differences that must be checked to distinguish among them.

Hypothesis Verification.
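The Thinker and Judge roles described above can be captured with simple data structures. The following is a sketch under stated assumptions: `Hypothesis`, `generate_hypotheses`, and `derive_clue` are hypothetical names, and the `thinker`/`judge` callables stand in for prompted LLM calls; only the one-to-one option-to-hypothesis mapping and the many-to-one hypotheses-to-clue reduction come from the text.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Hypothesis:
    option: str  # candidate answer o_i
    claim: str   # what must be observable in the video for o_i to hold

def generate_hypotheses(options: List[str], summary: str,
                        thinker: Callable[[str], str]) -> List[Hypothesis]:
    """Thinker step: rewrite each option o_i into a testable hypothesis h_i."""
    return [
        Hypothesis(o, thinker(
            f"Summary: {summary}\nOption: {o}\n"
            "State what must be observable in the video for this option to hold:"))
        for o in options
    ]

def derive_clue(hypotheses: List[Hypothesis],
                judge: Callable[[str], str]) -> str:
    """Judge step: condense the hypothesis set H into one discriminative clue.

    Rather than verifying each h_i independently, the clue names the minimal
    observation whose outcome separates the competing hypotheses.
    """
    listing = "\n".join(f"- {h.claim}" for h in hypotheses)
    return judge(f"Hypotheses:\n{listing}\n"
                 "Give the single minimal observation that distinguishes them:")
```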
Guided by the clue $\kappa$, the Verifier grounds the minimal temporal context needed to evaluate it, invokes fine-grained tools (e.g., detailed captioning) to gather evidence, and outputs a structured status $\text{status}(\kappa)\in\{\text{VERIFIED},\text{PARTIAL},\text{NOT\_VERIFIED}\}$ together with a concise rationale.

3.2.3 Self-Refinement Loop

To improve robustness, VideoHV-Agent incorporates a self-refinement mechanism that mirrors human hypothesis revision. When the verification status is inconclusive, we regenerate refined hypotheses and updated clues for an extra round of the reasoning stage. Two regeneration prompts are used: (i) specificity enhancement, which makes hypotheses more concrete and testable when verification fails, and (ii) discriminability enhancement, which increases semantic contrast when hypotheses overlap. Each reasoning loop thus progressively sharpens both the clarity of the hypotheses and the precision of verification, yielding stable and logically grounded answers.

3.2.4 Evidence Integration

In the final stage, all verification results are integrated to infer the most plausible answer. With the summarized context and validated evidence, the available video information is sufficient to resolve the question. The Answer agent re-evaluates each candidate option, checks for conflicts with the evidence, and constructs a reasoning chain outlining what was tested, observed, and supported or refuted. The final prediction is produced through explicit, evidence-grounded reasoning.

3.3 Details of Two-step Reasoning

3.3.1 Step 1: Hypothesis

Hypothesis Generation. We cast answer formation as hypothesis drafting: for each candidate option $o_{i}$, the Thinker agent produces a testable hypothesis $h_{i}$ that, if observed in the video, would make $o_{i}$ correct.
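The three-way verification status and the choice between the two regeneration prompts can be made concrete as follows. This is a hedged sketch: the prompt strings and the `regeneration_prompt` selector are illustrative inventions; the text specifies only that specificity enhancement addresses failed verification and discriminability enhancement addresses overlapping hypotheses.

```python
from enum import Enum
from typing import Optional

class Status(Enum):
    """Structured verification outcome for a clue, as used by the Verifier."""
    VERIFIED = "VERIFIED"
    PARTIAL = "PARTIAL"
    NOT_VERIFIED = "NOT_VERIFIED"

# Illustrative wording; the paper's actual prompts are not given here.
SPECIFICITY_PROMPT = ("Rewrite the hypotheses to be more concrete and "
                      "testable against visual evidence.")
DISCRIMINABILITY_PROMPT = ("Rewrite the hypotheses to increase the semantic "
                           "contrast between overlapping candidates.")

def regeneration_prompt(status: Status,
                        hypotheses_overlap: bool) -> Optional[str]:
    """Pick the refinement prompt for another reasoning round, or None to stop.

    VERIFIED ends the loop; otherwise overlapping hypotheses call for
    discriminability enhancement, and a plain failure calls for specificity
    enhancement.
    """
    if status is Status.VERIFIED:
        return None
    if hypotheses_overlap:
        return DISCRIMINABILITY_PROMPT
    return SPECIFICITY_PROMPT
```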
The hypothesis $h_{i}$ specifies what must be true in the video for $o_{i}$ to hold, explicitly naming the salient entities/objects, actions/events, and temporal/causal constraints. Before generating hypotheses, the Thinker agent filters out ill-posed or clearly incorrect options using only the summarized context, thereby reducing irrelevant noise in the subsequent verification step and avoiding unnecessary reasoning cost. Formally, we map the option set $O=\{o_{i}\}$ to a hypothesis set $H=\{h_{i}\}$ with a one-to-one correspondence intended for later verification.

Clue Generation. To enhance discriminability, the Judge agent further produces a concise clue $\kappa$ for the hypothesis set $H$. The clue summarizes the minimal observation that can distinguish $h_{i}$ from competing hypotheses, such as a specific object interaction, an event order, or a visual outcome that would hold only if $h_{i}$ were true. This clue serves as focused guidance for the Verifier agent in the verification step, defining what needs to be checked and what kind of evidence would support or refute the hypothesis.

3.3.2 Step 2: Verification

Temporal Localization. Given the clue $\kappa$, the Verifier uses the frame-level captions to localize the most probable temporal window in which the clue appears, focusing clip-level inspection on decisive evidence rather than the entire video.

Detailed Captioning. After selecting the timestamp range, the Verifier revisits the raw frames within the window and invokes fine-grained captioning to extract detailed evidence for verification, ensuring that the system's capabilities are not constrained by the initial visual-to-text translation. Each call processes at most five frames.

Clue Verification. The verification status for $\kappa$ is one of $\text{status}(\kappa)\in\{\text{VERIFIED},\text{PARTIAL},\text{NOT\_VERIFIED}\}$, accompanied by a concise rationale (timestamps, entities, relations).
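The five-frames-per-call constraint on detailed captioning implies a simple batching scheme over the localized window. The helper below is an assumed sketch (`caption_batches` and `MAX_FRAMES_PER_CALL` are names invented for illustration); only the limit of at most five frames per call comes from the text.

```python
from typing import List, Sequence, Tuple

MAX_FRAMES_PER_CALL = 5  # each detailed-captioning call sees at most 5 frames

def caption_batches(window: Sequence[int]) -> List[Tuple[int, ...]]:
    """Split the localized frame window into batches for fine-grained captioning.

    The Verifier localizes a short window first, so the number of batches
    depends on the window length, not on the full video length.
    """
    return [tuple(window[i:i + MAX_FRAMES_PER_CALL])
            for i in range(0, len(window), MAX_FRAMES_PER_CALL)]

# A 12-frame window localized around the clue yields batches of 5, 5, and 2.
batches = caption_batches(range(100, 112))
```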
VERIFIED means the clue is supported by the evidence, so an answer can be derived from this clue together with the summarized context. The Verifier agent then synthesizes all collected evidence into a reasoning trace that documents the logical inference, which is subsequently used by the Answer agent. PARTIAL indicates that part of the clue is supported by the observed evidence but additional evidence is required. Since a single round of evidence collection may be insufficient or error-prone, the Verifier can trigger additional rounds when the current assessment is inconclusive, explicitly specifying what further observations and frame ranges are needed. VideoHV-Agent then performs a small, verification-only self-refinement loop that retrieves detailed descriptions from new timestamps and integrates them into the reasoning context. NOT_VERIFIED indicates that the agent finds the clue sub-optimal and that it should be regenerated along with the hypotheses; when this status is returned, the large hypothesis–verification self-refinement loop is activated.
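Putting the two-step reasoning stage together, the outer hypothesis-verification loop can be sketched as below. This is a minimal sketch, not the authors' code: `hv_loop` is a hypothetical name, the four callables stand in for the prompted Thinker, Judge, Verifier, and Answer agents, and the inner verification-only refinement is omitted for brevity. The control flow it encodes is the one described above: regenerate while the clue is NOT_VERIFIED, stop on VERIFIED or PARTIAL, then integrate evidence into the answer.

```python
from typing import Callable, List

def hv_loop(question: str, options: List[str], summary: str,
            think: Callable, judge: Callable, verify: Callable,
            answer: Callable, max_rounds: int = 3) -> str:
    """Iterate hypothesis generation and verification until the clue is no
    longer NOT_VERIFIED (or the round budget is exhausted), then let the
    Answer step integrate the summary, clue, and evidence."""
    clue, evidence = None, None
    for _ in range(max_rounds):
        hypotheses = think(question, options, summary)  # Thinker: o_i -> h_i
        clue = judge(hypotheses, summary)               # Judge: H -> kappa
        evidence, status = verify(clue)                 # Verifier: kappa -> (E, S)
        if status != "NOT_VERIFIED":                    # VERIFIED or PARTIAL
            break                                       # keep clue and evidence
    return answer(question, options, summary, clue, evidence)
```

With stub agents, a clue that fails once and verifies on the second round takes exactly two iterations before answering, which is the regeneration behavior the self-refinement loop is meant to provide.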
Algorithm 1 VideoHV-Agent

Require: video $V$; question $Q$; answer options $O$; frame captioner $F_{vc}$; LLM $F_{llm}$; video summarizer $F_{vs}$
Ensure: final answer $A$ (intermediate: hypotheses $H$; clue $\kappa$; evidence $E$; verification status $S$)

// Context Summarization
$\mathcal{P}_{v} \leftarrow F_{vc}(V)$
$\mathcal{P}_{s} \leftarrow F_{vs}(\mathcal{P}_{v}, Q)$
// Two-Step Reasoning
for $t = 1$ to $T$ do
  // Step 1: Hypothesis Generation
  $H \leftarrow F_{llm}(Q, O, \mathcal{P}_{s}, prompt_{\text{Hypothesis}})$
  $\kappa \leftarrow F_{llm}(H, \mathcal{P}_{s}, prompt_{\text{Judge}})$
  // Step 2: Hypothesis Verification
  $E \leftarrow F_{llm}(\kappa, \mathcal{P}_{v}, prompt_{\text{Evidence}})$
  $S \leftarrow F_{llm}(\kappa, E, prompt_{\text{Verify}})$
  if $S = \text{NOT\_VERIFIED}$ then
    continue
  else
    break
  end if
end for
// Evidence Integration
$A \leftarrow F_{llm}(Q, O, \mathcal{P}_{s}, \kappa, E, prompt_{\text{Answer}})$
return $A$

4 Experiment

4.1 Datasets and Metrics

We compare VideoHV-Agent against strong zero-shot and supervised baselines under the accuracy metric, evaluating performance on three multiple-choice video question answering benchmarks: (i) EgoSchema [22] is a large-scale test-only benchmark built on Ego,

(This work was supported by the Hong Kong Research Grants Council through the Areas of Excellence (AoE) Scheme under Grant AoE/E-601/22-R and by NSFC under Grant No. 62372442. Corresponding author: C. Patrick Yue.)