Abstract

Large vision-language models (LVLMs) have become increasingly capable but remain prone to hallucinations in multimodal tasks, which significantly narrows their deployment. As training LVLMs to avoid hallucinations becomes prohibitively expensive at larger scales, training-free methods offer a cheap and flexible alternative, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with an evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable, structured textual evidence. Second, to take full advantage of this evidence, Kestrel verifies it via an LVLM judge, then iteratively self-refines answers based on the verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., an average of +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis (e.g., the integrated self-refinement module and grounding agent together contribute an average +2.0% gain on POPE). Project website: https://jwmao1.github.io/Kestrel_project/

1 Introduction

Recent advances in large-scale pretraining [26, 1, 34] and multimodal instruction tuning [23, 8] have substantially improved the capabilities of large vision-language models (LVLMs) [10, 3, 31] on multimodal understanding and reasoning tasks such as visual question answering (VQA). However, LVLMs still exhibit hallucination, producing responses that are inconsistent with or weakly supported by the input image.
For example, empirical studies [20, 29, 28] show that this issue remains prevalent, making hallucination a central challenge for improving the reliability of LVLMs. To mitigate hallucination, two broad classes of methods have been proposed: training-based and training-free. For the training-based line of work, continual training with hallucination annotations or alignment with external feedback has been shown to be effective [36, 7, 15, 28, 24]. However, these solutions incur significant data and compute overhead, posing hurdles to real-world deployment. Existing training-free methods improve test-time correction without additional training, but leave key gaps: (i) limited gains and robustness when operating purely on internal decoding dynamics without external grounding evidence, and (ii) limited reliability when correction is performed in a single pass. Distribution-contrast methods [17, 32] can reduce object hallucinations but remain sensitive to perturbations and often favor common-object representations. Many approaches rely on internal logit dynamics [13] or language-level decoding control [12], which can yield brittle corrections that are difficult to validate against concrete visual evidence. On the other hand, methods that introduce external verification may produce non-deterministic evidence due to the randomness of tools [35]. While other methods [33] can collect reliable evidence, their one-shot verification-and-update can be insufficient to prevent over-correction in challenging cases.

Figure 2: Kestrel vs. prior training-free hallucination mitigation methods. By combining an external grounding agent with iterative self-improvement, Kestrel collects explicit visual evidence and further converts tool outputs into structured textual evidence for verification.
This design yields more interpretable and stable evidence, reduces overconfident corrections, and, compared with prior approaches, avoids the biased interpretation that may arise when LVLMs rely only on raw visual evidence.

Motivated by these limitations, we propose Kestrel (see Fig. 1), a training-free framework for LVLM hallucination mitigation that unifies an explicit visual grounding agent with evidence-driven iterative self-refinement. Specifically, Kestrel first decomposes the question into verifiable claim-level targets (e.g., existence, color, count, and position), and then invokes SAM3 [4] around each target to collect segmentation overlays, bounding boxes, target crop-and-zoom views, and textual evidence derived from this visual evidence, all of which are collated as structured evidence items with citation identifiers. The framework then performs claim-level verification with verdicts and outputs confidence-aware verification results, forming an auditable evidence chain. To guard against over-correction, we further introduce an evidence-gated update scheme into the iteration: the framework progressively supplements and strengthens claim-level evidence through multiple rounds of verification and revision, and permits answer flips only when evidence strength, confidence, and evidence coverage jointly satisfy predefined criteria. These designs preserve the training-free property while improving the interpretability, robustness, and decision stability of hallucination mitigation. Experiments show that Kestrel remains fully training-free, yet consistently reduces hallucinations at test time across multiple benchmarks, with improvements that transfer across different state-of-the-art LVLM backbones.
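The evidence-gated update described above can be made concrete with a small decision rule. The following is a minimal sketch under stated assumptions: the paper does not publish thresholds or field names, so `Verdict`, `allow_flip`, `min_conf`, and `min_coverage` are all illustrative choices, not Kestrel's actual implementation.

```python
# Hypothetical sketch of an evidence-gated update rule. All names and
# thresholds are illustrative assumptions, not taken from the paper.
from dataclasses import dataclass

@dataclass
class Verdict:
    claim: str
    label: str         # "supported" | "contradicted" | "insufficient"
    confidence: float  # judge confidence in [0, 1]
    citations: int     # number of evidence items cited by the judge

def allow_flip(verdicts, min_conf=0.8, min_coverage=0.75):
    """Permit an answer flip only when evidence strength, confidence,
    and evidence coverage jointly pass predefined criteria."""
    if not verdicts:
        return False
    # Coverage: fraction of claims resolved with at least one cited evidence item.
    resolved = [v for v in verdicts
                if v.label != "insufficient" and v.citations > 0]
    coverage = len(resolved) / len(verdicts)
    # Strength: at least one confident, evidence-cited contradiction.
    contradicted = [v for v in resolved
                    if v.label == "contradicted" and v.confidence >= min_conf]
    return bool(contradicted) and coverage >= min_coverage
```

When the gate returns False, the current answer would be kept and further evidence collected in the next round, mirroring the conservative behavior described above.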
On POPE [20] (MS-COCO, A-OKVQA, and GQA), Kestrel improves accuracy by an average of +3.31 percentage points over Qwen3-VL and +3.03 over InternVL3.5; it also surpasses prior training-free mitigation baselines by +1.38 and +1.47 percentage points on average under the same backbones, respectively. On the more challenging MME-Hallucination [9], Kestrel boosts Qwen3-VL by +28.34 points and exceeds OPERA [12] by +16.67, delivering consistent gains across diverse hallucination types (existence, count, and position) while maintaining strong overall performance and setting a new state of the art. Our main contributions are summarized as follows:

• We propose Kestrel, a training-free LVLM hallucination mitigation framework that unifies an explicit visual grounding agent with iterative self-refinement at test time. Kestrel decomposes answers into verifiable claim-level targets, grounds them with structured visual and textual evidence, and performs conservative multi-round verification and revision to improve interpretability and reduce over-correction.

• Kestrel achieves state-of-the-art performance in hallucination mitigation on POPE and the more fine-grained MME-Hallucination.

• Kestrel generalizes across multiple state-of-the-art LVLM backbones with substantial and consistent gains, showing that the framework is backbone-agnostic and broadly applicable in the training-free setting.

2 Related Work

2.1 Large Vision-Language Models

Large vision-language models (LVLMs) have advanced rapidly through large-scale multimodal pretraining and instruction tuning, achieving strong performance across multimodal understanding and reasoning tasks. Representative paradigms include CLIP-style vision-language pretraining [26], Flamingo-style few-shot multimodal modeling [1], and BLIP-2-style [18] modular alignment between frozen vision encoders and LLMs.
LVLMs such as LLaVA [23, 22], InstructBLIP [8], OpenFlamingo [2], CogVLM [30], Kosmos-2 [25], and recent models [3, 31, 10, 5, 6] further demonstrate the effectiveness of scalable multimodal alignment and visual instruction tuning. Meanwhile, grounded multimodal modeling has become increasingly important, as exemplified by Kosmos-2 [25], which explicitly supports phrase grounding and visual referring. Nevertheless, current LVLMs still struggle to maintain faithful grounding between generated responses and image content, especially in fine-grained reasoning scenarios, making hallucination a persistent challenge for reliable deployment.

2.2 Hallucination in LVLMs

Hallucination is a persistent problem in large vision-language models (LVLMs). Early studies [20] show that LVLMs often generate content inconsistent with the input image, especially by predicting non-existent objects, while POPE [20] improves the stability of such evaluation. Subsequent work shows that hallucination extends beyond object existence to finer-grained errors in attributes, counts, and relations, as benchmarked by AMBER [29]. More challenging settings, such as visual illusions and ambiguous local evidence, are further explored in HallusionBench [11]. Broader benchmarks, including MME [9], MMHal-Bench [28], and THRONE [16], further suggest that hallucination is heterogeneous, benchmark-sensitive, and closely tied to failures in visual grounding and multimodal reasoning. These findings motivate mitigation methods that verify model outputs against explicit and fine-grained visual evidence.

2.3 Training-based Hallucination Mitigation

Early approaches improve faithfulness by redesigning instruction data or supervision signals so that models better distinguish grounded from ungrounded responses. For example, robust visual instruction tuning [7] introduces hallucination-oriented supervision, while HACL [15] uses contrastive learning to separate grounded and hallucinated representations.
Reflective instruction tuning [36] further improves reliability by adding rationale supervision. Alignment-based methods, such as factually augmented RLHF [28] and Silkie [19], incorporate preference or factual signals during post-training, and HIO [24] strengthens token-level contrastive learning around hallucinated content. Overall, training-based methods are effective, but they usually require additional annotations, synthetic data, preference collection, or repeated optimization, leading to higher training cost and deployment complexity.

Figure 3: Overview of Kestrel. Given an image-question pair, Kestrel follows a training-free four-stage pipeline for LVLM hallucination mitigation: (1) Initialization, which obtains an initial answer and rewrites it into question-aligned verifiable claims with associated visual entities and claim types; (2) Agent Grounding, which invokes an external SAM3-based grounding agent to collect explicit visual evidence (e.g., segmentation overlays, boxes, and crop-and-zoom views) and convert it into structured textual evidence; (3) Claim-level Verification, which verifies each claim against the cited evidence to produce claim-wise verdicts, confidence scores, and a top-level verification decision; and (4) Self-Refinement, which performs evidence-gated answer updating based on the current and previous verification traces.

2.4 Training-free Hallucination Mitigation

Training-free hallucination mitigation aims to reduce hallucination at inference time without updating model parameters. A major line of work focuses on contrastive or decoding-based strategies, such as VCD [17], RITUAL [32], OPERA [12], and SHIELD [13], which alleviate hallucination by intervening on decoding behavior or visual token representations. Another line introduces explicit verification or post-hoc correction. For example, Woodpecker [33] adopts a multi-stage correction pipeline, while DeGF [35] leverages text-to-image generative feedback for iterative refinement.
Meanwhile, recent grounding models such as SAM3 [4] make it increasingly practical to collect explicit visual evidence at inference time. Compared with prior training-free methods, our work further emphasizes combining explicit grounding evidence with conservative iterative self-refinement to mitigate hallucination.

3 Method

We propose Kestrel, a training-free framework for mitigating LVLM hallucination with an explicit visual grounding agent and structured evidence-driven self-refinement at test time. Given an image I and a corresponding question Q, Kestrel iteratively follows a four-step pipeline: (i) initialization, (ii) agent grounding, (iii) claim-level verification, and (iv) self-refinement (see Fig. 3).

3.1 Initialization

Kestrel first queries the LVLM to obtain an initial answer A^(0). To support claim-level verification, Kestrel converts Q into a small set of verifiable claims that directly correspond to the question. Concretely, Kestrel rewrites the question-answer decision into visually checkable claims, each anchored to one or two concrete visual entities. These entities serve as the detection targets for the grounding agent. Meanwhile, based on the verifiable attributes required by the question, we categorize the extracted claims (e.g., existence, color, count, position) to route subsequent agent grounding.

3.2 Agent Grounding

To obtain explicit, inspectable grounding evidence, Kestrel invokes an external visual grounding agent built on SAM3 promptable concept segmentation [4].

Visual evidence.
SAM3 takes the visual entities in the claims as concept prompts and returns the matched instances, from which Kestrel collects explicit visual evidence, including: (i) segmentation overlays for transparent localization, (ii) instance bounding boxes (derived from SAM3 masks) to support geometry-based reasoning, and (iii) crop-and-zoom views around predicted instances to reduce ambiguity when inspecting attributes (e.g., color) and local details.

Structured textual evidence. To make agent outputs directly usable for claim verification and auditable diagnosis, Kestrel derives textual evidence from the visual evidence via the LVLM for each claim type: (i) for existence, we convert the predicted instances into an existence statement by checking whether the number of matched instances is greater than zero; (ii) for count, we report the instance count computed from the number of predicted masks; (iii) for color, we generate a concise color observation conditioned on the masked crop-and-zoom view and the full image; (iv) for position, we convert SAM3 geometry into text by deriving coarse spatial cues from the union bounding box and, when two entities are involved, computing their relative relation from the corresponding bounding-box centers. Each textual evidence item is paired with a citation identifier and can be referenced during verification and answer revision.

3.3 Claim-level Verification

Given the claims and the corresponding structured evidence items, Kestrel performs claim-level verification using an LVLM-as-a-judge. The judge is instructed to base its decision only on the provided evidence and to cite the corresponding evidence items. For each claim, the verifier outputs: (i) a verdict (supported / contradicted / insufficient), (ii) a confidence score, and (iii) a short reasoning that must cite the relevant evidence.
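The per-type conversion of grounding outputs into citable textual evidence (Structured textual evidence above) can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function name, the evidence-item format, and the stand-in for SAM3 outputs (precomputed bounding boxes per entity, plus a color observation from the LVLM pass) are all assumptions.

```python
# Illustrative sketch of rendering structured textual evidence from
# grounding-agent outputs. Names and formats are assumptions.

def center(box):
    """Center point of an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def textual_evidence(claim_type, entity, boxes,
                     other=None, other_boxes=None, color_obs=None):
    """Render one citable textual evidence item for a claim."""
    if claim_type == "existence":
        found = len(boxes) > 0
        text = f"{entity}: {'present' if found else 'not found'} in the image"
    elif claim_type == "count":
        # Count is derived from the number of predicted masks/boxes.
        text = f"{entity}: {len(boxes)} instance(s) detected"
    elif claim_type == "color":
        # color_obs would come from an LVLM pass on the crop-and-zoom view.
        text = f"{entity}: observed color is {color_obs}"
    elif claim_type == "position" and other_boxes:
        # Relative relation from bounding-box centers (horizontal case only).
        (cx, _), (ox, _) = center(boxes[0]), center(other_boxes[0])
        rel = "left of" if cx < ox else "right of"
        text = f"{entity} is {rel} {other}"
    else:
        text = f"{entity}: no usable evidence"
    return {"id": f"E-{claim_type}-{entity}", "text": text}
```

The `id` field plays the role of the citation identifier that the verifier and revision steps can reference.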
We then consolidate claim-wise judgments into a top-level verification verdict for the current answer: it is labeled contradicted if any claim is confidently refuted with cited evidence, supported only when all claims are confidently supported, and insufficient otherwise. The resulting verification trace constitutes an explicit, evidence-grounded audit trail, enabling interpretable analysis of when hallucinations arise and how corrections are triggered.

3.4 Self-Refinement

Since the external agent and the LVLM may be unreliable, directly revising the answer based on the instance-level verdict can introduce over-correction. Therefore, Kestrel adopts an evidence-gated self-refinement strategy: it permits correction only when the verification provides sufficiently reliable signals, i.e., high-confidence claim-level judgments together with cited evidence for the corresponding claims. Otherwise, Kestrel preserves the current answer A^(i) (where i denotes the i-th iteration) and proceeds to collect stronger evidence in subsequent rounds. Importantly, the self-refinement is stateful: the revision step conditions not only on the current verification results but also on prior rounds' claims, evidence, and decisions. Based on the verification trace, Kestrel updates the answer to obtain A^(i+1) and proposes a new set of claims for the next iteration, prioritizing claims that remain uncertain or are implicated by contradictions. This iterative process progressively strengthens evidence and stabilizes decision-making, while remaining training-free. Kestrel repeats the cycle for a small number of iterations, and stops early when the answer stabilizes under consistently supportive verification, or when additional iterations no longer yield stronger evidence. The final output is the answer together with its claim-level verification traces.

4 Experiments

4.1 Experimental Setup

Table 1: Results on POPE [20] benchmark.
Higher (↑) accuracy indicates better performance. The best results are bolded, and the second-best are underlined. [Table structure: rows grouped by backbone (e.g., Qwen3-VL [3]) and method; columns report accuracy on MS-COCO [21], A-OKVQA [27], and GQA [14], each under the Random, Popular, and Adversarial settings. Numeric entries not recoverable from this extraction.]
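POPE evaluates hallucination with balanced yes/no questions about object existence, reporting accuracy per negative-sampling split (random, popular, adversarial). As a minimal sketch of this style of scoring, assuming a simple record format and a heuristic answer normalizer (neither is the paper's evaluation code):

```python
# Sketch of POPE-style per-split accuracy. Record format and the
# normalization heuristic are assumptions for illustration.

def normalize(answer):
    """Map a free-form model answer to a yes/no label (simple heuristic)."""
    return "yes" if "yes" in answer.lower() else "no"

def pope_accuracy(records):
    """records: iterable of (split, gold_label, model_answer) triples."""
    hits, totals = {}, {}
    for split, gold, answer in records:
        totals[split] = totals.get(split, 0) + 1
        hits[split] = hits.get(split, 0) + (normalize(answer) == gold)
    # Per-split accuracy in [0, 1].
    return {s: hits[s] / totals[s] for s in totals}
```

Averaging the three split accuracies per dataset gives the kind of aggregate numbers reported in Table 1.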