HKU · PKU · Tencent AI · May 2, 2026 · arXiv:2605.01284

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

Peiyang Liu, Ziqiang Cui, Xi Wang, Di Liang, Wei Ye

AI Summary

The paper introduces Chain of Evidence (CoE), a retriever-agnostic visual attribution framework for iterative Retrieval-Augmented Generation (iRAG) that operates directly on document screenshots using Vision-Language Models. CoE addresses limitations of text-based iRAG by providing pixel-level attribution and preserving visual semantics, enabling reasoning over visually rich documents without format-specific parsing. Experiments on Wiki-CoE and SlideVQA datasets demonstrate that a fine-tuned Qwen3-VL-8B-Instruct model significantly outperforms text-based baselines, especially in tasks requiring visual layout understanding.

Key Contribution

Forget sifting through walls of text – now you can pinpoint exactly where the AI found its answer, down to the pixel, even in complex visuals like charts and diagrams.

Abstract

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) coarse-grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) visual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki-CoE, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.
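To make "pixel-level attribution" concrete, the following is a minimal sketch of how one hop of an evidence chain could be represented and compared against a reference box. The abstract does not specify CoE's output schema or evaluation metric, so everything here is an assumption made for illustration: the `EvidenceBox` class, its normalized-coordinate convention, and the use of intersection-over-union (a standard box-overlap measure) are hypothetical and not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceBox:
    """Hypothetical record for one hop: a box on one document screenshot.

    Coordinates are assumed to be normalized to [0, 1] relative to the
    screenshot, with (x0, y0) the top-left and (x1, y1) the bottom-right.
    """
    doc_id: str
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        # Degenerate (inverted) boxes count as empty.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def iou(a: EvidenceBox, b: EvidenceBox) -> float:
    """Intersection-over-union; boxes on different screenshots never overlap."""
    if a.doc_id != b.doc_id:
        return 0.0
    inter = EvidenceBox(
        a.doc_id,
        max(a.x0, b.x0), max(a.y0, b.y0),
        min(a.x1, b.x1), min(a.y1, b.y1),
    ).area()
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def to_pixels(box: EvidenceBox, width: int, height: int) -> tuple[int, int, int, int]:
    """Map a normalized box onto a screenshot of the given pixel size."""
    return (
        round(box.x0 * width), round(box.y0 * height),
        round(box.x1 * width), round(box.y1 * height),
    )


# A two-hop chain is then simply an ordered list of boxes, one per hop.
chain = [
    EvidenceBox("page_a", 0.25, 0.50, 0.75, 1.00),
    EvidenceBox("page_b", 0.10, 0.10, 0.40, 0.30),
]
```

Under these assumptions, each box in `chain` can be mapped to pixels with `to_pixels` and drawn on its screenshot, which is one way the reasoning chain described in the abstract could be visualized.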

Computer Vision · Multimodal Models · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 56

Year: 2026

Venue: N/A
