Luxembourg | Apr 13, 2026 | arXiv:2604.12047

Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

Omar El Bachyr, Yewei Song, Tegawendé F. Bissyandé, Anas Zilali, Ulrick Ble, Anne Goujon

AI Summary

This paper benchmarks the performance of various PDF parsers and chunking strategies within Retrieval-Augmented Generation (RAG) systems for financial question answering. The authors introduce TableQuest, a new financial QA benchmark, and evaluate different parser/chunker combinations on it and an existing benchmark. Results show significant performance variation based on parser and chunking strategy, offering practical guidance for building effective PDF-based RAG pipelines.

Key Contribution

Naive PDF parsing and chunking can severely bottleneck RAG performance on financial documents; careful selection yields substantial gains.

Abstract

PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To address these difficulties, both practitioners and researchers are increasingly developing new methods for automated PDF processing, including promising Retrieval-Augmented Generation (RAG) systems. However, there is no comprehensive study investigating how different components and design choices affect the performance of a RAG system for understanding PDFs. In this paper, we present such a study (1) by focusing on Question Answering, a specific language understanding task, and (2) by leveraging two benchmarks from the financial domain, including TableQuest, our newly generated, publicly available benchmark. We systematically examine multiple PDF parsers and chunking strategies (with varied overlap), along with their potential synergies in preserving document structure and ensuring answer correctness. Overall, our results offer practical guidelines for building robust RAG pipelines for PDF understanding.
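
To make "chunking strategies (with varied overlap)" concrete, the following is a minimal sketch of fixed-size character chunking with a configurable overlap. It is an illustration only, not the chunkers evaluated in the paper: the function name `chunk_text` and the parameters `chunk_size` and `overlap` are hypothetical, and real pipelines often split on tokens, sentences, or document structure instead of raw characters.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows sharing `overlap` characters.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    parsed = "Revenue rose in 2023. Operating costs fell. Net income doubled."
    for size, ov in [(30, 0), (30, 10)]:
        print(size, ov, chunk_text(parsed, chunk_size=size, overlap=ov))
```

A larger overlap lowers the chance that an answer span, such as a table row and its header, is cut in two at a chunk boundary, at the cost of more chunks to embed and retrieve.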

Eval Frameworks & Benchmarks | Natural Language Processing | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
