This paper introduces a vision-language model (VLM) pipeline for transcribing, semantically segmenting, and performing entity linking on historical Italian parliamentary speeches from scanned documents. The pipeline combines a specialized OCR model with a large-scale VLM to refine transcriptions, classify document elements, and identify speakers by jointly processing visual layout and text. Evaluation on a benchmark dataset shows significant improvements in transcription accuracy and speaker tagging compared to traditional OCR methods.
VLMs can open up large collections of scanned historical documents that traditional OCR could not transcribe reliably. On Italian parliamentary speeches, the proposed pipeline substantially improves both transcription quality and speaker tagging over traditional OCR pipelines.
Parliamentary proceedings represent a rich yet challenging resource for computational analysis, particularly when preserved only as scanned historical documents. Existing efforts to transcribe Italian parliamentary speeches have relied on traditional Optical Character Recognition pipelines, resulting in transcription errors and limited semantic annotation. In this paper, we propose a pipeline based on Vision-Language Models for the automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches. The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large-scale Vision-Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content. Extracted speakers are then linked to the Chamber of Deputies knowledge base through SPARQL queries and a multi-strategy fuzzy matching procedure. Evaluation against an established benchmark demonstrates substantial improvements both in transcription quality and speaker tagging.
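The abstract does not spell out the strategies behind the multi-strategy fuzzy matching step, so the following is only a minimal sketch of what such a cascade could look like, not the authors' procedure. It assumes the candidate deputies have already been retrieved from the knowledge base (for example by a SPARQL query) as a mapping from identifier to name. The function names, the four strategies, the 0.85 threshold, and the identifiers `d1` to `d3` are all hypothetical choices made for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(cleaned.lower().split())


def link_speaker(mention: str, candidates: dict[str, str], threshold: float = 0.85):
    """Try progressively looser strategies; return (candidate_id, strategy) or None.

    `candidates` maps a (hypothetical) knowledge-base identifier to a full name.
    """
    target = normalize(mention)
    target_tokens = set(target.split())
    normalized = {cid: normalize(label) for cid, label in candidates.items()}

    # Strategy 1: exact match after normalization.
    for cid, label in normalized.items():
        if label == target:
            return cid, "exact"

    # Strategy 2: same tokens in any order ("ROSSI Mario" vs "Mario Rossi").
    for cid, label in normalized.items():
        if set(label.split()) == target_tokens:
            return cid, "token_set"

    # Strategy 3: mention is a subset of exactly one candidate (surname only).
    hits = [
        cid
        for cid, label in normalized.items()
        if target_tokens and target_tokens <= set(label.split())
    ]
    if len(hits) == 1:
        return hits[0], "token_subset"

    # Strategy 4: character-level similarity, accepted only above the threshold.
    best_id, best_score = None, 0.0
    for cid, label in normalized.items():
        score = SequenceMatcher(None, target, label).ratio()
        if score > best_score:
            best_id, best_score = cid, score
    if best_id is not None and best_score >= threshold:
        return best_id, "fuzzy"

    return None  # leave the speaker unlinked rather than guess


# Hypothetical candidates, standing in for the result of a knowledge-base query.
candidates = {"d1": "Mario Rossi", "d2": "Giuseppe Verdi", "d3": "Nicolò Bianchi"}
print(link_speaker("ROSSI Mario", candidates))
print(link_speaker("Mario Rosi", candidates))
print(link_speaker("Presidente", candidates))
```

Ordering the strategies from strict to loose means an OCR-damaged name falls back to fuzzy comparison only after the cheaper, less ambiguous checks have failed. Returning `None` for role labels such as "Presidente" avoids forcing a link where the mention is not a personal name.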