Mar 30, 2026 · arXiv:2603.28103

Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models

Luigi Curini, Alfio Ferrara, Giovanni Pagano, Sergio Picascia

AI Summary

This paper introduces a vision-language model (VLM) pipeline for transcribing, semantically segmenting, and performing entity linking on historical Italian parliamentary speeches from scanned documents. The pipeline combines a specialized OCR model with a large-scale VLM to refine transcriptions, classify document elements, and identify speakers by jointly processing visual layout and text. Evaluation on a benchmark dataset shows significant improvements in transcription accuracy and speaker tagging compared to traditional OCR methods.

Key Contribution

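A VLM-based pipeline makes scanned Italian parliamentary proceedings, which traditional OCR pipelines transcribe with errors and limited semantic annotation, usable for computational analysis. Evaluated against an established benchmark, it substantially improves both transcription quality and speaker tagging.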

Abstract

Parliamentary proceedings represent a rich yet challenging resource for computational analysis, particularly when preserved only as scanned historical documents. Existing efforts to transcribe Italian parliamentary speeches have relied on traditional Optical Character Recognition pipelines, resulting in transcription errors and limited semantic annotation. In this paper, we propose a pipeline based on Vision-Language Models for the automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches. The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large-scale Vision-Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content. Extracted speakers are then linked to the Chamber of Deputies knowledge base through SPARQL queries and a multi-strategy fuzzy matching procedure. Evaluation against an established benchmark demonstrates substantial improvements both in transcription quality and speaker tagging.
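The abstract does not spell out the matching strategies, so the following is only a minimal sketch of what a multi-strategy fuzzy matching step for speaker linking could look like. It assumes the candidate deputies have already been retrieved (for example via a SPARQL query, not shown here) as a mapping from identifier to name label. The function names `normalize` and `link_speaker`, the three strategies, the 0.85 threshold, and the sample identifiers are all hypothetical and are not taken from the paper.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.lower())
    return " ".join(cleaned.split())


def link_speaker(extracted: str, candidates: dict, threshold: float = 0.85):
    """Try progressively looser strategies; return (candidate_id, strategy) or None.

    `candidates` maps a hypothetical knowledge-base identifier to a name label.
    The strategies and the threshold are illustrative assumptions.
    """
    target = normalize(extracted)
    if not target:
        return None
    normalized = {cid: normalize(label) for cid, label in candidates.items()}

    # Strategy 1: exact match on the normalized name.
    for cid, label in normalized.items():
        if label == target:
            return cid, "exact"

    # Strategy 2: every extracted token appears in exactly one candidate
    # (covers speakers printed by surname only).
    target_tokens = set(target.split())
    hits = [cid for cid, label in normalized.items()
            if target_tokens <= set(label.split())]
    if len(hits) == 1:
        return hits[0], "token_subset"

    # Strategy 3: best character-level similarity above the threshold
    # (covers residual OCR errors).
    best_id, best_score = None, 0.0
    for cid, label in normalized.items():
        score = SequenceMatcher(None, target, label).ratio()
        if score > best_score:
            best_id, best_score = cid, score
    if best_id is not None and best_score >= threshold:
        return best_id, "fuzzy"
    return None


# Illustrative candidates with made-up identifiers.
candidates = {
    "d1": "Giolitti Giovanni",
    "d2": "Crispi Francesco",
    "d3": "Sonnino Sidney",
}
print(link_speaker("GIOLITTI Giovanni", candidates))  # ('d1', 'exact')
print(link_speaker("Crispi", candidates))             # ('d2', 'token_subset')
print(link_speaker("Sonnino Sidnev", candidates))     # ('d3', 'fuzzy')
```

Ordering the strategies from strict to loose means the similarity threshold is consulted only when no unambiguous match exists, which keeps false links down when several deputies share a surname.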

Multimodal Models, Natural Language Processing, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
