Feb 18, 2026 | arXiv:2602.16516

Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification

Taja Kuzman Pungeršek, Peter Rupnik, Daniela Širinić, Nikola Ljubešić

AI Summary

The paper introduces ParlaCAP, a dataset of over 8 million parliamentary speeches from 28 parliaments of European countries and autonomous regions, annotated with the Comparative Agendas Project (CAP) schema, and presents a cost-effective method for building domain-specific policy topic classifiers. The authors use a teacher-student framework in which a large language model (LLM) annotates in-domain training data and a multilingual encoder model is then fine-tuned on those annotations. Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting classifier outperforms existing CAP classifiers trained on manually annotated but out-of-domain data.

Key Contribution

LLMs can bootstrap high-quality, domain-specific policy classifiers from multilingual parliamentary data, rivaling human annotation at a fraction of the cost.

Abstract

This paper introduces ParlaCAP, a large-scale dataset for analyzing parliamentary agenda setting across Europe, and proposes a cost-effective method for building domain-specific policy topic classifiers. Applying the Comparative Agendas Project (CAP) schema to the multilingual ParlaMint corpus of over 8 million speeches from 28 parliaments of European countries and autonomous regions, we follow a teacher-student framework in which a high-performing large language model (LLM) annotates in-domain training data and a multilingual encoder model is fine-tuned on these annotations for scalable data annotation. We show that this approach produces a classifier tailored to the target domain. Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data. In addition to the CAP annotations, the ParlaCAP dataset offers rich speaker and party metadata, as well as sentiment predictions coming from the ParlaSent multilingual transformer model, enabling comparative research on political attention and representation across countries. We illustrate the analytical potential of the dataset with three use cases, examining the distribution of parliamentary attention across policy topics, sentiment patterns in parliamentary speech, and gender differences in policy attention.
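The comparison of LLM-human agreement with human inter-annotator agreement can be illustrated with a minimal sketch. The abstract does not name the agreement metric, so Cohen's kappa is an assumption here, and the labels below are made-up toy data, not ParlaCAP annotations. The function name `cohen_kappa` and the variable names are likewise illustrative.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    Assumption: kappa stands in for whatever agreement metric the paper
    actually reports; the abstract does not specify one.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical), using three CAP major topic names.
human_1 = ["health", "education", "health", "macroeconomics", "education", "health"]
human_2 = ["health", "education", "macroeconomics", "macroeconomics", "education", "health"]
llm = ["health", "education", "health", "macroeconomics", "health", "health"]

kappa_humans = cohen_kappa(human_1, human_2)  # human vs. human baseline
kappa_llm_h1 = cohen_kappa(llm, human_1)      # LLM vs. one human annotator

print(f"human-human kappa: {kappa_humans:.3f}")
print(f"LLM-human kappa:   {kappa_llm_h1:.3f}")
```

Under this reading, the LLM is a plausible teacher when its agreement with human annotators falls in the same range as the agreement between the humans themselves.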

Data Curation & Synthetic Data | Eval Frameworks & Benchmarks | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 27

Year: 2026

Venue: N/A
