This paper benchmarks the ability of six commercial LLMs (GPT-5-mini, GPT-5-chat, Claude Haiku 4.5, Llama 4 Maverick, Llama 3.3 70B, and Llama 3.1 8B) to generate legislative reasoning comparable to official Romanian Senate explanatory memoranda. Using LLM-as-Judge semantic scoring and text similarity metrics, the study finds a sharp two-tier gap between frontier models and open-weight models (Cohen's d larger than 1.4), with all models exhibiting task-dependent confabulation, particularly on politically idiosyncratic proposals. The authors introduce "cascading bounded rationality" to describe how failures compound across principals, agents, and evaluators, and they identify contextual ignorance, rather than stable ideological bias, as the key risk for legislators.
LLMs can mimic legislative reasoning, but their performance hinges on the proposal's idiosyncrasy, revealing a susceptibility to plausible-sounding confabulation that could mislead policymakers.
This paper evaluates whether commercial large language models (LLMs) can function as reliable political advisory tools by comparing their outputs against official legislative reasoning. Using a dataset of 15 Romanian Senate law proposals paired with their official explanatory memoranda (expuneri de motive), we test six LLMs spanning three provider families and multiple capability tiers: GPT-5-mini and GPT-5-chat (OpenAI); Claude Haiku 4.5 (Anthropic); and Llama 4 Maverick, Llama 3.3 70B, and Llama 3.1 8B (Meta). Each model generates predicted rationales, which are evaluated through a dual framework combining LLM-as-Judge semantic scoring and programmatic text similarity metrics. We frame the LLM-politician relationship through principal-agent theory and bounded rationality, conceptualizing the legislator as a principal delegating advisory tasks to a boundedly rational agent under structural information asymmetry. Results reveal a sharp two-tier structure: frontier models (Claude Haiku 4.5, GPT-5-chat, GPT-5-mini) achieve statistically indistinguishable semantic closeness scores above 4.6 out of 5.0, while open-weight models cluster a full tier below (Cohen's d larger than 1.4). However, all models exhibit task-dependent confabulation: they perform well on standardized legislative templates (e.g., EU directive transpositions) but generate plausible yet unfounded reasoning for politically idiosyncratic proposals. We introduce the concept of cascading bounded rationality to describe how failures compound across bounded principals, agents, and evaluators, and argue that the operative risk for legislators is not stable ideological bias but contextual ignorance shaped by training data coverage.
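The tier gap above is reported as a Cohen's d effect size. As a rough illustration of what that statistic measures, the sketch below computes the standard pooled-standard-deviation form of Cohen's d for two groups of scores. The function name `cohens_d` and the example numbers are assumptions made for illustration only; they are not taken from the paper, and the abstract does not state which variant of the statistic the authors used.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation.

    Hypothetical helper: the paper may use a different variant
    (for example, a small-sample correction).
    """
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Made-up scores, NOT data from the paper.
example_high = [2.0, 4.0, 6.0]
example_low = [1.0, 3.0, 5.0]
d = cohens_d(example_high, example_low)
```

In this form, d expresses the difference between the two group means in units of their pooled standard deviation, so a value above 1.4 means the group means sit more than 1.4 pooled standard deviations apart.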