Mar 10, 2026 · arXiv:2603.09595

Build, Borrow, or Just Fine-Tune? A Political Scientist's Guide to Choosing NLP Models

AI Summary

This paper investigates the trade-offs between building domain-specific NLP models from scratch, adapting existing models, and fine-tuning general-purpose models, specifically within the context of political science research. The author fine-tunes ModernBERT on the Global Terrorism Database (GTD) to create Confli-mBERT and compares its performance to ConfliBERT, a domain-specific pretrained model, for conflict event classification. Results show that while ConfliBERT achieves slightly higher overall accuracy (79.34% vs. 75.46%), Confli-mBERT performs comparably on high-frequency attack types, suggesting fine-tuning can be a viable alternative for many tasks.

Key Contribution

Don't build a domain-specific model just because you can: fine-tuning a general-purpose model can match a domain-specific one on high-frequency classes, with the accuracy gap concentrated in rare event categories, while saving significant resources.

Abstract

Political scientists increasingly face a consequential choice when adopting natural language processing tools: build a domain-specific model from scratch, borrow and adapt an existing one, or simply fine-tune a general-purpose model on task data? Each approach occupies a different point on the spectrum of performance, cost, and required expertise, yet the discipline has offered little empirical guidance on how to navigate this trade-off. This paper provides such guidance. Using conflict event classification as a test case, I fine-tune ModernBERT on the Global Terrorism Database (GTD) to create Confli-mBERT and systematically compare it against ConfliBERT, a domain-specific pretrained model that represents the current gold standard. Confli-mBERT achieves 75.46% accuracy compared to ConfliBERT's 79.34%. Critically, the four-percentage-point gap is not uniform: on high-frequency attack types such as Bombing/Explosion (F1 = 0.95 vs. 0.96) and Kidnapping (F1 = 0.92 vs. 0.91), the models are nearly indistinguishable. Performance differences concentrate in rare event categories comprising fewer than 2% of all incidents. I use these findings to develop a practical decision framework for political scientists considering any NLP-assisted research task: when does the research question demand a specialized model, and when does an accessible fine-tuned alternative suffice? The answer, I argue, depends not on which model is "better" in the abstract, but on the specific intersection of class prevalence, error tolerance, and available resources. The model, training code, and data are publicly available on Hugging Face.
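The abstract's central point, that a gap in overall accuracy can sit almost entirely in rare classes, can be illustrated with a minimal sketch. It is not the paper's evaluation code. The labels below are synthetic toy data (not drawn from the GTD), `rare_type` is a placeholder class name, and `model_a` and `model_b` are hypothetical prediction lists rather than outputs of ConfliBERT or Confli-mBERT.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Synthetic toy labels (assumption, NOT GTD data): one common class, one rare class.
y_true = ["Bombing/Explosion"] * 98 + ["rare_type"] * 2

# Hypothetical model A: recovers one of the two rare items.
model_a = ["Bombing/Explosion"] * 98 + ["rare_type", "Bombing/Explosion"]
# Hypothetical model B: always predicts the common class.
model_b = ["Bombing/Explosion"] * 100

for name, preds in [("A", model_a), ("B", model_b)]:
    print(name, round(accuracy(y_true, preds), 4), per_class_f1(y_true, preds))
```

In this toy setup the two hypothetical models differ by a single prediction, so their overall accuracy and their F1 on the common class are nearly identical, while their F1 on the rare class diverges sharply. Whether that divergence matters depends on how much the research question relies on the rare class, which is the trade-off the proposed decision framework addresses.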

Eval Frameworks & Benchmarks · Natural Language Processing · Open-Source Models & Weights

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
