Adobe Research | Apr 20, 2026 | arXiv:2604.17930

Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?

H S V N S Kowndinya Renduchintala, Sumit Bhatia

AI Summary

This paper investigates the formal linguistic competence of LLMs, focusing on why they struggle with certain grammatical constructions despite massive pre-training. The authors pre-train GPT-2 Small models on a sample of FineWeb and add 1% synthetic data targeting specific linguistic phenomena. Performance improves substantially on 8 of the 9 worst-performing BLiMP paradigms, suggesting that data scarcity, rather than architectural limitations, is a key bottleneck.
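The 1% augmentation described above can be pictured as a token-budget split. The following is a minimal sketch, not the authors' pipeline: the function name `mix_corpus`, the whitespace token count, and the choice to carve the synthetic share out of a fixed total budget (rather than add it on top) are all assumptions made for illustration.

```python
import random


def mix_corpus(base_docs, synthetic_docs, total_tokens,
               synthetic_fraction=0.01, seed=0):
    """Build a shuffled training mix under a fixed token budget.

    Assumption: tokens are counted by whitespace splitting, and the
    synthetic share is taken out of the total budget. Both are
    illustrative choices, not details confirmed by the paper.
    """
    rng = random.Random(seed)
    synth_budget = round(total_tokens * synthetic_fraction)
    base_budget = total_tokens - synth_budget

    def take(docs, budget):
        # Visit documents in random order and keep each one that still
        # fits in the remaining budget.
        picked, used = [], 0
        for doc in rng.sample(docs, len(docs)):
            n_tokens = len(doc.split())
            if used + n_tokens > budget:
                continue
            picked.append(doc)
            used += n_tokens
        return picked, used

    base, base_used = take(base_docs, base_budget)
    synth, synth_used = take(synthetic_docs, synth_budget)

    mixed = base + synth
    rng.shuffle(mixed)  # interleave synthetic text with web text
    return mixed, base_used, synth_used
```

At the scale in the paper, a 100M-token budget with a 1% share would leave roughly 1M tokens for the targeted synthetic sentences under this reading of the setup.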

Key Contribution

The paper argues that LLMs' grammatical failures stem largely from limited exposure to specific linguistic structures in the training data rather than from inherent architectural limitations: adding just 1% targeted synthetic data substantially improved 8 of the 9 worst-performing BLiMP paradigms, although one paradigm remained below chance.

Abstract

Large Language Models (LLMs) exhibit a puzzling disparity in their formal linguistic competence: while they learn some linguistic phenomena with near-perfect mastery, they often perform below chance on others, even after training on trillions of tokens. In this work, we investigate whether these failures stem from inherent architectural limitations or simply from the scarcity of these specific grammatical constructions in web-scale corpora. We pre-train GPT-2 Small (124M) models on a 100M-token random sample of the FineWeb corpus and intervene by injecting a minimal amount (1%) of synthetic data targeting specific linguistic phenomena. We find that this targeted intervention substantially improves model performance in 8 out of the 9 worst-performing BLiMP paradigms; notably, accuracy on one paradigm, only_npi_scope, surges from 20.9% to 69.4%. Furthermore, we observe that these interventions generally preserve or slightly improve aggregate performance. We also identify a resistant phenomenon, principle_A_c_command, whose performance remains below chance even after our data augmentation. Nevertheless, our findings serve as an optimistic existence proof that even small language models can substantially improve on linguistic phenomena on which models typically perform poorly, provided the pre-training data contains sufficient exposure to them. This suggests that efforts towards human-scale language modeling may benefit greatly from focusing on data composition. The code to reproduce our results is open-sourced at https://github.com/kowndinya-renduchintala/heterogeneity-in-formal-linguistic-competence.
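For readers unfamiliar with how "below chance" is measured: BLiMP paradigms consist of minimal pairs, and a model is credited with a pair when it assigns a higher probability to the grammatical sentence than to the ungrammatical one, so chance is 50%. The sketch below illustrates that scoring rule only. The sentences and the `toy_logprobs` table are hypothetical stand-ins; in practice `score` would be assumed to return a sentence's summed token log-probability under the pre-trained model.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the
    grammatical sentence receives the strictly higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical log-probabilities standing in for a real language model.
toy_logprobs = {
    "Only Bill would ever complain.": -12.0,
    "Even Bill would ever complain.": -15.0,
    "The girls admire themselves.": -20.0,
    "The girls admire herself.": -18.5,
}

toy_pairs = [
    ("Only Bill would ever complain.", "Even Bill would ever complain."),
    ("The girls admire themselves.", "The girls admire herself."),
]

accuracy = minimal_pair_accuracy(toy_pairs, toy_logprobs.__getitem__)
print(f"toy accuracy: {accuracy:.1%}")
```

Under this rule, an accuracy of 20.9% means the model preferred the ungrammatical sentence in most pairs, which is why such paradigms are described as below chance rather than merely unlearned.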

Data Curation & Synthetic Data | Natural Language Processing | Scaling Laws & Emergent Abilities

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
