This paper investigates whether language models trained on limited data, mimicking human developmental constraints, develop shared representations for filler-gap dependencies across different syntactic constructions. The authors use Distributed Alignment Search (DAS) to analyze LMs trained on varying amounts of BabyLM data, focusing on transfer between wh-questions and topicalization. The results indicate that while shared, item-sensitive mechanisms emerge, the models require significantly more data than humans, underscoring the importance of inductive biases in models of language acquisition.
LLMs may learn shared syntactic dependencies even with limited data, but they are still far more data-hungry than human toddlers.
For humans, filler-gap dependencies require a shared representation across different syntactic constructions. Although causal analyses suggest this may also be true for LLMs (Boguraev et al., 2025), it remains unclear whether such a representation also exists in language models trained on developmentally feasible quantities of data. We applied Distributed Alignment Search (DAS; Geiger et al., 2024) to LMs trained on varying amounts of data from the BabyLM challenge (Warstadt et al., 2023) to evaluate whether representations of filler-gap dependencies transfer between wh-questions and topicalization, two constructions that vary greatly in input frequency. Our results suggest that shared, yet item-sensitive, mechanisms may develop with limited training data. More importantly, LMs still require far more data than humans to learn comparable generalizations, highlighting the need for language-specific biases in models of language acquisition.
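As a rough illustration of the method (not the authors' code), the sketch below shows a DAS-style interchange intervention in PyTorch: an orthogonal rotation is learned over a hidden layer, and the first k rotated coordinates, hypothesized to encode the filler-gap variable, are swapped between a source and a base activation. The class name, `hidden_dim`, and `k` are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DASIntervention(nn.Module):
    """Sketch of a DAS interchange intervention: rotate base and source
    activations, replace the first k rotated coordinates of the base with
    the source's, then rotate back into the model's original basis."""

    def __init__(self, hidden_dim: int, k: int):
        super().__init__()
        self.k = k
        linear = nn.Linear(hidden_dim, hidden_dim, bias=False)
        # Parametrization keeps the weight matrix orthogonal during training.
        self.rotation = nn.utils.parametrizations.orthogonal(linear)

    def forward(self, h_base: torch.Tensor, h_source: torch.Tensor) -> torch.Tensor:
        R = self.rotation.weight                         # (d, d), orthogonal
        rb = h_base @ R.T                                # base in rotated basis
        rs = h_source @ R.T                              # source in rotated basis
        mixed = torch.cat([rs[..., :self.k], rb[..., self.k:]], dim=-1)
        return mixed @ R                                 # invert rotation (R^-1 = R^T)


# Hypothetical usage: h_base taken from a topicalized sentence and h_source
# from a wh-question at the same layer and token position; dimensions are
# illustrative.
das = DASIntervention(hidden_dim=768, k=32)
h_base, h_source = torch.randn(1, 768), torch.randn(1, 768)
h_intervened = das(h_base, h_source)
```

In training, only the rotation would be optimized, with the LM frozen, so that the intervened model reproduces the counterfactual output, i.e., the behavior expected if the filler-gap variable took the source sentence's value; transfer between constructions could then be probed by fitting the rotation on one construction and testing it on the other.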