Mar 9, 2026 | arXiv:2603.08359

Computational modeling of early language learning from acoustic speech and audiovisual input without linguistic priors

AI Summary

This chapter reviews computational models of early language acquisition, focusing on self-supervised and visually grounded approaches that minimize linguistic priors. It highlights the growing ability of these models to learn various aspects of speech and to explain early language development through shared learning principles. The review also discusses how simulations are becoming more realistic, both in their input data and in how model behavior is linked to empirical findings on infant language development.

Key Contribution

Self-supervised and visually grounded models are closing the gap in explaining how infants learn language from raw acoustic and visual input, challenging the need for strong linguistic priors.

Abstract

Learning to understand speech appears almost effortless for typically developing infants, yet from an information-processing perspective, acquiring a language from acoustic speech is an enormous challenge. This chapter reviews recent developments in using computational models to understand early language acquisition from speech and audiovisual input. The focus is on self-supervised and visually grounded models of perceptual learning. We show how these models are becoming increasingly powerful in learning various aspects of speech without strong linguistic priors, and how many features of early language development can be explained through a shared set of learning principles, which are broadly compatible with multiple theories of language acquisition and human cognition. We also discuss how modern learning simulations are gradually becoming more realistic, both in terms of input data and in linking model behavior to empirical findings on infant language development.

Multimodal Models | Natural Language Processing | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 2

Year: 2026

Venue: N/A
