Institute for Logic · Radboud · UvA · Apr 2, 2026 · arXiv:2604.02043

Tracking the emergence of linguistic structure in self-supervised models learning from speech

Marianne de Heer Kloots, Martijn Bentum, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem H. Zuidema

AI Summary

This paper investigates when linguistic structure emerges in six self-supervised speech models (Wav2Vec2 and HuBERT) trained on spoken Dutch. By analyzing layerwise patterns and learning trajectories across layers and intermediate checkpoints, the study finds that different levels of linguistic structure emerge differently, which is partially explained by their degree of abstraction from the acoustic signal and by the timescale over which input information is integrated. The level at which the pre-training objective is defined also strongly affects both layerwise organization and learning trajectories, with higher-order prediction tasks (iteratively refined pseudo-labels) inducing greater parallelism.

Key Contribution

Self-supervised speech models don't learn all linguistic features at once: the *order* in which they learn depends heavily on the pre-training objective and the level of abstraction of the linguistic feature.

Abstract

Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).
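One way to picture an analysis "across layers and intermediate checkpoints" is a grid of probe scores, one per (checkpoint, layer) pair. The sketch below is only an illustration under stated assumptions: the abstract does not say which analysis method the authors use, so a ridge-regularized linear probe is swapped in as a common stand-in, and all function names, array shapes, and the synthetic data are hypothetical. The synthetic features are constructed so that the label signal grows with training step and layer depth; they are not results from the paper.

```python
import numpy as np


def probe_accuracy(features, labels, ridge=1e-3, train_fraction=0.8, seed=0):
    """Fit a ridge-regularized linear probe on one-hot targets; return held-out accuracy."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    split = int(train_fraction * len(labels))
    train, test = order[:split], order[split:]

    n_classes = int(labels.max()) + 1
    targets = np.eye(n_classes)[labels[train]]

    # Append a constant column so the probe can learn a bias term.
    x_train = np.hstack([features[train], np.ones((len(train), 1))])
    x_test = np.hstack([features[test], np.ones((len(test), 1))])

    # Closed-form ridge regression: (X^T X + ridge * I) W = X^T Y.
    gram = x_train.T @ x_train + ridge * np.eye(x_train.shape[1])
    weights = np.linalg.solve(gram, x_train.T @ targets)

    predictions = (x_test @ weights).argmax(axis=1)
    return float((predictions == labels[test]).mean())


def probe_grid(hidden_states, labels):
    """Map each checkpoint step to a list of per-layer probe accuracies.

    hidden_states: dict of step -> array of shape (n_layers, n_items, dim).
    """
    return {
        step: [probe_accuracy(layer, labels) for layer in states]
        for step, states in sorted(hidden_states.items())
    }


def make_synthetic_states(steps=(0, 1000, 10000), n_layers=4, n_items=300,
                          dim=16, n_classes=3, seed=0):
    """Hypothetical stand-in data: label signal grows with step and layer depth."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_classes, size=n_items)
    signal = np.zeros((n_items, dim))
    signal[np.arange(n_items), labels] = 1.0

    states = {}
    for i, step in enumerate(steps):
        progress = i / max(len(steps) - 1, 1)
        layers = []
        for layer in range(n_layers):
            strength = 5.0 * progress * (layer + 1) / n_layers
            layers.append(strength * signal + rng.normal(size=(n_items, dim)))
        states[step] = np.stack(layers)
    return states, labels


if __name__ == "__main__":
    states, labels = make_synthetic_states()
    for step, accuracies in probe_grid(states, labels).items():
        print(step, [round(a, 2) for a in accuracies])
```

With real models, the `hidden_states` dictionary would instead hold per-layer activations extracted from each saved checkpoint, and `labels` would come from an annotated linguistic feature; how those are obtained is outside what the abstract specifies.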

Interpretability & Mechanistic Interp · Natural Language Processing · Speech & Audio

Year: 2026

Venue: N/A
