This paper introduces a factor-partitioned embedding framework for speech that disentangles attributes such as linguistic content, speaker identity, and dialect into distinct subspaces of a single embedding vector. Each subspace is trained either by distillation from a specialist teacher or by contrastive learning. The resulting embeddings enable attribute-conditioned retrieval through signed weighted sums of per-axis cosine similarities. Experiments on cross-corpus retrieval demonstrate the framework's ability to suppress speaker bias and retrieve semantically matched utterances across diverse recording conditions.
Stop letting speaker identity drown out semantic similarity: this new embedding method lets you independently control the influence of different speech attributes when comparing utterances.
Speech encodes multiple simultaneous attributes (linguistic content, speaker identity, dialect, gender) that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval: similarity is computed as a signed weighted sum over per-axis cosine scores, allowing retrieval that jointly considers what was said and how, or explicitly suppresses one attribute to surface another. We evaluate on cross-corpus retrieval over corpora sharing the Harvard sentence prompts, demonstrating that signed axis weighting can suppress same-speaker bias and surface semantically matched utterances across recording conditions.
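
The retrieval score described in the abstract, a signed weighted sum of per-axis cosine similarities, can be illustrated with a minimal sketch. This is not the paper's implementation: the axis names, the slice boundaries in `AXES`, the toy vectors, and the helper names `cosine`, `partitioned_similarity`, and `rank` are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout: axis name -> (start, end) slice of the embedding.
# The paper's real subspace sizes are not given here; these are toy values.
AXES: Dict[str, Tuple[int, int]] = {
    "content": (0, 4),
    "speaker": (4, 6),
    "dialect": (6, 8),
}


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def partitioned_similarity(
    x: Sequence[float],
    y: Sequence[float],
    weights: Dict[str, float],
    axes: Dict[str, Tuple[int, int]] = AXES,
) -> float:
    """Signed weighted sum of per-axis cosine scores.

    A positive weight rewards agreement on an axis, a negative weight
    penalizes it, and an axis absent from `weights` is ignored.
    """
    score = 0.0
    for name, weight in weights.items():
        start, end = axes[name]
        score += weight * cosine(x[start:end], y[start:end])
    return score


def rank(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    weights: Dict[str, float],
) -> List[int]:
    """Return candidate indices sorted by descending similarity."""
    scored = [
        (partitioned_similarity(query, c, weights), i)
        for i, c in enumerate(candidates)
    ]
    return [i for _, i in sorted(scored, key=lambda pair: -pair[0])]


# Toy example: candidate 0 shares the query's speaker but not its content;
# candidate 1 shares the content but comes from a different speaker.
query = [1, 0, 0, 0, 1, 0, 1, 0]
same_speaker = [0, 1, 0, 0, 1, 0, 1, 0]
same_sentence = [1, 0, 0, 0, 0, 1, 1, 0]
candidates = [same_speaker, same_sentence]

speaker_heavy = rank(query, candidates, {"content": 1.0, "speaker": 2.0})
speaker_suppressed = rank(query, candidates, {"content": 1.0, "speaker": -0.5})
```

In this toy setup, weighting the speaker axis heavily ranks the same-speaker utterance first, while a negative speaker weight ranks the same-sentence utterance first. That mirrors, in miniature, the same-speaker bias suppression the abstract describes.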