36 papers published across 5 labs.
Unlock the potential of full-duplex speech language models with Sommelier, a new open-source pipeline that tackles the messy reality of multi-speaker conversations.
Foundation models trained on audio, general time series, and brain signals can be distilled into a single, powerful encoder for scientific time series, unlocking performance gains on par with task-specific training.
Multi-corpus training can actually *hurt* spoofing detection, unless you strip out dataset-specific biases with this clever domain-invariant feature extraction trick.
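The "trick" here is in the family of domain-adversarial training: a shared encoder feeds both the spoofing classifier and an adversarial corpus classifier, with a gradient-reversal layer pushing the encoder toward corpus-invariant features. A minimal PyTorch sketch of that general recipe, not the paper's exact architecture; dimensions and head names are illustrative.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on backward,
    so the encoder learns to FOOL the corpus classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DomainInvariantSpoofNet(nn.Module):
    """Hypothetical dimensions and heads, for illustration only."""
    def __init__(self, in_dim=80, feat_dim=128, n_corpora=4, lam=1.0):
        super().__init__()
        self.lam = lam
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.spoof_head = nn.Linear(feat_dim, 2)           # bona fide vs. spoof
        self.corpus_head = nn.Linear(feat_dim, n_corpora)  # adversary

    def forward(self, x):
        z = self.encoder(x)
        # The corpus head trains normally, but the reversed gradient pushes
        # the encoder toward features that carry no dataset fingerprint.
        return self.spoof_head(z), self.corpus_head(GradReverse.apply(z, self.lam))
```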
Unsupervised phoneme discovery from self-supervised speech models is surprisingly viable, but language-specific challenges remain a significant hurdle.
Text-only pre-training leaves different LLMs with surprisingly different levels of latent auditory knowledge, which directly impacts their effectiveness as backbones for audio language models.
Synthesizing realistic room acoustics from a single recording is now possible, thanks to a novel flow-matching approach that captures the uncertainty inherent in acoustic environments.
Ditch one-hot vectors: representing facial action units as natural language unlocks more realistic and nuanced facial expression synthesis, especially when dealing with conflicting muscle movements.
You can predict viewers' engagement with and attraction to a video lecture just by analyzing the speaker's face and voice, no audience data needed.
Current OmniLLMs stumble when processing real-world, long-form audio-visual content, achieving only ~35-65% accuracy on a new benchmark designed to test long-term memory and fine-grained understanding.
Foundation models for EEG can now be 377x more efficient and handle 12x longer sequences, thanks to a novel Mamba-based architecture that also cracks the code for handling variable electrode setups.
LALMs still struggle to get the joke, with a new benchmark showing they can't reliably recognize, locate, or understand audio puns.
Humanoid robots can now generate more empathetic and instruction-aware gestures thanks to a new diffusion framework conditioned on affective estimation and pedagogical reasoning.
Training a DNN to recover a reverberant signal from an even *more* reverberant version surprisingly yields a model that also dereverberates the original signal.
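One way to read this result: the training pairs can be built self-supervised, Noisier2Noise-style, by convolving already-reverberant recordings with extra room impulse responses and asking the network to undo only the added reverb. A sketch of that pair construction, assuming a bank of RIRs is available; the paper's exact recipe may differ.

```python
import numpy as np
from scipy.signal import fftconvolve

def make_training_pair(reverberant, rir):
    """Target = the reverberant recording we already have; input = the
    same recording passed through an EXTRA room impulse response,
    i.e. an even more reverberant version."""
    more_reverberant = fftconvolve(reverberant, rir)[: len(reverberant)]
    more_reverberant /= np.abs(more_reverberant).max() + 1e-8  # loudness guard
    return more_reverberant, reverberant

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)                    # stand-in reverberant clip
rir = np.exp(-np.linspace(0, 8, 4000)) * rng.standard_normal(4000)
noisy_in, target = make_training_pair(x, rir)
# At inference, the trained DNN is applied to the ORIGINAL reverberant
# signal, where (per the paper) it reduces reverberation as well.
```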
Achieve state-of-the-art joint audio-video generation with fewer resources by fixing key flaws in cross-modal context handling within dual-stream transformers.
SLMs are shockingly vulnerable: combining adversarial audio and text unlocks 1.5x to 10x higher jailbreak rates than attacking either modality alone.
Unlock scalable cardio-sleep insights by repurposing ubiquitous single-lead ECG data for accurate sleep phenotyping, rivalling resource-intensive polysomnography.
Achieve controllable and scalable speech generation with MOSS-TTS, enabling zero-shot voice cloning and long-form synthesis.
LLMs can extract consistent, multidimensional semantic information directly from the phonological structure of language, revealing a non-arbitrary relationship between sound and meaning.
Spotify's GLIDE model proves that generative LLMs can drive significant gains in podcast discovery and non-habitual listening in a real-world, production environment.
Counterintuitively, better speech recognition unlocks highly accurate Alzheimer's detection from simple text analysis, outperforming more complex acoustic models.
Stop struggling with the stability-plasticity dilemma in multilingual Speech-LLMs: Zipper-LoRA dynamically disentangles LoRA updates to boost low-resource ASR without sacrificing cross-lingual transfer.
Achieve single-pass alignment of multi-talker speech – a feat previously impossible – by modeling overlaps as shuffles.
Finally, a unified framework lets you control both facial appearance and voice timbre for personalized audio-video generation across multiple identities.
Interactive avatars can now exhibit more emotionally appropriate and contextually aware facial behaviors thanks to a novel architecture that disentangles audio-driven lip movements from user-driven non-lip facial expressions.
Adversarial training can effectively disentangle session-specific noise from task-relevant speech features in brain-computer interfaces, leading to more robust decoding across recording sessions.
Grounding LALM reasoning in diverse, reliability-weighted acoustic evidence blows away the competition in Audio Question Answering, proving that verifiable chains beat black boxes.
Sound source localization gets a reliability upgrade: conformal prediction delivers uncertainty estimates, even when you don't know how many speakers are talking.
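Split conformal prediction is technique-agnostic, which is what makes this work: score a held-out calibration set, take a quantile, and you get prediction intervals with distribution-free coverage. A minimal sketch for direction-of-arrival estimates; the angular score and coverage level here are placeholders, not the paper's setup.

```python
import numpy as np

def angular_error(pred_deg, true_deg):
    """Smallest absolute angle (degrees) between predicted and true DoA."""
    d = np.abs(pred_deg - true_deg) % 360.0
    return np.minimum(d, 360.0 - d)

def conformal_radius(cal_pred, cal_true, alpha=0.1):
    """Split conformal: the finite-sample-corrected (1 - alpha) quantile of
    calibration errors gives a radius whose interval around any new estimate
    covers the true DoA with probability >= 1 - alpha (exchangeable data)."""
    scores = angular_error(np.asarray(cal_pred), np.asarray(cal_true))
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

# Prediction set for a new estimate theta_hat:
# [theta_hat - radius, theta_hat + radius] (mod 360).
```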
Mimicking human cognition, FLAIR lets dialogue models "think while listening," boosting performance without adding latency.
Pre-training on nasal vs. oral context lets a simple model beat large pre-trained speech models at detecting speech disorders in noisy, real-world settings.
Even when visual data is missing or noisy, EgoAdapt accurately determines who is talking to the camera wearer by adaptively integrating head orientation, lip movement, and robust audio features.
Acoustic and phonetic NACs encode accent in fundamentally different ways, with implications for how we interpret and manipulate these representations.
Control the emotional tone of generated speech without any training by directly manipulating specific neurons within large audio-language models.
Imagine seeing your tongue move in real-time based on the sounds you make – AURORA brings that closer to reality.
Audio backdoor attacks leave a tell: triggers are surprisingly stable to destructive noise but fragile to meaning-preserving changes.
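That asymmetry suggests a simple probe: hit a suspect input with destructive noise and with a meaning-preserving transform, and check whether the prediction survives each. A hedged sketch of such a test, not the paper's detector; `model`, the transforms, and the 0.8 threshold are all hypothetical placeholders.

```python
import numpy as np

def looks_like_trigger(model, audio, n_trials=10, snr_db=5.0, seed=0):
    """Flags a possible backdoor trigger: the prediction survives heavy
    additive noise (destructive) yet flips under a crude resampling
    speed/pitch change (meaning-preserving for most benign content).
    `model` is assumed to map a waveform to a class label."""
    rng = np.random.default_rng(seed)
    base = model(audio)

    # Destructive perturbation: strong white noise at the target SNR.
    noise_std = np.sqrt(np.mean(audio ** 2) / 10 ** (snr_db / 10))
    survives_noise = np.mean([
        model(audio + noise_std * rng.standard_normal(len(audio))) == base
        for _ in range(n_trials)
    ])

    # Meaning-preserving perturbation: naive 0.9x resample.
    idx = np.clip((np.arange(len(audio)) * 0.9).astype(int), 0, len(audio) - 1)
    survives_resample = model(audio[idx]) == base

    # Triggers, per the paper's observation: noise-stable, transform-fragile.
    return survives_noise > 0.8 and not survives_resample
```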
By explicitly modeling cardiac pathology, this ECG reconstruction method achieves a 76% reduction in error compared to existing techniques, promising more accurate diagnoses from portable devices.
Oral exams, previously impossible to scale, can now be delivered for pennies using voice AI, but controlling LLM behavior requires architectural guardrails, not just clever prompts.