100 papers published across 5 labs.
Cepstral smoothing can significantly reduce musical noise artifacts in blind source separation of speech mixtures.
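The teaser doesn't include the paper's algorithm; as a minimal sketch of the general technique it names, the snippet below smooths a per-frame spectral gain mask in the cepstral domain, attenuating the fast spectral fluctuations that cause musical noise (function name and parameter values are illustrative, not the paper's implementation):

```python
import numpy as np

def cepstral_smooth_gain(gain, n_keep=20):
    """Smooth a spectral gain mask for one STFT frame in the cepstral domain.

    gain: (n_freq,) nonnegative gains; n_keep: number of low-quefrency
    cepstral coefficients to keep. High-quefrency coefficients encode the
    rapid, randomly fluctuating spectral peaks ("musical" tones).
    """
    log_gain = np.log(np.maximum(gain, 1e-8))
    cepstrum = np.fft.irfft(log_gain, n=2 * (len(gain) - 1))
    cepstrum[n_keep:-n_keep] = 0.0          # discard high-quefrency detail
    smoothed_log = np.fft.rfft(cepstrum)[: len(gain)].real
    return np.exp(smoothed_log)

# Toy usage: a noisy, spiky mask becomes a smooth gain curve.
rng = np.random.default_rng(0)
mask = np.clip(rng.random(257), 0.05, 1.0)
smooth = cepstral_smooth_gain(mask)
```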
The optimal spectrogram configuration for audio and speech analysis hinges on a nuanced interplay between front-end feature representation and back-end classifier architecture, varying significantly across tasks.
Quantifying the divergence between real and synthetic phoneme distributions via Kullback-Leibler divergence can pinpoint the most vulnerable phonemes for detecting audio deepfakes.
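As a hedged illustration of the stated approach (the phoneme inventory and counts below are made up), per-phoneme contributions to KL(real || synthetic) can be ranked to flag which phonemes diverge most between real and generated speech:

```python
import numpy as np
from scipy.stats import entropy

def phoneme_kl(real_counts, synth_counts, eps=1e-8):
    """KL(real || synthetic) over a shared phoneme inventory.

    real_counts / synth_counts: dicts mapping phoneme -> count.
    Returns the total divergence plus each phoneme's contribution,
    so the most divergent phonemes can be ranked.
    """
    phonemes = sorted(set(real_counts) | set(synth_counts))
    p = np.array([real_counts.get(ph, 0) for ph in phonemes], float) + eps
    q = np.array([synth_counts.get(ph, 0) for ph in phonemes], float) + eps
    p, q = p / p.sum(), q / q.sum()
    per_phoneme = p * np.log(p / q)          # summands of the KL divergence
    return entropy(p, q), dict(zip(phonemes, per_phoneme))

kl, contrib = phoneme_kl({"AA": 120, "IY": 80, "S": 40},
                         {"AA": 100, "IY": 60, "S": 90})
worst = max(contrib, key=contrib.get)        # largest positive contribution
```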
Achieve controllable and scalable speech generation with MOSS-TTS, enabling zero-shot voice cloning and long-form synthesis.
LLMs can extract consistent, multidimensional semantic information directly from the phonological structure of language, revealing a non-arbitrary relationship between sound and meaning.
Spotify's GLIDE model proves that generative LLMs can drive significant gains in podcast discovery and non-habitual listening in a real-world, production environment.
Counterintuitively, better speech recognition unlocks accurate Alzheimer's detection from simple text analysis, outperforming more complex acoustic models.
Stop struggling with the stability-plasticity dilemma in multilingual Speech-LLMs: Zipper-LoRA dynamically disentangles LoRA updates to boost low-resource ASR without sacrificing cross-lingual transfer.
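Zipper-LoRA's internals aren't given in the teaser; purely as a rough illustration of "disentangling LoRA updates" (all names, shapes, and the gating scheme below are hypothetical), one could route each input through a shared cross-lingual adapter and a language-specific adapter, mixed by a per-language gate:

```python
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    """Frozen base linear layer plus two low-rank adapters: one shared
    (cross-lingual) and one language-specific, mixed by a learned gate.
    Illustrative sketch only, not the paper's architecture."""

    def __init__(self, dim, rank=8, n_langs=4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)   # keep pretrained weights fixed
        self.base.bias.requires_grad_(False)
        self.shared_a = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.shared_b = nn.Parameter(torch.zeros(dim, rank))
        self.lang_a = nn.Parameter(torch.randn(n_langs, rank, dim) * 0.01)
        self.lang_b = nn.Parameter(torch.zeros(n_langs, dim, rank))
        self.gate = nn.Parameter(torch.zeros(n_langs))  # per-language mix

    def forward(self, x, lang_id):
        shared = (x @ self.shared_a.T) @ self.shared_b.T
        lang = (x @ self.lang_a[lang_id].T) @ self.lang_b[lang_id].T
        g = torch.sigmoid(self.gate[lang_id])
        return self.base(x) + g * lang + (1 - g) * shared

layer = DualLoRALinear(dim=64)
y = layer(torch.randn(2, 64), lang_id=1)
```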
Achieve single-pass alignment of multi-talker speech – a feat previously impossible – by modeling overlaps as shuffles.
Finally, a unified framework lets you control both facial appearance and voice timbre for personalized audio-video generation across multiple identities.
Interactive avatars can now exhibit more emotionally appropriate and contextually aware facial behaviors thanks to a novel architecture that disentangles audio-driven lip movements from user-driven non-lip facial expressions.
Adversarial training can effectively disentangle session-specific noise from task-relevant speech features in brain-computer interfaces, leading to more robust decoding across recording sessions.
Grounding LALM reasoning in diverse, reliability-weighted acoustic evidence blows away the competition in Audio Question Answering, proving that verifiable chains beat black boxes.
Sound source localization gets a reliability upgrade: conformal prediction delivers uncertainty estimates, even when you don't know how many speakers are talking.
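For readers unfamiliar with conformal prediction, here is a minimal split-conformal sketch applied to direction-of-arrival estimates (the angular score and synthetic data are illustrative assumptions, not the paper's setup): calibrate a radius on held-out data so that "true azimuth within prediction ± r" holds with the desired coverage.

```python
import numpy as np

def angular_error(pred, true):
    """Absolute angular difference in degrees, wrapped to [0, 180]."""
    d = np.abs(pred - true) % 360.0
    return np.minimum(d, 360.0 - d)

def conformal_doa_radius(cal_pred, cal_true, alpha=0.1):
    """Split conformal calibration for DOA estimates.

    cal_pred / cal_true: predicted and ground-truth azimuths (degrees)
    on a held-out calibration set. Returns radius r such that the true
    angle lies within pred +/- r with ~(1 - alpha) coverage.
    """
    scores = angular_error(cal_pred, cal_true)       # nonconformity scores
    n = len(scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n           # finite-sample quantile
    return np.quantile(scores, min(q, 1.0), method="higher")

rng = np.random.default_rng(1)
true = rng.uniform(0, 360, 500)
pred = (true + rng.normal(0, 5, 500)) % 360
r = conformal_doa_radius(pred, true, alpha=0.1)      # ~90% coverage radius
```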
Mimicking human cognition, FLAIR lets dialogue models "think while listening," boosting performance without adding latency.
Pre-training on nasal vs. oral context lets a simple model beat large pre-trained speech models at detecting speech disorders in noisy, real-world settings.
Even when visual data is missing or noisy, EgoAdapt accurately determines who is talking to the camera wearer by adaptively integrating head orientation, lip movement, and robust audio features.
Acoustic and phonetic NACs encode accent in fundamentally different ways, with implications for how we interpret and manipulate these representations.
Control the emotional tone of generated speech without any training by directly manipulating specific neurons within large audio-language models.
Imagine seeing your tongue move in real-time based on the sounds you make – AURORA brings that closer to reality.
Audio backdoor attacks leave a tell: triggers are surprisingly stable to destructive noise but fragile to meaning-preserving changes.
By explicitly modeling cardiac pathology, this ECG reconstruction method achieves a 76% reduction in error compared to existing techniques, promising more accurate diagnoses from portable devices.
Oral exams, previously impossible to scale, can now be delivered for pennies using voice AI, but controlling LLM behavior requires architectural guardrails, not just clever prompts.
Jointly training audio watermarking and source separation unlocks robust multi-stream watermarking, enabling independent tracking of individual audio components within a mix.
Ditch the separate models: CAST-TTS uses a single cross-attention mechanism to control TTS timbre from both speech and text, rivaling specialized models in quality.
Forget one-hot encodings: conditioning timbre VAEs on continuous perceptual features unlocks more compact and controllable latent spaces.
By forcing a model to reconstruct aggressively masked EEG spectrograms, SpecMoE learns intricate neural patterns across both high- and low-frequency domains, leading to state-of-the-art cross-species EEG decoding.
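SpecMoE's exact recipe isn't reproduced here; below is a generic masked-reconstruction training step for EEG spectrograms in the spirit the teaser describes, with aggressive random masking and the loss computed only on masked bins (the stand-in model and shapes are assumptions):

```python
import torch
import torch.nn as nn

def masked_recon_step(model, spec, mask_ratio=0.75):
    """One masked-reconstruction step on a batch of EEG spectrograms.

    spec: (batch, freq, time) log-power spectrograms. A random mask
    hides most time-frequency bins; the model must reconstruct them,
    and the MSE loss is taken on the masked bins only.
    """
    mask = torch.rand_like(spec) < mask_ratio        # True = hidden
    corrupted = spec.masked_fill(mask, 0.0)
    recon = model(corrupted)
    loss = ((recon - spec)[mask] ** 2).mean()        # MSE on masked bins
    return loss

# Toy stand-in for the encoder-decoder (SpecMoE itself is not shown).
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 64, 32 * 64),
                      nn.Unflatten(1, (32, 64)))
loss = masked_recon_step(model, torch.randn(8, 32, 64))
loss.backward()
```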
PyPhonPlan offers a new open-source toolkit to simulate speech dynamics with neurally-grounded representations, enabling researchers to model interactive speech production and perception loops.
ASR-assisted transcription doesn't automatically improve accuracy in corpus creation, and its effectiveness hinges on factors like workflow design and transcriber expertise.
Unlock timbre-aware generative AI with a new dataset linking semantic descriptors to electric guitar sounds, enabling nuanced control over audio synthesis.
Unfolding the EM algorithm into a neural network yields a speaker localization method that's more robust and accurate than traditional Batch-EM, especially in challenging acoustic conditions.
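As a sketch of what "unfolding EM into a network" means in general (the likelihood model and learnable temperatures below are illustrative, not the paper's design), each EM iteration becomes a layer with its own trainable parameters, so a fixed number of E/M steps can be optimized end to end:

```python
import torch
import torch.nn as nn

class UnfoldedEM(nn.Module):
    """EM for mixture weights over candidate source directions,
    unrolled into K 'layers' with a learnable temperature per layer."""

    def __init__(self, n_iters=5):
        super().__init__()
        self.temps = nn.Parameter(torch.ones(n_iters))  # one per iteration

    def forward(self, log_lik):
        """log_lik: (frames, n_candidates) per-frame log-likelihood of
        each candidate direction. Returns estimated mixture weights."""
        frames, n_cand = log_lik.shape
        w = torch.full((n_cand,), 1.0 / n_cand)
        for t in self.temps:
            # E-step: responsibilities, sharpened by a learned temperature.
            post = torch.softmax(t * (log_lik + torch.log(w)), dim=-1)
            # M-step: re-estimate direction weights from responsibilities.
            w = post.mean(dim=0).clamp_min(1e-8)
        return w

model = UnfoldedEM()
weights = model(torch.randn(100, 36))   # 36 candidate azimuths
```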
A shared encoder for targeted sound detection leaps past prior art, achieving a new state-of-the-art F1 score of 83.15% on URBAN-SED while simplifying the model architecture.
Stealthier over-the-air adversarial attacks on speech recognition are possible, but require careful balancing of audibility and effectiveness.
SER models, often assumed to generalize well to synthesized speech, actually fail miserably, revealing their reliance on spurious correlations rather than genuine emotional understanding.
Ditch DOA estimation: this new target speaker extraction method uses HRTFs to preserve spatial audio cues and boost speech quality.
A new spoken user simulator, SpokenUS, trained on a large-scale dataset, finally captures the messiness of real human conversation, including barge-ins and disfluencies, to better train dialogue agents.
Current Omni-modal LLMs can ace perception tasks but still fail at basic social interactions like knowing when and how to jump into a conversation.
A new smartphone protocol enables large-scale, privacy-preserving collection of prosodic speech data in the wild, opening doors to studying the subtle emotional nuances in everyday communication.
SpeechLLMs can be made significantly faster and more accurate at question answering by explicitly training their attention mechanisms to focus on relevant evidence.
OmniSONAR halves cross-lingual search error on FLORES and cuts it 15-fold on BIBLE, proving that truly universal sentence embeddings across thousands of languages and modalities are now within reach.
Get competitive multilingual ASR performance with 6x smaller models and 200x less training cost by using balanced fine-tuning and implicit language learning.
Robots can now use real-time environmental sounds to guide manipulation tasks, thanks to a new framework that overcomes the "Blind Execution Interval" of traditional vision-language-action models.
Speaker diarization in movies and TV shows just got a whole lot better, thanks to a new multimodal framework that uses visual cues, speech, and subtitles to handle the chaos of open-world video.
Forget painstakingly aligning audio and video – this diffusion model learns to generate them jointly, opening the door to more realistic and immersive multimodal experiences.
Forget static domain priors: the best way to rate AI-generated audio quality depends on *which* aspect of quality you're measuring.
An agentic framework slashes entity recognition errors in ASR by up to 46% by intelligently combining multiple ASR hypotheses and constrained LLM correction.
Recovering synthesizer parameters directly from audio is now possible with Instrumental, a system that combines a differentiable synthesizer with evolutionary optimization, opening new avenues for timbral analysis and manipulation.
By using text as an anchor, this model achieves state-of-the-art emotional mimicry intensity estimation, even when visual and acoustic data are noisy or missing.
A 97% accurate Romansh idiom classifier unlocks idiom-aware NLP tools for a low-resource language.
Speech enhancement doesn't always improve audio deepfake detection; in fact, algorithms that *reduce* perceptual speech quality can paradoxically lead to better spoof detection in noisy environments.
A new 320-hour corpus of French speech reveals how pronunciation has changed over six decades, including the surprising finding that voice pitch evolution doesn't differ by gender.
Efficient attention mechanisms like RetNet and LightNet can speed up Speech Emotion Recognition by an order of magnitude, but at the cost of some accuracy compared to standard self-attention.
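The order-of-magnitude speedup comes from replacing the quadratic softmax attention with a kernelized linear form. The sketch below contrasts the two (using the elu(x)+1 feature map of Katharopoulos et al. as a generic example; RetNet and LightNet differ in detail but share the O(N) structure):

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    """Standard attention: O(N^2) in sequence length N."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention: O(N) in sequence length."""
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = k.transpose(-2, -1) @ v            # (d, d) summary, never (N, N)
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps
    return (q @ kv) / z

x = torch.randn(1, 1000, 64)                # 1000 frames of SER features
out = linear_attention(x, x, x)             # cost grows linearly with frames
```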
By shifting the learning objective from direct spectral mapping to filter estimation based on inter-frame correlations, IF-CorrNet achieves state-of-the-art monaural speech dereverberation performance, particularly in real-world environments where generalization is critical.
MLLMs still can't handle time-sensitive multimodal reasoning, often failing to integrate auditory and visual cues effectively in dynamic environments like a 4D escape room.
Forget simply bolting on an LLM: this work reveals the surprisingly intricate dance between acoustic models and LLMs needed to unlock state-of-the-art speech recognition.
Finally, realistic and diverse listener reactions to speech can be automatically generated, moving beyond simple retrieval or LLM-driven approaches.
For live music performances, this work achieves zero-latency automatic music mixing using deep learning, a feat previously unachieved due to the challenges of acoustic bleed and synchronization constraints.
Current reward models for spoken dialogue systems are missing crucial paralinguistic and natural speech elements, but this new model closes the gap by operating directly on speech and outperforming existing audio LLMs.
Forget expensive ECG hardware: this dataset and benchmark show you can reconstruct clinically useful chest-lead ECGs from cheap vibrational sensors, but watch out for "hallucinated" heartbeats.
Achieve human-like full-duplex voice interactions with SoulX-Duplug, a plug-and-play module that slashes latency and improves turn management by acting as a semantic VAD.
Standardized evaluation of nonverbal vocalizations in TTS is now possible with NV-Bench, a new benchmark that treats NVs as communicative acts, not just acoustic artifacts.
Ditch hand-tuned beamformer combinations: a neural network with cross-attention learns spectrally coherent weights for improved target source extraction in noisy audio mixtures.
Overcome the scarcity of labeled data in dysarthric speech quality assessment with a novel data augmentation framework that leverages unlabeled data and outperforms state-of-the-art methods.
Ditch the text prompts: AC-Foley uses reference audio to synthesize video sound effects with unprecedented control, enabling timbre transfer and zero-shot generation.
Speech LLMs can now better understand your emotions: a new RL approach boosts paralinguistic understanding by 8-12% over state-of-the-art models.
Personalizing ASR for atypical speech gets a boost: pre-training on multi-speaker atypical data before speaker-specific fine-tuning significantly improves performance.
Rivaling English's GigaSpeech in scale, TAGARELA unlocks the potential for state-of-the-art Portuguese speech models with its nearly 9,000 hours of podcast audio.
Prompt engineering can significantly enhance ChatGPT's ability to provide balanced feedback and emotional support in ESL speaking practice, though culturally responsive teaching remains a challenge.
A new pipeline turns noisy, inconsistent open-source data into a 500-hour, high-quality Vietnamese ASR dataset, finally giving researchers a solid base for building better speech recognition.
Unleashing realistic 3D talking heads on *any* face scan, FreeTalk breaks free from template meshes and rigid topologies, even capturing nuanced emotional expressions.
Existing target speech extraction models falter when speech overlap varies, exhibiting suppression or residual interference, but VorTEX maintains high separation fidelity across a wide range of overlap ratios.
A sequential CNN-RNN architecture achieves 84% accuracy in classifying eight Nepali music genres, substantially outperforming classical machine learning methods and other deep learning architectures on a newly constructed dataset.
A new synthetic whispered speech corpus, WhispSynth, closes the data gap in text-to-whisper research by achieving naturalness scores on par with real recordings.
Current audio-language models are culturally tone-deaf: they can't even detect Persian poetry meter, despite crushing English speech tasks.
Text-derived "nudges" can steer the reasoning of speech-based AI models, boosting accuracy by up to 4.4% without any training.
Achieve accent normalization with interpretable and controllable accent strength by selectively reusing self-supervised speech tokens via masked discrete diffusion.
Persian poets exhibit distinct phonetic signatures that transcend meter and individual style, evolving across centuries with shifts in genre and literary context.
A flute-playing robot achieves automated fingering and register-dependent embouchure assistance without requiring human embouchure control, opening new avenues for musical instrument automation.
Injecting nonverbal cues like laughter and sighs into speech synthesis is now more expressive and natural, thanks to a novel training strategy that overcomes data scarcity.
Accented speech reveals perceptual biases in speech synthesis evaluation: listeners rate speakers with matching accents as more natural.
Achieve more natural and synchronized video dubbing by conditioning a discrete flow matching TTS model on facial expressions and cross-modal alignment.
Control speaking rate on the fly in your TTS system with VoXtream2, which hits 4x real-time speeds and 74ms latency.
Time-pooled dimension reshaping unlocks more efficient scaling of speaker verification models, achieving state-of-the-art accuracy on VoxCeleb1 at a fraction of the computational cost.
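The paper's exact operator isn't given in the teaser; one plausible reading, sketched below with illustrative shapes, pools adjacent frames and folds them into the channel axis, so later layers process shorter, wider sequences at roughly constant compute:

```python
import torch

def time_pool_reshape(x, pool=2):
    """Halve the time axis and fold the pooled frames into channels:
    (batch, time, dim) -> (batch, time // pool, dim * pool).
    Illustrative reading of 'time-pooled dimension reshaping'."""
    b, t, d = x.shape
    x = x[:, : t - t % pool]                 # drop ragged tail frames
    return x.reshape(b, -1, d * pool)

feats = torch.randn(4, 200, 256)             # 200 frames of speaker features
out = time_pool_reshape(feats)               # -> (4, 100, 512)
```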
Injecting user mood into music recommendation boosts perceived quality, proving that personalized listening experiences can be significantly improved by considering emotional state.
Despite the intuition that noisy environments should make models rely more on visual cues, AVSR models stubbornly cling to audio, even when it's heavily degraded.
Synthesize speech with unprecedented emotional control: a new causal training method lets you edit prosody "counterfactually" to express different emotions in the same utterance.
You can reliably decode frustration from facial muscle activity, even when people aren't speaking aloud.
LLMs are enabling silent speech interfaces to finally approach the word error rate threshold needed for real-world use by mapping fragmented physiological gestures into structured semantic latent spaces.
Ditch the heuristics: Hikari achieves state-of-the-art simultaneous speech translation by learning READ/WRITE decisions directly through a probabilistic WAIT token.
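Hikari's model isn't shown here, but the READ/WRITE protocol a WAIT token induces is simple to sketch: at each step the decoder either emits WAIT (read one more source frame) or a target token (write). The loop below uses a toy wait-k-style policy as a stand-in for the learned model; everything except the protocol itself is an assumption.

```python
WAIT = "<wait>"

def simultaneous_decode(step_fn, source, max_len=50):
    """Generic READ/WRITE loop driven by a WAIT token.

    step_fn(source_read, target_written) returns the next token:
    WAIT means READ (consume one more source item); any other token
    is a WRITE; None ends decoding.
    """
    read, written = 0, []
    while len(written) < max_len:
        tok = step_fn(source[:read], written)
        if tok is None:
            break
        if tok == WAIT:
            read = min(read + 1, len(source))   # READ one more source frame
        else:
            written.append(tok)                 # WRITE a target token
    return written

# Toy wait-k-style stand-in (k=2) for the learned policy.
src = list("hello")
def toy_policy(src_read, tgt_written):
    if len(src_read) < len(src) and len(src_read) < len(tgt_written) + 2:
        return WAIT
    if len(tgt_written) < len(src_read):
        return src_read[len(tgt_written)].upper()
    return None

out = simultaneous_decode(toy_policy, src)      # -> ['H','E','L','L','O']
```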
Forget rigid decision trees: a dynamically orchestrated agent slashes multimodal query processing costs by 67% while boosting speed and reducing rework.
Achieve state-of-the-art emotion recognition by fusing visual and audio cues with a bi-directional cross-attention mechanism, outperforming unimodal approaches.
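A generic version of bi-directional cross-attention fusion is easy to write down (the block structure, pooling, and class count below are assumptions, not the paper's exact architecture): audio queries video, video queries audio, and the two attended streams are pooled and concatenated for classification.

```python
import torch
import torch.nn as nn

class BiCrossAttentionFusion(nn.Module):
    """Fuse audio and visual sequences with two cross-attention passes."""

    def __init__(self, dim=256, heads=4, n_classes=7):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, audio, video):
        # audio: (B, Ta, dim), video: (B, Tv, dim)
        a_att, _ = self.a2v(audio, video, video)   # audio queries video
        v_att, _ = self.v2a(video, audio, audio)   # video queries audio
        pooled = torch.cat([a_att.mean(1), v_att.mean(1)], dim=-1)
        return self.head(pooled)

model = BiCrossAttentionFusion()
logits = model(torch.randn(2, 100, 256), torch.randn(2, 50, 256))
```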
LLMs can't tell when to shut up in multi-party conversations, but fine-tuning with reasoning traces can teach them some manners.
Achieve a 62.7% BLEU score boost in speech emotion captioning by offloading only the trickiest parts of the problem to the cloud.
Fine-tuning LALMs on just the right layers, guided by layer-wise analysis, unlocks better paralinguistic understanding than naively fine-tuning everything.
Forget MOS: a new preference-based metric, AnimeScore, finally cracks the code for automatically evaluating "anime-like" speech with 90.8% AUC.
By explicitly modeling and adapting to the reliability of audio and visual signals at different interaction stages, SAGE achieves more stable emotion estimation under cross-modal noise and occlusion.
DINOv2 visual features and Wav2Vec 2.0 audio features can be effectively fused in a two-stage model to achieve state-of-the-art facial expression recognition in challenging, unconstrained video conditions.
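A minimal two-stage version of this pipeline is sketched below: frozen pretrained encoders extract clip-level features, then a lightweight classifier is trained on their concatenation. The specific checkpoints, pooling, and class count are illustrative assumptions, not necessarily the paper's choices.

```python
import torch
from transformers import Wav2Vec2Model

# Stage 1: frozen pretrained encoders (checkpoints are illustrative).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

@torch.no_grad()
def extract(frames, waveform):
    """frames: (T, 3, 224, 224) face crops; waveform: (1, samples) at 16 kHz."""
    vis = dino(frames).mean(0)                        # (384,) clip-level visual
    aud = w2v(waveform).last_hidden_state.mean(1)[0]  # (768,) clip-level audio
    return torch.cat([vis, aud])                      # (1152,) fused feature

# Stage 2: a lightweight classifier trained on the fused features.
clf = torch.nn.Linear(384 + 768, 7)                   # e.g. 7 expression classes
feat = extract(torch.randn(8, 3, 224, 224), torch.randn(1, 16000))
logits = clf(feat)
```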
Expert-corrected phonetic transcriptions can approach the performance of MFCCs for vocal tract reconstruction from speech, suggesting phonetic information is a viable alternative to acoustic features.
You can reconstruct vocal tract shapes from clean speech almost as well as from noisy MRI recordings, opening the door to more practical articulatory analysis.
SEMamba++ significantly improves speech restoration by cleverly integrating frequency-domain inductive biases into a state-space model, outperforming existing methods while maintaining efficiency.