April 24 – May 1, 2026

Speech & Audio - Weekly Roundup

50 papers published across 5 labs.

Selected Labs publishing this week

Tsinghua AI (1) · NVIDIA (1) · Microsoft Research (1) · UW (1) · Amazon Science (1)

Top Papers

Apr 30, 2026

LS2N - Nantes University · 3w ago · also LIA - Avignon University, LIUM - Le Mans University, Nantes University

Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition

WER hides the real story: new metrics reveal how language model rescoring in ASR impacts grammatical correctness and semantic accuracy.

Thibault Bañeras-Roux, Mickaël Rouvier, Mickael Rouvier +210

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Nantes University · 3w ago · also Avignon University, LIA - Avignon University, LIUM - Le Mans University, LS2N - Nantes University

HATS: An Open Data Set Integrating Human Perception Applied to the Evaluation of Automatic Speech Recognition Metrics

Current ASR metrics, even those leveraging embeddings, fail to align with human perception of transcription quality, as revealed by a new human-annotated dataset.

Thibault Bañeras Roux, Thibault Bañeras-Roux, Jane Wottawa +4

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Apr 28, 2026

Venkata Pushpak Teja Menta · 3w ago

Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

Achieve near-native Indic TTS from a non-Indic base model at zero commercial-training-data cost by cleverly combining phoneme space unification, LoRA adaptation, and voice-prompt recovery.

Venkata Pushpak Teja Menta

Natural Language Processing Open-Source Models & Weights Speech & Audio

Venkata Pushpak Teja Menta · 3w ago

PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech

Commercial TTS systems nailing WER scores can still butcher Indic accents, especially retroflex articulation, and this new benchmark exposes exactly where they fail.

Venkata Pushpak Teja Menta

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

May 1, 2026

Venkata Pushpak Teja Menta · 3w ago

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

Speaker embeddings leak script information, especially when projecting Western voices into Indic scripts, but LASE fixes this with a language-adversarial training objective.

Venkata Pushpak Teja Menta

Natural Language Processing Red-Teaming & Adversarial Robustness Speech & Audio

All Papers (50)

May 1, 2026

Venkata Pushpak Teja Menta · 3w ago

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

Speaker embeddings leak script information, especially when projecting Western voices into Indic scripts, but LASE fixes this with a language-adversarial training objective.

Venkata Pushpak Teja Menta

Natural Language Processing Red-Teaming & Adversarial Robustness Speech & Audio

Apr 30, 2026

LS2N - Nantes University · 3w ago · also LIA - Avignon University, LIUM - Le Mans University, Nantes University

Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition

WER hides the real story: new metrics reveal how language model rescoring in ASR impacts grammatical correctness and semantic accuracy.

Thibault Bañeras-Roux, Mickaël Rouvier, Mickael Rouvier +210

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Zujin Guo +6 · 3w ago

Generate Your Talking Avatar from Video Reference

Ditch the static image: this method generates realistic talking avatars by learning from *videos* of the subject in completely different scenes.

Zujin Guo, Zhenhui Ye, Yi Ren +4

Computer Vision Multimodal Models Speech & Audio

Doyeop Kwak +3 · 3w ago

LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition

Visual cues become crucial for speech recognition when audio quality tanks in this challenging new benchmark derived from real-world conversations.

Doyeop Kwak, Jeongsoo Choi, Suyeon Lee +1

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Yurii Halychanskyi +2 · 3w ago

Accent Conversion: A Problem-Driven Survey of Sociolinguistic and Technical Constraints

Successfully converting accents requires balancing accent modification with speaker identity preservation, a challenge that this survey unpacks by tracing the evolution of techniques from DSP to neural methods.

Yurii Halychanskyi, Jianfeng Steven Guo, Volodymyr Kindratenko

Natural Language Processing Speech & Audio

Nazar Kozak · 3w ago

Predicting Upcoming Stuttering Events from Three-Second Audio: Stratified Evaluation Reveals Severity-Selective Precursors, and the Model Deploys Fully On-Device

Stuttering isn't random: you can predict severe blocks and sound repetitions from just 3 seconds of audio with a tiny model that runs on your phone.

Nazar Kozak

Natural Language Processing Speech & Audio

Yurii Halychanskyi +6 · 3w ago · also UIUC

Few-Shot Accent Synthesis for ASR with LLM-Guided Phoneme Editing

LLMs can guide phoneme editing to create synthetic accented speech from just a handful of examples, substantially improving ASR accuracy where training data is scarce.

Yurii Halychanskyi, Nimet Beyza Bozdag, Mark Hasegawa-Johnson +4

Natural Language Processing Speech & Audio Tool Use & Agents

Dominik Klement +5 · 3w ago · also Brno University of Technology

BUT System Description for CHiME-9 MCoRec Challenge

Integrating visual cues into a long-context ASR system slashes word error rate by 16% in multi-talker conversational recordings, proving the power of AV fusion.

Dominik Klement, Alexander Polok, Nguyen Hai Phong +3

Multimodal Models Natural Language Processing Speech & Audio

3w ago · also Norwegian University of Science and Technology, University of Palermo

A Knowledge-Driven Approach to Target Speech Extraction in the Presence of Background Sound Effects for Cinematic Audio Source Separation (CASS)

Unbury speech from cinematic sound effects by teaching the model to "listen" for how words are formed.

Chun-wei Ho, Sabato Marco Siniscalchi, Kai Li +1

Natural Language Processing Speech & Audio

Sharayu Nilesh Deshmukh +5 · 3w ago

Are DeepFakes Realistic Enough? Exploring Semantic Mismatch as a Novel Challenge

Current DeepFake detectors can be fooled by semantically inconsistent real audio and video, highlighting a critical blind spot in their ability to assess realistic manipulations.

Sharayu Nilesh Deshmukh, Kailash A. Hambarde, Joana C. Costa +3

Computer Vision Red-Teaming & Adversarial Robustness Speech & Audio

Earth Species Project · 3w ago

Beyond the Baseband: Adaptive Multi-Band Encoding for Full-Spectrum Bioacoustics Classification

Unlocking the full spectrum of animal sounds, previously discarded by standard audio models, can significantly improve bioacoustic classification.

Eklavya Sarkar, Marius Miron, David Robinson +8

Speech & Audio

Nina Seron-Abouelfadil +3 · 3w ago

Normativity and Productivism: Ableist Intelligence? A Degrowth Analysis of AI Sign Language Translation Tools for Deaf People

AI sign language translation tools, despite their promise, may actually reinforce ableism by prioritizing technical standardization over the cultural and linguistic nuances of Deaf communication.

Nina Seron-Abouelfadil, Nina Seron-Abouelfadil, Poppy Fynes +1

Constitutional AI & AI Ethics Natural Language Processing Speech & Audio

Eugen Beck +10 · 3w ago

AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR

General American English ASR performance doesn't guarantee similar accuracy across other English accents, as revealed by a new multi-accent call center dataset.

Eugen Beck, E. Beck, Sarah Beranek +8

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing +1

Nantes University · 3w ago · also Avignon University, LIA - Avignon University, LIUM - Le Mans University, LS2N - Nantes University

HATS: An Open Data Set Integrating Human Perception Applied to the Evaluation of Automatic Speech Recognition Metrics

Current ASR metrics, even those leveraging embeddings, fail to align with human perception of transcription quality, as revealed by a new human-annotated dataset.

Thibault Bañeras Roux, Thibault Bañeras-Roux, Jane Wottawa +4

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Jullajak Karnjanaekarin +7 · 3w ago

JaiTTS: A Thai Voice Cloning Model

Thai voice cloning just leapfrogged human performance on short-duration speech, thanks to a new model that directly handles code-switching and numerals.

Jullajak Karnjanaekarin, Pontakorn Trakuekul, Narongkorn Panitsrisit +5

Natural Language Processing Open-Source Models & Weights Speech & Audio

Apr 29, 2026

Trinnov Audio · 3w ago

Full band denoising of room impulse response in the wavelet domain with dictionary learning

Achieve significantly better room acoustics analysis by extending wavelet denoising to low frequencies.

Théophile Dupré, Romain Couderc, Miguel Moleron +3

Speech & Audio

Shuhao Xu +5 · 3w ago · also HKUST, Huawei

EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses

Forget static emotion labels – EmoTransCap lets you generate speech captions that actually track how emotions evolve in a conversation.

Shuhao Xu, Yifan Hu, Jingjing Wu +3

Data Curation & Synthetic Data Natural Language Processing Speech & Audio

UBA-CONICET · 3w ago · also Universidad de Chile, Universidad de San Andrés

A Toolkit for Detecting Spurious Correlations in Speech Datasets

Discover hidden biases in your speech datasets: this toolkit uses non-speech audio to reveal spurious correlations that inflate performance metrics.

Lara Gauder, Pablo Riera, Andrea Slachevsky +3

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Speech & Audio

3w ago · also Tencent AI

Diffusion Reconstruction towards Generalizable Audio Deepfake Detection

Audio deepfake detectors trained on diffusion-reconstructed "hard" examples generalize far better to unseen attacks, slashing error rates compared to standard training.

Bo Cheng, Songjun Cao, Xiaoming Zhang +3

Red-Teaming & Adversarial Robustness Speech & Audio

3w ago

DiffAnon: Diffusion-based Prosody Control for Voice Anonymization

Finally, voice anonymization offers a smooth, tunable knob to balance privacy and prosody, instead of forcing you to pick just one.

Ismail Rasim Ulgen, Ismail Rasim Ulgen, Zexin Cai +4

Natural Language Processing Speech & Audio

Independent Researcher · 3w ago

Recurrence-Based Nonlinear Vocal Dynamics as Digital Biomarkers for Depression Detection from Conversational Speech

Depression leaves a detectable fingerprint in the way our vocal system revisits acoustic states during conversation, revealing new avenues for digital biomarkers.

H. Samanta, Himadri S Samanta

Natural Language Processing Scientific Discovery & Drug Design Speech & Audio

Tsinghua AI · 3w ago · also Tencent AI

SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding

Semantic priors in neural speech codecs hit a wall: their benefits plateau beyond 6 kbps, revealing a fundamental limit to improving intelligibility at higher bitrates.

Mingyu Zhao, Zijian Lin, Kun Wei +2

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Natural Language Processing +1

3w ago · also Gilbert AI Lab, USC

The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation

Widely used emotion embedding similarity metrics for speech generation are more sensitive to speaker and linguistic features than actual emotion, rendering them unreliable for evaluating emotional expressiveness.

Yun-Shao Tsai, Yi-Cheng Lin, Huang-Cheng Chou +8

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

3w ago · also CAS, SJTU

Dual-LoRA: Parameter-Efficient Adversarial Disentanglement for Cross-Lingual Speaker Verification

Adversarial training doesn't have to hurt speaker verification: by explicitly modeling language, you can disentangle speaker and language characteristics without sacrificing speaker discriminability.

Qituan Shangguan, Junhao Du, Kunyang Peng +4

Natural Language Processing Red-Teaming & Adversarial Robustness Speech & Audio

Tobias Bystrich +3 · 3w ago

Selective Augmentation: Improving Universal Automatic Phonetic Transcription via G2P Bootstrapping

Transferring phonetic knowledge from one language to another can dramatically improve automatic phonetic transcription, even enabling the recognition of entirely new phonetic features.

Tobias Bystrich, Julia M. Pritzen, Christoph A. Schmidt +1

Data Curation & Synthetic Data Natural Language Processing Speech & Audio

Apr 28, 2026

Yuxin Zhang +21 · 3w ago

Step-Audio-R1.5 Technical Report

RLVR, the dominant training paradigm for audio language models, may be turning them into unfeeling "answering machines" that excel on benchmarks but fail the vibe check.

Yuxin Zhang, Xiangyu Tony Zhang, Xiangyu Zhang +19

Multimodal Models Reasoning & Chain-of-Thought RLHF & Preference Learning +1

Venkata Pushpak Teja Menta · 3w ago

Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

Achieve near-native Indic TTS from a non-Indic base model at zero commercial-training-data cost by cleverly combining phoneme space unification, LoRA adaptation, and voice-prompt recovery.

Venkata Pushpak Teja Menta

Natural Language Processing Open-Source Models & Weights Speech & Audio

Nankai University · 3w ago · also NJUST, PKU, Tongyi Lab

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

Skip the bulky bidirectional teacher: this new method trains a fast, causal audio-video generator directly, slashing sampling steps while maintaining top-tier quality.

Yupeng Zhou, Yupeng Zhou, Lianghua Huang +17

Computer Vision Multimodal Models Speech & Audio

3w ago · also Edinburgh, MBZUAI

Unrequited Emotions: Investigating the Gaps in Motivation and Practice in Speech Emotion Recognition Research

SER's noble aspirations of voice-activated healthcare are undermined by datasets that bear little resemblance to real-world emotional expression.

Taryn Wong, Zeerak Talat, Hanan Aldarmaki +1

Constitutional AI & AI Ethics Natural Language Processing Speech & Audio

Ke Wang +3 · 3w ago

ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations

Emotion recognition can be significantly improved by adapting to individual expressive traits, with ML-SAN outperforming static models in capturing nuanced emotional expressions.

Ke Wang, Kexue Wang, Yinfeng Yu +1

Multimodal Models Natural Language Processing Speech & Audio

Tianshui Chen +5 · 3w ago

Personalized Cross-Modal Emotional Correlation Learning for Speech-Preserving Facial Expression Manipulation

Overcome the scarcity of paired data in speech-preserving facial expression manipulation by personalizing visual-language model prompts with individual visual information and correlating changes in visual and semantic features.

Tianshui Chen, Yujie Zhu, Jianman Lin +3

Computer Vision Multimodal Models Speech & Audio

3w ago

Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding

Explicitly modeling the dependency between dialogue context and current utterance as an "interpretation cue" significantly boosts conversational multimodal understanding.

Zhaoyan Pan, Hengyang Zhou, Xiangdong Li +5

Multimodal Models Natural Language Processing Speech & Audio

Ming-Chen Huang +10 · 3w ago

ASAP: An Azimuth-Priority Strip-Based Search Approach to Planar Microphone Array DOA Estimation in 3D

Ming-Chen Huang, Ming Huang, Shuting Xu +8

Speech & Audio

Qazvin Islamic Azad University · 3w ago · also Islamic Azad University

WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

WhisperPipe achieves 3-5x lower latency than existing streaming ASR solutions while consuming significantly less memory, making it a game-changer for real-time applications.

Erfan Ramezani, E. Ramezani, Mohammad Mahdi Giahi +4

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Speech & Audio

Chun-Yi Kuan +2 · 3w ago

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

Semantic-level uncertainty estimation methods significantly enhance the reliability of audio-aware language models, outperforming traditional approaches in critical reasoning tasks.

Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Ji-Eun Kim +1 · 3w ago

Korean aegyo speech shows systematic F1 increase to signal childlike qualities

Want to sound cute? Korean speakers systematically raise their F1 formant when using "aegyo" speech, effectively mimicking a smaller vocal tract.

Ji-Eun Kim, V. Dellwo

Natural Language Processing Speech & Audio

Xuzheng He +9 · 3w ago

SymphonyGen: 3D Hierarchical Orchestral Generation with Controllable Harmony Skeleton

SymphonyGen's 3D hierarchical approach to music generation lets you steer the overall structure of a symphony without sacrificing the richness and detail of the orchestration.

Xuzheng He, Nan Nan, Zhilin Wang +7

Architecture Design (Transformers, SSMs, MoE) Speech & Audio

Jaskirat Sudan +3 · 3w ago

Angular similarity in supervised contrastive learning can match the performance of cosine similarity for deepfake audio detection, but with significantly less reliance on computationally expensive negative sampling.

Jaskirat Sudan, Hashim Ali, Surya Subramani +1

Speech & Audio Training Efficiency & Optimization

A. Abebe · 3w ago

One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech

Synthetically generated data from multi-model ensemble distillation can significantly boost the intelligibility of cross-lingual voice cloning systems for scientific speech without sacrificing speaker similarity.

A. Abebe

Natural Language Processing Speech & Audio

D. Gogoi +2 · 3w ago

Cross-Linguistic Rhythmic and Spectral Feature-Based Analysis of Nyishi and Adi: Two Under-Resourced Languages of Arunachal Pradesh

Under-resourced languages can be accurately differentiated using rhythm alone, but combining rhythmic and spectral features unlocks even higher classification accuracy.

D. Gogoi, P. Gogoi, Yang Saring

Natural Language Processing Speech & Audio

Chong-Xin Gan +7 · 3w ago · also PolyU

UNet-Based Fusion and Exponential Moving Average Adaptation for Noise-Robust Speaker Recognition

Speaker recognition accuracy improves dramatically when leveraging a U-Net-based fusion of noisy and enhanced speech, coupled with a novel training strategy.

Chong-Xin Gan, Peter Bell, Man-Wai Mak +5

Architecture Design (Transformers, SSMs, MoE) Speech & Audio

Yichen Wang +1 · 3w ago

Huí Sù: Co-constructing a Dual Feedback Apparatus

Forget predictable AI tools – this performance co-creates music through entangled feedback loops between humans and AI instruments, blurring the lines of agency.

Yichen Wang, C. Martin

Speech & Audio Tool Use & Agents

Venkata Pushpak Teja Menta · 3w ago

PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech

Commercial TTS systems nailing WER scores can still butcher Indic accents, especially retroflex articulation, and this new benchmark exposes exactly where they fail.

Venkata Pushpak Teja Menta

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Apr 27, 2026

NVIDIA · 3w ago · also Amazon Science, Microsoft Research, UW, Music X Lab +1

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

Multimodal models can now achieve state-of-the-art performance in real-world tasks like document understanding and audio-video comprehension with significantly reduced inference latency thanks to novel token-reduction techniques.

NVIDIA, Amala Sanjay Deshmukh, K. Chumachenko, Tuomas Rintamaki +209

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Speech & Audio

Liang Xu +4 · 3w ago

Speech Enhancement Based on Drifting Models

Ditch the slow sampling: DriftSE achieves state-of-the-art speech enhancement in a single step, outperforming diffusion models with a novel equilibrium-based approach.

Liang Xu, Diego Caviedes-Nozal, B. Kleijn +2

Speech & Audio Training Efficiency & Optimization

Leekyung Kim +1 · 3w ago

An Event-Based Sequence Modeling Approach to Recognizing Non-Triad Chords with Oversegmentation Minimization

Segmenting music into meaningful chunks and predicting chords sequence-to-sequence boosts recognition accuracy, especially for those pesky, rare non-triad chords that plague existing systems.

Leekyung Kim, Jonghun Park

Natural Language Processing Speech & Audio

3w ago

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

ASR systems can now be more trustworthy: this work shows how to train them to abstain from transcribing uncertain segments, leading to more reliable outputs.

Wen-Chin Huang, Yuhang Qiu, Bohan Li +5

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

3w ago

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

Audio-language models are cheating on benchmarks, acing tests even when they barely listen.

Leonardo Haw-Yang Foo, Chih-Kai Yang, Chen-An Li +2

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Institut Polytechnique de Paris · 3w ago

S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

Shrinking massive audio foundation models by up to 61x is now possible without significant performance loss, thanks to a novel self-supervised distillation approach that works directly on embeddings.

Mohammed Ali El Adlouni, Aurian Quelennec, Pierre Chouteau +2

Inference & Quantization Speech & Audio Training Efficiency & Optimization

Apr 26, 2026

Zhen Ye +10 · 3w ago

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Disentangling high-level cross-modal reasoning from low-level modality-specific refinement in talking head generation yields superior lip-sync accuracy, video quality, and audio quality compared to entangled approaches.

Zhen Ye, Xu Tan, Aoxiong Yin +8

Computer Vision Multimodal Models Speech & Audio