100 papers published across 8 labs.
G-STAR tackles long-form, multi-speaker ASR by giving Speech-LLMs time-aware speaker tracking, enabling robust identity linking across chunks.
Iteratively refining target speaker extraction *without* retraining a model unlocks significant performance gains, offering a flexible and efficient approach to speech separation.
Uncover the hidden vulnerabilities of your voice anti-spoofing model with a new tool that quantifies the probability of failure against unseen speech synthesis attacks.
Skip the training: SimulU achieves state-of-the-art simultaneous speech translation by cleverly exploiting pre-trained models, opening the door to truly plug-and-play multilingual communication.
Achieve near-perfect audio steganography even under heavy MP3 compression by optimizing latent reconstruction and diffusion inversion errors.
Forget paired video-music training data: V2M-Zero aligns video and music by matching the *timing* of changes within each modality, not the content itself.
You can now automatically isolate coughs from audio with 96% precision using just the first three layers of a pre-trained XLS-R model, paving the way for smartphone-based TB screening.
LLM-based ASR can be sped up by 4.4x with minimal accuracy loss by using a CTC encoder to speculatively generate draft transcriptions.
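For intuition, here is a minimal sketch of the draft-and-verify idea behind CTC-based speculative decoding (not the paper's code; `ctc_draft` and `llm_next_token` are hypothetical placeholders, and real implementations verify all draft tokens in a single batched LLM forward pass, which is where the speedup comes from):

```python
# Illustrative sketch of CTC-draft speculative decoding (not the paper's implementation).
# `ctc_draft` and `llm_next_token` are hypothetical callables: a fast CTC decoder that
# proposes a continuation, and the LLM's greedy next-token prediction used for verification.

def speculative_decode(audio, ctc_draft, llm_next_token, max_len=200):
    """Accept the longest draft prefix that the LLM would also have produced."""
    output = []
    while len(output) < max_len:
        draft = ctc_draft(audio, prefix=output)           # cheap draft continuation
        accepted = 0
        for tok in draft:
            # In practice all draft tokens are checked in one batched LLM pass;
            # a per-token call is shown here only to make the accept/reject logic explicit.
            if llm_next_token(audio, output) == tok:
                output.append(tok)
                accepted += 1
            else:
                break
        if accepted == 0:                                  # draft rejected: take one LLM step
            tok = llm_next_token(audio, output)
            if tok is None:                                # end of sequence
                break
            output.append(tok)
    return output
```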
LoRA fine-tuning can significantly boost the voice cloning capabilities of LLM-based TTS systems, but only if the training data is acoustically diverse enough.
Geospatial context is a surprisingly effective prior for audio tagging, especially when sounds are acoustically similar, leading to improved performance over audio-only methods.
Speech quality assessment is skewed: male listeners consistently give higher scores than female listeners, and standard MOS models learn and perpetuate this bias.
Explicitly aligning audio and video streams in a multimodal Transformer boosts emotion recognition, showing that ignoring frame-rate differences hurts performance.
Human-preference aligned audio generation from video is now possible, as V2A-DPO surpasses previous methods by directly optimizing for semantic consistency, temporal alignment, and perceptual quality.
LLMs can spot fake words in speech by recognizing common editing patterns, but this reliance on learned biases hinders generalization to new manipulation techniques.
Ditch slow, multi-step sampling for target speaker extraction: AlphaFlowTSE achieves faster, one-step generation with improved speaker similarity and real-world generalization.
Speech tokenizers, despite being crucial for multimodal LLMs, primarily capture phonetic information, creating a mismatch with text-derived semantics that hinders downstream performance.
Wearable sensors and speech AI can now unobtrusively reveal the hidden communication dynamics driving hospital caregiver workload and stress.
Speech deepfake detection gets a reasoning upgrade: HIR-SDD uses chain-of-thought prompting with Large Audio Language Models to not only detect fakes but also explain *why* it thinks they're fake.
Adapting ASR models to Huntington's Disease speech not only improves accuracy, but also reveals how biomarker-based supervision can reshape error patterns in ways that reflect disease severity.
Encoder-only multi-talker ASR can now rival LLM-based systems in accuracy while drastically reducing computational cost, thanks to a novel distillation approach and talker-count routing.
A single LLM can now handle both non-streaming and streaming ASR, opening the door to more flexible and efficient speech recognition systems.
You can slash ASR error rates in low-resource languages by over 60% with a simple continued pretraining recipe.
Imagine an XR experience where you can selectively isolate and enhance individual sound sources in real-time, making chaotic audio environments crystal clear.
A fully open-source speech understanding model, OSUM-Pangu, proves that competitive performance is achievable on non-CUDA hardware, challenging the dominance of GPU-centric ecosystems.
A single system now rivals or beats specialized models across ASR, voice activity detection, language ID, and punctuation, setting a new bar for industrial-grade speech processing.
Fair-Gate disentangles speaker identity and sex in voice biometrics, boosting fairness without sacrificing accuracy by explicitly routing features through identity-specific and sex-specific pathways.
Speech-aware LLMs are surprisingly bad at speaker verification, but a simple embedding injection trick closes the gap with dedicated systems while preserving the LLM's language abilities.
A nose-mounted microphone and vibration sensor combo unlocks robust, low-audibility speech interfaces for always-on AI interaction, even in noisy environments.
Current Large Audio Language Models (LALMs) struggle with basic audio understanding tasks like noise localization and cross-lingual speech, with some performing worse than random chance, despite excelling at speech recognition.
A Goldilocks zone exists for neural audio codec quantization depth, where intermediate levels strike the best balance between suppressing adversarial noise and preserving speech content for robust ASR.
Tired of LLM judges hallucinating when evaluating long, detailed speech captions? EmoSURA offers a more reliable, audio-grounded alternative by verifying atomic perceptual units.
LALMs struggle to handle multiple concurrent audio inputs, but a simple input permutation strategy can significantly boost their multi-audio understanding without retraining.
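As a rough illustration of what such a permutation strategy can look like (not necessarily the paper's method; `score_answer` is a hypothetical wrapper around an LALM):

```python
# Minimal sketch of permutation-averaged inference over multiple audio inputs.
# `score_answer` is a hypothetical callable returning the model's score for a
# candidate answer given the audio clips in a fixed presentation order.
from itertools import permutations

def permutation_averaged_score(audios, question, answer, score_answer):
    """Average the answer score over all orderings of the audio inputs, reducing
    sensitivity to the (arbitrary) order in which clips are presented.
    For many clips, a random subset of permutations can be sampled instead."""
    perms = list(permutations(audios))
    scores = [score_answer(list(p), question, answer) for p in perms]
    return sum(scores) / len(scores)
```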
Controllable emotion style transfer in speech is now possible without needing paired data, opening new avenues for data augmentation and expressive AI.
Statistical regularities in phoneme frequency distributions, previously thought to arise from optimization, may instead be natural consequences of diachronic sound change.
Spatial audio cues and directional priors can be jointly learned end-to-end to significantly boost keyword spotting accuracy in noisy environments, outperforming traditional cascaded approaches.
Unlock realistic acoustic simulations with a text prompt: fine-tuning a text-to-audio model generates plausible room impulse responses, even with limited paired data.
Modern speech enhancement algorithms may not improve ASR performance in realistic noisy environments, challenging assumptions about their effectiveness in real-world applications.
Finally, a single model that can generate both your face and voice, convincingly controlled by text prompts and reference clips.
Forget slow, iterative distributed signal estimation: dMWF achieves optimal multichannel Wiener filtering in wireless acoustic sensor networks without iteration, even when nodes observe different sources.
You can predict the best moment to offer emotional support just by listening to someone's voice, no text needed.
Double the emotion conversion accuracy in voice conversion models with a simple prefix that jointly controls sequence modulation and acoustic realization.
Unlock full-duplex speech-to-speech dialogue without VAD limitations using chunk-wise micro-turns and special control tokens to steer LLM behavior in a cascaded pipeline.
Text prompts might be inflating your SLLM's performance: spoken prompts reveal a significant performance gap, especially in low-resource languages.
Achieve comparable speech restoration quality with conditional diffusion models using 10x fewer neural network evaluations via a novel iSDE solver.
Transform unstructured audio-visual signals into machine-readable structured knowledge with the Logics-Parsing-Omni model, which enforces strict alignment between high-level semantics and low-level facts.
Forget tweaking knobs: this new Gram-matrix-based audio representation lets you *retrieve* the perfect, editable audio effect preset, outperforming standard methods.
A meticulously curated, bidirectional English-German corpus of parliamentary proceedings now offers researchers a goldmine for dissecting the nuances of translation, interpreting, and language variation through an information-theoretic lens.
By cleverly "self-rephrasing" LLM outputs, this work coaxes reasoning LLMs to handle audio inputs without sacrificing their chain-of-thought abilities.
Forget confidence scores: a modality-aware early exit strategy for spoken language models slashes decoding costs without sacrificing accuracy or perceptual quality, revealing that speech tokens require specialized handling compared to text.
Forget coarse-grained audio-visual tasks: RA-SSU offers frame-level sound source understanding with two new datasets and a transformer-based benchmark.
Forget black-box audio synthesis: this differentiable engine sound model gives you interpretable knobs to control physical parameters like valve dynamics and exhaust resonances.
Contrastive Decoding's power-up for audio language models hinges on fixing specific error types, like uncertainty and audio absence, but don't expect it to magically fix flawed reasoning.
Get state-of-the-art spoken QA performance by adding lightweight speech modules to frozen VL models and training on synthetically generated speech data, sidestepping the need for massive multimodal datasets.
Studio-quality speech enhancement without hallucination is now possible, thanks to a clever combination of dry-target finetuning and flow-matching.
VR agents that "listen" to your tone, not just your words, elicit significantly better user experiences.
By open-sourcing a fully reproducible, optimized Band-Split RNN for music separation, this paper reveals the surprisingly large gap between published results and what a faithful reimplementation can achieve, even with significant effort.
Forget wavelets: transformers with Koopman operator-derived features unlock superior ECG classification, especially in complex multi-class scenarios.
Mamba's superior sequence modeling lets you generate longer, more realistic dance sequences than clunky Transformers ever could.
Text-to-audio diffusion just got a whole lot faster: SoundWeaver slashes latency by up to 3x without retraining, simply by cleverly reusing similar audio samples.
Adversarial training and synthetic data can significantly boost multilingual speaker verification performance, even with limited training data.
A modular statistical transformation pipeline boosts audio deepfake detection accuracy by 10.7% in cross-domain scenarios, without needing labeled target data.
LoopLens reveals a stark divide in how musicians with and without domain expertise approach creative search for music loops, highlighting the need for vocabulary-independent discovery tools.
Open-source TTS gets a serious upgrade with Fish Audio S2, offering instruction-following control via natural language and production-ready streaming performance.
Speech LLMs, though lagging in accuracy, capture the nuances of human emotion perception better than traditional supervised methods, a finding revealed by the new VoxEmo benchmark.
Paralinguistic speech tasks aren't as language-agnostic as we thought: cross-lingual transfer patterns reveal systematic language dependencies.
Emirati Arabic finally gets a dedicated, sociolinguistically rich speech corpus, opening doors for better ASR/TTS in this low-resource language.
LALMs can now better capture the nuances of human emotion, moving beyond single-label predictions with a new ambiguity-aware training framework that aligns model outputs with the full spectrum of human perception.
Turns out your always-on speech dialogue model is leaking speaker identity like a sieve, but a simple feature-domain anonymization technique can boost privacy by 3.5x with minimal impact on performance.
Language models can beat FLAC for lossless audio compression at 8-bit and 16-bit, but their advantage shrinks at 24-bit, revealing a challenge for high-fidelity audio.
A new benchmark, PathBench, finally allows for standardized comparison of pathological speech assessment methods, revealing that the proposed Dual-ASR Articulatory Precision (DArtP) metric outperforms existing reference-free approaches.
Spectrograms beat MFCCs for South Asian sound classification, unlocking more accurate analysis of complex, overlapping urban soundscapes.
Ditch slow, sequential decoding: NLE achieves 27x speedup over autoregressive ASR by using a non-autoregressive, LLM-based transcript editing approach.
MLLMs can now reliably interpret electromagnetic signals even in noisy environments, thanks to a new training framework and benchmark designed specifically for this challenging domain.
A dual-branch Transformer with safe cross-attention overcomes missing visual cues in emotion recognition by dynamically relying on audio, achieving state-of-the-art results on Aff-Wild2.
Silence timeouts are out: DualTurn learns natural turn-taking from unlabeled dual-channel audio, outperforming larger models and anticipating turns more accurately.
Unlock AV speech recognition for any language, even with zero labeled video data, by training on synthetically generated talking-head videos.
Even when overall accuracy seems balanced, audio deepfake detection models can exhibit significant gender bias, masked by standard metrics like EER.
Speech models can now be quantized to INT4 with near-lossless performance thanks to a new evolution strategy-based calibration method tailored for audio activations.
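The paper's calibration procedure isn't reproduced here; the toy sketch below only shows the general idea of searching an INT4 quantization scale with a simple (1+1) evolution strategy on held-out activations:

```python
# Toy sketch: calibrating an INT4 quantization scale with a (1+1) evolution strategy
# (illustrative only; not the paper's algorithm, granularity, or hyperparameters).
import numpy as np

def quantize_int4(x, scale):
    q = np.clip(np.round(x / scale), -8, 7)        # signed 4-bit range [-8, 7]
    return q * scale

def es_calibrate_scale(activations, iters=200, sigma=0.05, seed=0):
    """Search a per-tensor scale minimizing MSE between activations and their
    INT4 reconstruction, using simple log-space Gaussian mutations."""
    rng = np.random.default_rng(seed)
    scale = np.abs(activations).max() / 7.0         # naive max-based starting point
    best_err = np.mean((activations - quantize_int4(activations, scale)) ** 2)
    for _ in range(iters):
        candidate = scale * np.exp(sigma * rng.standard_normal())   # mutate the scale
        err = np.mean((activations - quantize_int4(activations, candidate)) ** 2)
        if err < best_err:                           # keep the mutation only if it helps
            scale, best_err = candidate, err
    return scale

# Example on synthetic "activations":
acts = np.random.default_rng(1).standard_normal(10_000) * 3.0
print(es_calibrate_scale(acts))
```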
Now a single speech foundation model can generate diverse utterance-level representations, like semantics and speaker identity, opening new possibilities for multimodal and multilingual applications.
Multi-view Echo data can be used to train ECG encoders that are 18x smaller yet outperform larger models at predicting cardiac morphology.
Range-Null Space Decomposition offers a surprisingly effective and scalable approach to neural vocoders, outperforming existing methods while using a lightweight network structure.
Foley-Flow achieves state-of-the-art video-to-audio generation by aligning audio-visual representations with masked modeling, enabling rhythmic synchronization that was previously lacking.
A new benchmark reveals how existing audio-visual segmentation models crumble when faced with the dynamic, ever-changing audio and visual environments of the real world.
Uncover deepfakes by exploiting the tell-tale audio-visual inconsistencies embedded within generative models' cross-attention mechanisms.
By explicitly modeling speech, SAVE leapfrogs existing audio-visual methods for video-text retrieval, achieving substantial gains over the state-of-the-art.
Self-supervised and visually grounded models are closing the gap in explaining how infants learn language from raw acoustic and visual input, challenging the need for strong linguistic priors.
Achieve zero-shot voice conversion competitive with methods requiring more data or training, using a simple, invertible linear method to disentangle speech content from speaker timbre.
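The paper's exact linear method isn't spelled out in this summary; as one hedged illustration of what an invertible linear disentanglement can look like, a whiten-then-recolor transform on frame-level features is sketched below:

```python
# Illustrative sketch of an invertible linear "whiten then re-color" transform on
# frame-level speech features (a stand-in example; the paper's method may differ).
import numpy as np

def coloring_transform(features):
    """Return the mean and a Cholesky factor of the feature covariance (invertible)."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return mu, np.linalg.cholesky(cov)

def convert(source_feats, src_stats, tgt_stats):
    """Whiten with the source speaker's statistics, re-color with the target's."""
    mu_s, L_s = src_stats
    mu_t, L_t = tgt_stats
    whitened = np.linalg.solve(L_s, (source_feats - mu_s).T).T   # remove source timbre
    return whitened @ L_t.T + mu_t                               # impose target timbre
```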
Unlock whisper-to-normal speech conversion with a clever trick: synthesize whispered speech from readily available normal speech data to massively augment training.
Finally, realistic 3D avatars can maintain natural eye contact and spatial awareness during conversations, moving beyond disembodied "talking heads."
You can now poison a zero-shot TTS model to prevent it from generating speech for specific target speakers, but scaling this defense to a large number of speakers remains a challenge.
Unleashing the power of multi-view lip reading, this new framework lets you extract a target speaker's voice even from challenging, non-frontal video angles.
SLMs still lag behind omni language models in multi-turn conversational style control, as revealed by the new StyleBench benchmark.
Forget expensive, noisy recordings: this procedural engine sound dataset offers 19 hours of clean, annotated audio for training better automotive sound AI.
You can protect patient privacy and still detect Parkinson's from speech, but only if you choose the right anonymization method.
Achieve near-perfect accuracy in real-time malicious speech detection without sacrificing transcription speed, using a lightweight model built on Whisper.
Forget full fine-tuning: Low-rank adapters let you adapt speech enhancement models to new acoustic environments on-device, updating less than 1% of parameters for significant quality gains.
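As a generic illustration of the adapter idea (not the paper's architecture, placement, or ranks), a LoRA wrapper around a frozen linear layer might look like this:

```python
# Generic LoRA wrapper around a frozen linear layer (illustrative sketch only;
# layer sizes and rank below are arbitrary, not taken from the paper).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the pre-trained weights
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)   # trainable
        self.up = nn.Linear(rank, base.out_features, bias=False)    # trainable
        nn.init.zeros_(self.up.weight)         # adapter starts as an exact no-op
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * self.up(self.down(x))

# Only the low-rank adapters are updated; for this layer size the trainable
# fraction is already under 1%, and it shrinks further across a full model.
layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```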
Forget massive multilingual models: fine-tuning on just 5 hours of speech data from a related language slashes ASR error rates for an endangered language, rivaling the performance of Whisper-Small.
Zero-shot multilingual TTS models stumble when synthesizing Kashmiri, but a script-aware, flow-based adaptation strategy unlocks intelligible speech.
Achieve accent-specific speech synthesis without any accented training data by cleverly combining phonological rules with multilingual TTS.
Control the accent of your TTS output without needing any accented training data, by transferring accent characteristics from other languages.