May 1 – May 8, 2026

Speech & Audio - Weekly Roundup

42 papers published across 2 labs.

Top Papers

May 6, 2026

Yijing Tu +3·2w ago

Stream-T1: Test-Time Scaling for Streaming Video Generation

Achieve superior video generation quality and temporal coherence without expensive retraining by intelligently scaling and steering diffusion models at test time.

Yijing Tu, Wenchuan Wang, Chunxiao Liu +1

Computer Vision Speech & Audio

Jinju Lee·2w ago

Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation

Fine-tuning a chord generation model on a new genre requires only a surprisingly small amount of old-genre data to prevent catastrophic forgetting, but objective metrics don't always capture subjective stylistic preferences.

Jinju Lee

Speech & Audio

Yangchen Yu +7·2w ago

To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition

Standard multimodal fusion can hurt performance in emotion recognition, but this new approach knows when to drop modalities, leading to state-of-the-art results.

Yangchen Yu, Qian Chen, Jia Li +5

Multimodal Models Natural Language Processing Speech & Audio

Michael Soprano +2·2w ago

Beyond Seeing Is Believing: On Crowdsourced Detection of Audiovisual Deepfakes

Human crowdsourcing struggles to reliably identify audiovisual deepfakes, especially when both audio and video are manipulated, suggesting current detection methods may overestimate human capabilities.

Michael Soprano, A. Cioci, Stefano Mizzaro

Computer Vision Constitutional AI & AI Ethics Speech & Audio

Stefano Cecconello +4·2w ago

From Beats to Breaches: How Offensive AI Infers Sensitive User Information from Playlists

Your innocent Spotify playlists are leaking surprisingly accurate predictions about your age, habits, and even personality traits, thanks to a new AI attack.

Stefano Cecconello, Mauro Conti, Luca Pajola +2

Recommendation & Information Retrieval Red-Teaming & Adversarial Robustness Speech & Audio

All Papers (42)

May 6, 2026

Yijing Tu +3·2w ago

Stream-T1: Test-Time Scaling for Streaming Video Generation

Achieve superior video generation quality and temporal coherence without expensive retraining by intelligently scaling and steering diffusion models at test time.

Yijing Tu, Wenchuan Wang, Chunxiao Liu +1

Computer Vision Speech & Audio

Jinju Lee·2w ago

Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation

Jinju Lee

Speech & Audio

Yangchen Yu +7·2w ago

To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition

Standard multimodal fusion can hurt performance in emotion recognition, but this new approach knows when to drop modalities, leading to state-of-the-art results.

Yangchen Yu, Qian Chen, Jia Li +5

Multimodal Models Natural Language Processing Speech & Audio

Michael Soprano +2·2w ago

Beyond Seeing Is Believing: On Crowdsourced Detection of Audiovisual Deepfakes

Michael Soprano, A. Cioci, Stefano Mizzaro

Computer Vision Constitutional AI & AI Ethics Speech & Audio

Stefano Cecconello +4·2w ago

From Beats to Breaches: How Offensive AI Infers Sensitive User Information from Playlists

Your innocent Spotify playlists are leaking surprisingly accurate predictions about your age, habits, and even personality traits, thanks to a new AI attack.

Stefano Cecconello, Mauro Conti, Luca Pajola +2

Recommendation & Information Retrieval Red-Teaming & Adversarial Robustness Speech & Audio

Zheng Fang +4·2w ago

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

Turns out you only need to tweak a few key audio tokens to jailbreak audio language models, opening the door to faster, more targeted attacks.

Zheng Fang, Xiaosen Wang, Shenyi Zhang +2

Red-Teaming & Adversarial Robustness Speech & Audio Training Efficiency & Optimization

Zeng Ren +3·2w ago

Library learning with e-graphs on jazz harmony

E-graphs can help AI learn the unwritten rules of jazz harmony, mirroring how human musicians internalize complex musical patterns.

Zeng Ren, Maddy Bowers, Xinyi Guan +1

Code Generation & Program Synthesis Speech & Audio

Yukun Chen +4·2w ago

VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

Unlock scalable, high-quality singing voice synthesis by directly generating structured musical scores from audio, outperforming existing systems on multiple datasets.

Yukun Chen, Tianrui Wang, Zhaoxi Mu +2

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Speech & Audio

Xuanhao Zhang +1·2w ago

Stage-adaptive audio diffusion modeling

Audio diffusion models can be trained more efficiently by dynamically adjusting optimization strategies based on the evolving balance between semantic acquisition and fine-detail refinement during training.

Xuanhao Zhang, Chang Li

Speech & Audio Training Efficiency & Optimization

Leying Zhang +4·2w ago

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

LLMs can now evaluate audio as well as humans, without task-specific training, thanks to a new instruction-driven framework.

Leying Zhang, Bowen Shi, Haibin Wu +2

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Cyril Allauzen +4·2w ago

Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB)

Audio-native LLMs still lag behind cascaded architectures in key audio tasks, suggesting that the multimodal promise of LLMs isn't quite ready for prime time in the sound domain.

Cyril Allauzen, Tom Bagby, G. Heigold +2

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Rajeshwar Tripathi +3·2w ago

Hearing the Ocean: Bio-inspired Gammatone-CNN framework for Robust Underwater Acoustic Target Classification

Bio-inspired signal processing lets you hear subtle underwater sounds better than ever, achieving 98.41% accuracy in classifying targets even in noisy conditions.

Rajeshwar Tripathi, Sandeep Kumar, Monika Aggarwal +1

Architecture Design (Transformers, SSMs, MoE) Speech & Audio

Dongheon Lee +6·2w ago

Spatial-Magnifier: Spatial upsampling for multichannel speech enhancement

Unlock near-oracle speech enhancement performance from compact microphone arrays by virtually expanding their spatial coverage with a novel neural network.

Dongheon Lee, Ashutosh Pandey, Sanjeel Parekh +4

Architecture Design (Transformers, SSMs, MoE) Speech & Audio Training Efficiency & Optimization

May 5, 2026

2w ago·also Antwerp

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

Aesthetic quality unlocks better generalization in AI-generated music popularity prediction, beating models trained solely on engagement metrics.

Jaavid Aktar Husain, Dorien Herremans +1

Recommendation & Information Retrieval Speech & Audio

IIIT-Delhi·2w ago·also Guru Gobind Singh Indraprastha University

DECKER: Domain-invariant Embedding for Cross-Keyboard Extraction and Recognition

Even with domain adaptation, your keystrokes are still vulnerable to acoustic side-channel attacks across diverse keyboards, users, and noisy environments.

B. B. P. Maurya, Nitin Choudhury, Daksh Agarwal +1

Red-Teaming & Adversarial Robustness Speech & Audio

2w ago·also Huawei

Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning

Automating stage lighting control across diverse venues is now possible without expert demonstrations, thanks to a novel imitation learning approach that decomposes global color distributions into individual light controls.

Zijian Zhao, Dian Jin, Zijing Zhou +1

Robotics & Embodied AI Speech & Audio

Van-Phat Thai +3·2w ago

Contrastive Regularization for Accent-Robust ASR

Make your ASR models 25% more accent-robust with this surprisingly simple contrastive loss trick.

Van-Phat Thai, Aradhya Dhruv, D. Pham +1

Natural Language Processing Speech & Audio Training Efficiency & Optimization

Pham Hoang Hai +3·2w ago

Towards Open World Sound Event Detection

Sound event detection gets a reality check: a new framework tackles the messy, unpredictable world of unseen sounds, not just the curated ones.

Pham Hoang Hai, Le Trong Minh, Le Hoang Son +1

Computer Vision Speech & Audio

2w ago

PHALAR: Phasors for Learned Musical Audio Representations

Stem retrieval accuracy leaps forward by 70% thanks to a new architecture that finally respects the phase of music.

Davide Marincione, Michele Mancusi, G. Strano +4

Architecture Design (Transformers, SSMs, MoE) Speech & Audio Training Efficiency & Optimization

Manan Mittal +3·2w ago

Adaptive Diagonal Loading for Norm Constrained Beamforming

Guaranteeing stable beamforming in dynamic acoustic environments is now possible with a novel adaptive diagonal loading method that strictly bounds White Noise Gain.

Manan Mittal, R. Corey, John R. Buck +1

Architecture Design (Transformers, SSMs, MoE) Speech & Audio Training Efficiency & Optimization

Ragib Amin Nihal +4·2w ago

Ecologically-Constrained Task Arithmetic for Multi-Taxa Bioacoustic Classifiers Without Shared Data

Forget federated learning: bioacoustic classifiers can be unified across 661 species by simply averaging independently trained task vectors, unlocking a collaborative, privacy-preserving paradigm.

Ragib Amin Nihal, Benjamin Yen, Runwu Shi +2

Data Curation & Synthetic Data Scientific Discovery & Drug Design Speech & Audio

Busayo Awobade +2·2w ago

AfriVox-v2: A Domain-Verticalized Benchmark for In-the-Wild African Speech Recognition

Modern speech models struggle to generalize to noisy, domain-specific African speech, highlighting a critical gap for localized voice AI.

Busayo Awobade, Gabrial Zencha Ashungafac, Tobi Olatunji

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

2w ago

Cosmodoit: A Python Package for Adaptive, Efficient Pipelining of Feature Extraction from Performed Music

Stop wrestling with disparate tools and languages for music performance analysis: Cosmodoit offers a unified Python pipeline for efficient, large-scale feature extraction.

C. Guichaoua, D. Bedoya, Elaine Chew

Code Generation & Program Synthesis Speech & Audio

Khalid Zaman +3·2w ago

Deepfake Audio Detection Using Self-supervised Fusion Representations

Fusing speech and environmental sound representations with a novel matching head and cross-attention network significantly boosts deepfake audio detection, surpassing previous benchmarks.

Khalid Zaman, Qixuan Huang, Muhammad Uzair +1

Natural Language Processing Speech & Audio

Louis Lerbourg +3·2w ago

Smart Passive Acoustic Monitoring: Embedding a Classifier on AudioMoth Microcontroller

Dramatically extend the battery life of bioacoustic sensors by embedding a highly accurate CNN classifier directly on a microcontroller, enabling selective recording of target species.

Louis Lerbourg, Paul Peyret, Juliette Linossier +1

Inference & Quantization Speech & Audio

Lyonel Behringer +4·2w ago

Assessing the Impact of Noise and Speech Enhancement on the Intelligibility of Speech Codecs

Classical speech codecs still outperform neural codecs in noisy environments, but speech enhancement can close the gap.

Lyonel Behringer, Anna Leschanowsky, A. Rajasekhar +2

Natural Language Processing Speech & Audio

2w ago·also Hainan University

Enhancing Self-Supervised Talking Head Forgery Detection via a Training-Free Dual-System Framework

Even without retraining, a simple dual-system approach can significantly boost the performance of self-supervised talking head forgery detectors by refining the ordering of uncertain samples.

Ke Liu, Jiwei Wei, Shuchang Zhou +5

Computer Vision Red-Teaming & Adversarial Robustness Speech & Audio

Jing Gong·2w ago

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

Open-sourcing a 0.1B-scale speech-native omni model lets you directly inspect the complete interaction loop and reveals critical design choices for building effective small multimodal models.

Jing Gong

Multimodal Models Open-Source Models & Weights Speech & Audio

May 4, 2026

Ahsan Jamal Cheema·2w ago

Neck-Learn: Attention-Based Multiple Instance Learning and Ensemble Framework for Ecological Momentary Assessment

Neck-Learn's hybrid architecture, combining gradient-boosted trees and CNN-based multiple instance learning, unlocks improved ambulatory detection of vocal hyperfunction by preserving crucial temporal dynamics in voice data.

Ahsan Jamal Cheema

Architecture Design (Transformers, SSMs, MoE) Speech & Audio

Yadi Wen +4·2w ago

Private Speech Classification without Collapse: Stabilized DP Training and Offline Distillation

Strong differential privacy can cause speech classifiers to collapse into near-useless single-class predictors, but a two-stage training process involving distillation can stabilize training.

Yadi Wen, Tianxin Li, Enji Liang +2

Inference & Quantization Speech & Audio Training Efficiency & Optimization

Jiaxu He +6·2w ago

Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation

Transfer learning from a large, pre-trained speech synthesis model unlocks high-quality Tibetan TTS, even with limited Tibetan-specific data.

Jiaxu He, Chao Wang, Jie Lian +4

Data Curation & Synthetic Data Natural Language Processing Speech & Audio

Posts·2w ago·also Telecommunications Institute of Technology

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

Existing deepfake detectors crumble when faced with realistic, multi-region speech inpainting, leaving a gaping vulnerability that this work begins to address.

Tung Vu, Yen Nguyen, Hai Nguyen +2

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Speech & Audio

2w ago

Multi-Axis Speech Similarity via Factor-Partitioned Embeddings

Stop letting speaker identity drown out semantic similarity: this new embedding method lets you independently control the influence of different speech attributes when comparing utterances.

Jim O'Regan, Jens Edlund

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Speech & Audio

ETH·2w ago·also UZH

When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition

Despite the promise of multimodal context, current audio-language models struggle to leverage clinical information for dysarthric speech recognition, even degrading performance in some cases.

Pehuén Moure, Niclas Pokel, Bilal Bounajma +4

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

2w ago·also JHU, Monash

Dimensionality-Aware Anomaly Detection in Learned Representations of Self-Supervised Speech Models

Adversarial attacks on speech models leave tell-tale geometric fingerprints in early representation layers that can be detected without transcripts.

Sandra Arcos-Holzinger, Sarah M. Erfani, James Bailey +1

Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness Speech & Audio

Venkata Pushpak Teja Menta·2w ago

The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail

Synthetic data closes the Indic ASR gap where commercial and open-source systems fail, boosting entity recognition by up to 22x.

Venkata Pushpak Teja Menta

Data Curation & Synthetic Data Open-Source Models & Weights Speech & Audio

May 3, 2026

Central Conservatory of Music·2w ago·also Tsinghua AI

Khala: Scaling Acoustic Token Language Models Toward High-Fidelity Music Generation

Forget separate structure and fidelity models – Khala shows you can generate high-quality music with text-vocal alignment using a single acoustic-token hierarchy.

Jiafeng Liu, Yuanliang Dong, Hongjia Liu +8

Architecture Design (Transformers, SSMs, MoE) Scaling Laws & Emergent Abilities Speech & Audio

Huan Zhang +9·2w ago

RenCon 2025: Revival of the Expressive Performance Rendering Competition

Expressive piano performance rendering is improving, but RenCon 2025 reveals we're still far from replicating human musicality.

Huan Zhang, Taegyun Kwon, Anders Friberg +7

Eval Frameworks & Benchmarks Speech & Audio

2w ago·also Tsinghua AI, AgiBot

Spoken Language Identification with Pre-trained Models and Margin Loss

Margin loss fine-tuning of ECAPA-TDNNs slashes the error rate in spoken language identification by over 50%, highlighting the power of discriminative representation learning.

Zhihua Fang, Liang He, Weiwu Jiang

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Speech & Audio

Xiaoda Yang +12·2w ago

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

Current audio-visual models nail unimodal quality but still struggle to make music and dance move together rhythmically, highlighting a key gap TMD-Bench is designed to address.

Xiaoda Yang, Majun Zhang, Changhao Pan +10

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Xinmeng Xu +5·2w ago

Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning

Audio-visual models can be significantly improved by delaying perceptual commitment, correcting intermediate fusion states only when they have sufficient cross-layer and cross-modal support.

Xinmeng Xu, Haoran Xie, S. Joe Qin +3

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Speech & Audio

May 1, 2026

Venkata Pushpak Teja Menta·3w ago

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

Speaker embeddings leak script information, especially when projecting Western voices into Indic scripts, but LASE fixes this with a language-adversarial training objective.

Venkata Pushpak Teja Menta

Natural Language Processing Red-Teaming & Adversarial Robustness Speech & Audio
