April 20 – April 27, 2026

Speech & Audio - Weekly Roundup

65 papers published across 6 labs.

3500% acceleration

Selected Labs publishing this week

NVIDIA (2), Amazon Science (2), Microsoft Research (1), UW (1), Tsinghua AI (1)

Top Papers

Apr 23, 2026

Olivia Martin +1·Apr 23, 2026

Participation and Representation in Local Government Speech

Local democracy's "public" input is heavily skewed towards older, whiter, more male, more liberal homeowners, and even removing remote access doesn't fix it.

Olivia Martin, Amar Venugopal

Natural Language Processing Speech & Audio

Apr 27, 2026

NVIDIA·Apr 27, 2026·also Amazon Science, Microsoft Research, UW, Music X Lab +1

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

Multimodal models can now achieve state-of-the-art performance in real-world tasks like document understanding and audio-video comprehension with significantly reduced inference latency thanks to novel token-reduction techniques.

NVIDIA, Amala Sanjay Deshmukh, K. Chumachenko, Tuomas Rintamaki +209

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Speech & Audio

Liang Xu +4·Apr 27, 2026

Speech Enhancement Based on Drifting Models

Ditch the slow sampling: DriftSE achieves state-of-the-art speech enhancement in a single step, outperforming diffusion models with a novel equilibrium-based approach.

Liang Xu, Diego Caviedes-Nozal, B. Kleijn +2

Speech & Audio Training Efficiency & Optimization

Leekyung Kim +1·Apr 27, 2026

An Event-Based Sequence Modeling Approach to Recognizing Non-Triad Chords with Oversegmentation Minimization

Segmenting music into meaningful chunks and predicting chords sequence-to-sequence boosts recognition accuracy, especially for those pesky, rare non-triad chords that plague existing systems.

Leekyung Kim, Jonghun Park

Natural Language Processing Speech & Audio

Apr 27, 2026·also SJTU

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

ASR systems can now be more trustworthy: this work shows how to train them to abstain from transcribing uncertain segments, leading to more reliable outputs.

Wen-Chin Huang, Yuhang Qiu, Bohan Li +5

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

All Papers (65)

Apr 27, 2026

NVIDIA·Apr 27, 2026·also Amazon Science, Microsoft Research, UW, Music X Lab +1

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

NVIDIA, Amala Sanjay Deshmukh, K. Chumachenko, Tuomas Rintamaki +209

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Speech & Audio

Liang Xu +4·Apr 27, 2026

Speech Enhancement Based on Drifting Models

Ditch the slow sampling: DriftSE achieves state-of-the-art speech enhancement in a single step, outperforming diffusion models with a novel equilibrium-based approach.

Liang Xu, Diego Caviedes-Nozal, B. Kleijn +2

Speech & Audio Training Efficiency & Optimization

Leekyung Kim +1·Apr 27, 2026

An Event-Based Sequence Modeling Approach to Recognizing Non-Triad Chords with Oversegmentation Minimization

Segmenting music into meaningful chunks and predicting chords sequence-to-sequence boosts recognition accuracy, especially for those pesky, rare non-triad chords that plague existing systems.

Leekyung Kim, Jonghun Park

Natural Language Processing Speech & Audio

Apr 27, 2026·also SJTU

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

ASR systems can now be more trustworthy: this work shows how to train them to abstain from transcribing uncertain segments, leading to more reliable outputs.

Wen-Chin Huang, Yuhang Qiu, Bohan Li +5

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Apr 27, 2026

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

Audio-Language models are cheating on benchmarks, acing tests even when they barely listen.

Leonardo Haw-Yang Foo, Chih-Kai Yang, Chen-An Li +2

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Institut Polytechnique de Paris·Apr 27, 2026

S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

Shrinking massive audio foundation models by up to 61x is now possible without significant performance loss, thanks to a novel self-supervised distillation approach that works directly on embeddings.

Mohammed Ali El Adlouni, Aurian Quelennec, Pierre Chouteau +2

Inference & Quantization Speech & Audio Training Efficiency & Optimization

Apr 26, 2026

Zhen Ye +10·Apr 26, 2026

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Disentangling high-level cross-modal reasoning from low-level modality-specific refinement in talking head generation yields superior lip-sync accuracy, video quality, and audio quality compared to entangled approaches.

Zhen Ye, Xu Tan, Aoxiong Yin +8

Computer Vision Multimodal Models Speech & Audio

Apr 23, 2026

Jialong Mai +2·Apr 23, 2026

MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

Finally, a TTS system that lets you control the *exact* timing and pauses of individual words, opening the door to applications like perfectly paced guided reading and accessible code narration.

Jialong Mai, Xiaofen Xing, Xiangmin Xu

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Speech & Audio

Eli Gildish +2·Apr 23, 2026

Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach

Achieve state-of-the-art periodic signal denoising with a single, lightweight dilated CNN that generalizes across frequencies via resampling.

Eli Gildish, Michael Grebshtein, I. Makienko

Architecture Design (Transformers, SSMs, MoE) Speech & Audio Training Efficiency & Optimization

Apr 23, 2026

Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition

Counterintuitively, scaling up LLM decoders in speech recognition doesn't guarantee fairness; audio encoder design matters more, as Whisper's pathological hallucinations on Indian-accented speech and repetition loops under masking demonstrate.

Srishti Ginjala, E. Fosler-Lussier, Christopher Myers +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Speech & Audio

Tasnim Kabir +5·Apr 23, 2026

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

SOTA audio QA models are getting punked by trivia questions a toddler could answer, revealing a stark gap between current capabilities and true audio understanding.

Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar +3

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Breno Matos +4·Apr 23, 2026

Misinformation Span Detection in Videos via Audio Transcripts

Pinpointing exactly *when* misinformation occurs in videos is now possible, thanks to two new datasets and a strong baseline for misinformation span detection.

Breno Matos, Rennan C. Lima, Savvas Zannettou +2

Multimodal Models Natural Language Processing Speech & Audio

B. Muller +2·Apr 23, 2026

Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers

Surprisingly, how speech degrades due to diseases like Parkinson's and ALS follows consistent patterns across languages, offering a universal fingerprint for these conditions.

B. Muller, Antonio Armando Ortiz Barranón, L. Roberts

Natural Language Processing Speech & Audio

Srija Anand +15·Apr 23, 2026

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

Forget English – this study reveals which TTS systems truly resonate with native speakers across ten diverse Indian languages, pinpointing specific perceptual dimensions that drive preference.

Srija Anand, Ashwin Sankar, Ishvinder Sethi +13

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Nikhil Raghav·Apr 23, 2026

DiariZen Explained: A Tutorial for the Open Source State-of-the-Art Speaker Diarization Pipeline

Demystifying state-of-the-art speaker diarization just got easier: this tutorial breaks down the DiariZen pipeline block-by-block, complete with code, tensor shapes, and visualizations.

Nikhil Raghav

Natural Language Processing Open-Source Models & Weights Speech & Audio

Adam Štefunko +1·Apr 23, 2026

Beyond Rules: Towards Basso Continuo Personal Style Identification

Basso continuo, a centuries-old improvised accompaniment, isn't just about following the rules – AI can now identify individual players by their unique stylistic fingerprints.

Adam Štefunko, Jan Hajič

Natural Language Processing Speech & Audio

Chengyou Wang +8·Apr 23, 2026

Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

Current spoken dialogue systems struggle with the nuances of human conversation, but a new benchmark offers a path to more natural interactions by focusing on handling interruptions and overlapping speech.

Chengyou Wang, Hong Yue, Guojian Li +6

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Technische Hochschule Nürnberg Georg Simon Ohm·Apr 23, 2026

Time vs. Layer: Locating Predictive Cues for Dysarthric Speech Descriptors in Wav2vec 2.0

Turns out where you look in Wav2vec 2.0's representations *really* matters: intelligibility lives in the layers, while articulation problems hide in the time steps.

Natalie Engert, Dominik Wagner, K. Riedhammer +1

Interpretability & Mechanistic Interp Natural Language Processing Speech & Audio

Olivia Martin +1·Apr 23, 2026

Participation and Representation in Local Government Speech

Local democracy's "public" input is heavily skewed towards older, whiter, more male, more liberal homeowners, and even removing remote access doesn't fix it.

Olivia Martin, Amar Venugopal

Natural Language Processing Speech & Audio

Mirage Mountain Technologies Inc·Apr 23, 2026

Listen and Chant Before You Read: The Ladder of Beauty in LM Pre-Training

Forget text-only pre-training: training on music *first* can dramatically accelerate language learning in small language models.

Yoshinori Nomura

Architecture Design (Transformers, SSMs, MoE) Speech & Audio Training Efficiency & Optimization

Apr 23, 2026·also Le Mans University, LIA - Avignon University

Evaluation of Automatic Speech Recognition Using Generative Large Language Models

LLMs can judge speech recognition quality with near-human accuracy, blowing away traditional metrics like Word Error Rate.

Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil +6

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Apr 22, 2026

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

Current audio-language models are surprisingly bad at controlling and interpreting subtle vocal cues, failing in nearly half of situational dialogue scenarios.

Ruohan Liu, Shukang Yin, Dong Zhang +5

Eval Frameworks & Benchmarks Speech & Audio

Apr 22, 2026·also China Conservatory of Music, NTU

ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

Current omnimodal models may excel in perceptual tasks but fundamentally misunderstand music theory, exposing critical reasoning flaws.

Menghe Ma, Siqing Wei, Yaheng Wang +4

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Magdalena Gołębiowska +1·Apr 22, 2026

Enhancing Speaker Verification with Whispered Speech via Post-Processing

Speaker verification systems can be made significantly more robust to whispered speech by using a simple encoder-decoder architecture and a joint training objective.

Magdalena Gołębiowska, Piotr Syga

Natural Language Processing Speech & Audio

Tong Zhao +3·Apr 22, 2026

ATIR: Towards Audio-Text Interleaved Contextual Retrieval

Tong Zhao, Chenghao Zhang, Yutao Zhu +1

Multimodal Models Recommendation & Information Retrieval Speech & Audio

Apr 22, 2026·also & UAE, CUHK, Edinburgh, University of Aveiro

Aligning Stuttered-Speech Research with End-User Needs: Scoping Review, Survey, and Guidelines

Stuttered-speech research is missing the mark: a new study reveals a significant mismatch between current research priorities and the actual needs of people who stutter.

Hawau Olamide Toyin, Mutiah Apampa, Toluwani Aremu +6

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

FindLab·Apr 22, 2026

From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR

Finally, a practical OMR system can handle complex polyphonic music, like piano scores, by intelligently decoding visual symbols into editable scores.

Shiheng Li, Shengchao Hou

Computer Vision Multimodal Models Speech & Audio

Tianshui Chen +5·Apr 22, 2026

Learning Spatial-Temporal Coherent Correlations for Speech-Preserving Facial Expression Manipulation

Speakers expressing the same content with different emotions exhibit surprisingly consistent spatial-temporal correlations in their local facial animations, unlocking a new approach to speech-preserving facial expression manipulation.

Tianshui Chen, Jianman Lin, Zhijing Yang +3

Computer Vision Multimodal Models Speech & Audio

Apr 22, 2026

Before the Mic: Physical-Layer Voiceprint Anonymization with Acoustic Metamaterials

A 3D-printable acoustic metamaterial can scramble your voiceprint at the physical layer, protecting your identity even when microphones are compromised.

Zhiyuan Ning, Zhanyong Tang, Xiaojiang Chen

Speech & Audio

P. A. Bereuter +1·Apr 22, 2026

Embedding-Based Intrusive Evaluation Metrics for Musical Source Separation Using MERT Representations

Ditch your old MSS evaluation metrics: MERT-based embeddings correlate far better with human perception.

P. A. Bereuter, Alois Sontacchi

Eval Frameworks & Benchmarks Recommendation & Information Retrieval Speech & Audio

Apr 21, 2026

Lam Pham +10·Apr 21, 2026

Environmental Sound Deepfake Detection Using Deep-Learning Framework

Separating sound scene and sound event deepfake detection as individual tasks dramatically improves performance, paving the way for more robust audio forensics.

Lam Pham, Khoi Vu, Khoi D. Vu +8

Computer Vision Speech & Audio

Feiyu Zhao +6·Apr 21, 2026·also TJU

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

LALMs struggle to ground their responses in audio, exhibiting surprising failures in temporal reasoning and music understanding that HalluAudio exposes.

Feiyu Zhao, Yiming Chen, Wenhuan Lu +4

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

NVIDIA·Apr 21, 2026

Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

Bridging the offline-streaming gap in ASR is now more achievable: a single RNN-Transducer model can deliver high accuracy in both settings, thanks to a novel consistency regularization technique.

A.S. Andrusenko, Vladimir Bataev, Lilit Grigoryan +3

Architecture Design (Transformers, SSMs, MoE) Speech & Audio Training Efficiency & Optimization

Kaushal Bhogale +22·Apr 21, 2026

Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

Current ASR systems stumble significantly when faced with the nuances of real-world Indian speech, as revealed by a new benchmark exposing geographic performance disparities and the impact of audio quality, speaking rate, and device type.

Kaushal Bhogale, Kaushal Bhogale, Manas Dhir +20

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Apr 21, 2026

MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

Finally, you can puppeteer both the sights and sounds of AI-generated characters, controlling their identity, voice, pose, and scene with unprecedented precision.

Liyang Li, Wen Wang, Canyu Zhao +2

Computer Vision Multimodal Models Speech & Audio

Swansea University·Apr 21, 2026·also University of Hertfordshire

Achieving Interaction Fluidity in a Wizard-of-Oz Robotic System: A Prototype for Fluid Error-Correction

Current human-robot interaction feels clunky because we lack the right development tools, so this work introduces a VR-based platform designed from the ground up to enable fluid error correction in Wizard-of-Oz robotic systems.

C. Lima, Carlos Baptista De Lima, Julian Hough +4

Natural Language Processing Robotics & Embodied AI Speech & Audio

Waldek Maciejko·Apr 21, 2026

Audio Spoof Detection with GaborNet

GaborNet, a Gabor filter-based front-end for raw audio processing, significantly boosts audio spoof detection accuracy in RawNet2 and RawGAT-ST architectures.

Waldek Maciejko

Architecture Design (Transformers, SSMs, MoE) Speech & Audio

Amazon Science·Apr 21, 2026

Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps

SpeechLLMs' hallucinations betray themselves in their attention patterns, offering a new way to detect these errors without needing expensive human-labeled data.

Jonas Waldendorf, Bashar Awwad Shiekh Hasan, Evgenii Tsymbalov

Interpretability & Mechanistic Interp Natural Language Processing Speech & Audio

Apr 21, 2026·also FU Berlin, Luxembourg

Smiling Regulates Emotion During Traumatic Recollection

Smiling during traumatic recollection not only occurs in moments of distress but actively enhances emotional recovery and narrative coherence.

Marcus Ma, E. Zhou, Leon Ludwig +7

Computer Vision Natural Language Processing Speech & Audio

Jeffrey R. Boland +1·Apr 21, 2026

Tonnetz Theory, Classical Harmony, and the Combinatorial Geometry of Abstract Musical Resources

Music theory meets math: combinatorial geometry provides a surprisingly elegant framework for understanding and generating musical structures, from classical harmonies to 12-tone systems.

Jeffrey R. Boland, L. Hughston

Speech & Audio

Tsinghua AI·Apr 21, 2026·also CUHK

Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

Autoregressive generative models, previously unsuitable for real-time target speaker extraction, can now achieve offline-level performance in streaming scenarios thanks to a novel chunk-wise splicing technique.

Shuhai Peng, Hui Lu, Jinjiang Liu +8

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Speech & Audio

Lekai Qian +4·Apr 21, 2026

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

Forget complex event sequences: tokenizing music by uniform temporal beats unlocks better musical quality and structural coherence in generated music.

Lekai Qian, Haoyu Gu, Haoyue Gu +2

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Speech & Audio

Apr 21, 2026·also Gachon University

Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean

Seoul Korean pitch accent classification achieves state-of-the-art results by learning F0 contour representations with deep supervised contrastive learning, despite the inherent variability in real-world speech.

Hyunjung Joo, GyeongTaek Lee, Gyeong-Myeong Lee

Natural Language Processing Speech & Audio

Apr 21, 2026·also Guangdong University of Technology

ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis

Finally, anime avatars can convincingly express a full range of emotions without losing their unique vocal identity.

Ao Li, Haoran Lv, Shengming Li +3

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Speech & Audio

Girish +3·Apr 21, 2026·also IIIT-Delhi

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

Girish, Mohd Mujtaba Akhtar, Orchid Chetia Phukan +1

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Speech & Audio

Yadong Li +3·Apr 21, 2026

UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

Ditch the clunky pipeline: a single LLM can now handle all your audio front-end needs, slashing latency and boosting accuracy in full-duplex speech interactions.

Yadong Li, Guoxin Wu, Haiping Hou +1

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Natural Language Processing +1

Jianbo Ma +1·Apr 21, 2026

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

Achieve state-of-the-art TTS with significantly fewer parameters by explicitly modeling temporal dynamics in a cascaded architecture that implicitly handles phonetic planning.

Jianbo Ma, Richard Cartwright

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Speech & Audio

Hirotaka Obo +3·Apr 21, 2026

Self-Noise Reduction for Capacitive Sensors via Photoelectric DC Servo: Application to Condenser Microphones

Hirotaka Obo, Atsushi Tsuchiya, T. Ebihara +1

Robotics & Embodied AI Scientific Discovery & Drug Design Speech & Audio

Faisal Alherran·Apr 21, 2026

Tadabur: A Large-Scale Quran Audio Dataset

Finally, a dataset large and diverse enough to train robust models for Quranic speech research.

Faisal Alherran

Data Curation & Synthetic Data Natural Language Processing Speech & Audio

Apr 20, 2026

Apr 20, 2026·also Changsha University of Science and Technology

FreezeEmpath: Efficient Training for Empathetic Spoken Chatbots with Frozen LLMs

FreezeEmpath achieves superior empathetic dialogue capabilities without the need for costly finetuning, relying instead on frozen LLMs and existing data.

Yun Hong, Yan Zhou

Natural Language Processing Speech & Audio

Li Ya +6·Apr 20, 2026·also Hainan Normal University, Rice

A novel LSTM music generator based on the fractional time-frequency feature extraction

Time-frequency feature extraction via fractional Fourier transform unlocks surprisingly high-quality music generation from LSTMs.

Li Ya, Chen Wei, Xiulai Li +4

Architecture Design (Transformers, SSMs, MoE) Speech & Audio

MILES Team·Apr 20, 2026·also ESPCI PSL, GETALP Team, Grenoble INP, LAMSADE +5

Where Do Self-Supervised Speech Models Become Unfair?

Bias against certain speaker groups is embedded in self-supervised speech models from the very first layers, complicating efforts to achieve fairness in speech recognition tasks.

Felix Herron, Maja Hjuler, Solange Rossato +2

Constitutional AI & AI Ethics Speech & Audio

Apr 20, 2026·also University of Jena

Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages

Phoneme recognition accuracy in low-resource languages hinges more on data availability than phonological complexity, revealing critical insights for ASR model development.

V. S. D. S. Mahesh Akavarapu, Michael Daniel, Gerhard Jäger

Data Curation & Synthetic Data Natural Language Processing Speech & Audio

Jiaqi Song +11·Apr 20, 2026·also UC Santa Cruz

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

LLM-based ASR can be shrunk to 2.3B parameters and still beat larger models in real-world scenarios by carefully delineating encoder and LLM roles and using a multi-stage training approach.

Jiaqi Song, Guang Qiu, Guanghui Qiu +9

Inference & Quantization Natural Language Processing Scaling Laws & Emergent Abilities +1

Apr 20, 2026

CanonSLR: Canonical-View Guided Multi-View Continuous Sign Language Recognition

CanonSLR achieves unprecedented robustness in sign language recognition by effectively bridging the gap between frontal and non-frontal viewpoints.

Shengeng Tang, Wan Jiang, Yaxiong Wang +2

Computer Vision Multimodal Models Speech & Audio

Huakang Chen +15·Apr 20, 2026·also Lingguang Zhaxian Technology, Northwestern, Yutu Zhineng

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

Open-source TTS models can beat commercial systems in specific languages, but current instruction-following TTS still struggles with complex instructions like nuanced paralinguistic controls.

Huakang Chen, Jingbin Hu, Liumeng Xue +13

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Chenqian Le +6·Apr 20, 2026

Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features

SPARC features unlock more accurate and interpretable sEMG-based silent speech modeling compared to traditional phoneme representations.

Chenqian Le, Ruisi Li, Beatrice Fumagalli +4

Natural Language Processing Scientific Discovery & Drug Design Speech & Audio

Haoming Meng +5·Apr 20, 2026·also Zuoyebang Education Technology

Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints

LLMs can learn musicality without human annotation by aligning them to automatically generated preference datasets derived from rule-based musical constraints.

Haoming Meng, Hao Meng, Siyuan Zheng +3

Natural Language Processing Speech & Audio

Pengcheng Laboratory·Apr 20, 2026·also HIT

APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track

By explicitly verifying the visual existence of spoken references before segmentation, APRVOS substantially improves robustness in noisy audio-conditioned Ref-VOS, outperforming standard pipelines.

Deshui Miao, Yameng Gu, Chao Yang +1

Computer Vision Multimodal Models Speech & Audio

Xiang He +7·Apr 20, 2026

Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

Forget supervised fine-tuning: RL alone can unlock high-quality chain-of-thought reasoning in audio-language models, even starting from a model with no prior CoT capability.

Xiang He, Chenxing Li, Jinting Wang +5

Reasoning & Chain-of-Thought RLHF & Preference Learning Speech & Audio

Haejun Yoo +6·Apr 20, 2026·also Sogang University

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

Multimodal LLMs aren't just for generation: they can dramatically improve audio-text retrieval robustness, especially when handling complex, real-world queries and acoustically similar distractors.

Haejun Yoo, Yongseop Shin, Yong-Joo Shin +4

Multimodal Models Recommendation & Information Retrieval Speech & Audio

MIT CSAIL·Apr 20, 2026·also UC San, UCSD

Latent Fourier Transform

Control the groove: a latent-space Fourier transform lets you remix and blend musical styles by directly manipulating the frequency components of musical structure.

Mason Wang, Cheng-Zhi Anna Huang, C. Huang

Architecture Design (Transformers, SSMs, MoE) Speech & Audio

Ho-Lam Chung +2·Apr 20, 2026

LLM-Codec: Neural Audio Codec Meets Language Model Objectives

Bridging the gap between audio reconstruction and language modeling objectives yields neural audio codecs that are both more acoustically faithful and linguistically predictable.

Ho-Lam Chung, Yiming Chen, Hung-yi Lee

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Speech & Audio

Riccardo Casciotti +5·Apr 20, 2026·also PoliMi, Tampere

Incremental learning for audio classification with Hebbian Deep Neural Networks

Hebbian learning, often relegated to theory, can actually boost accuracy and stability in incremental audio classification tasks by selectively tuning network kernels.

Riccardo Casciotti, Francesco De Santis, Alberto Antonietti +3

Architecture Design (Transformers, SSMs, MoE) Speech & Audio Training Efficiency & Optimization

B. K. Johnson +4·Apr 20, 2026

Streaming Structured Inference with Flash-SemiCRF

Flash-SemiCRF slashes memory requirements for segment-level inference, making it feasible for genomic sequences over 100,000 positions.

B. K. Johnson, T. Goralski, Ayush Semwal +2

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Speech & Audio