Search papers, labs, and topics across Lattice.
42 papers published across 2 labs.
Forget tedious manual editing: CutClaw's multi-agent system can automatically transform hours of raw footage into engaging, rhythm-aligned short videos.
Real-time vocal denoising is now possible with deep learning, achieving significant SNR improvements at under 10ms latency.
Northern Kurdish finally gets its due with FLEURS-Kobani, a new benchmark dataset that exposes the challenges and opportunities for ASR and speech translation in this under-resourced language.
Global speech slowing, a common strategy for improving intelligibility, is outperformed by targeted, data-driven speech rate adjustments that listeners don't even consciously notice.
LLMs can classify dialects with surprising accuracy when given linguistic hints, suggesting a new way to leverage their knowledge for low-resource language tasks.
Thiomi slashes Swahili ASR error rates by 61% and unlocks nine more African languages for multimodal AI, proving community-driven data collection can leapfrog existing benchmarks.
LLMs can achieve state-of-the-art multilingual speech recognition by smartly handling noisy phoneme inputs, even with severe data imbalance across languages.
The first publicly available dataset for Syrian Arabic Sign Language (SyArSL) opens the door for machine translation research to improve accessibility for a historically underserved community.
Current multimodal dialogue models struggle to capture the nuanced expressiveness of human interaction, but a new dataset and benchmark reveal exactly where they fall short.
Turn monaural video into immersive binaural audio with SIREN, a visually-guided framework that learns spatial audio cues without task-specific annotations.
Forget "spread" voicings: skewness is the key to clarity in piano chords, offering a fresh perspective on psychoacoustic principles.
Ditching mel-spectrograms unlocks surprisingly better text-to-speech, as LongCat-AudioDiT proves that waveform latent diffusion can beat the state-of-the-art in zero-shot voice cloning.
By disentangling speakers earlier in the process, SR-CorrNet avoids the information bottleneck that plagues existing speech separation models, leading to improved performance in challenging acoustic environments.
State-of-the-art Large Audio Language Models are surprisingly vulnerable to hallucination attacks, with success rates as high as 95%, revealing a critical reliability gap masked by standard benchmarks.
Arabic mispronunciation detection just got a whole lot better: F1-scores jumped by 0.28 thanks to novel architectures and a new dataset of authentic mispronunciations.
SONAR can "see" road damage and surface material even when cameras and LiDAR are blinded by rain or fog.
VLMs can unlock insights from troves of historical documents previously inaccessible due to OCR limitations, achieving state-of-the-art transcription and speaker tagging of Italian parliamentary speeches.
You can slash 7-14% of parameters from your SLAM-ASR system by pruning the Whisper encoder and using LoRA, even outperforming the original model in some cases.
Voice control, previously insufficient for block-based programming, can now enable children with motor disabilities to effectively use Scratch, thanks to a novel multi-stage speech recognition pipeline.
Forget disjointed workflows: AutoCut's unified token space for video, audio, and text slashes ad production costs while boosting consistency.
Diffusion models can now reliably fill in the gaps in real-world spatial audio data, boosting the performance of microphone arrays.
Unlock a complete picture of vocal tract articulation from speech using MRI data, surpassing the limitations of traditional sensor-based methods.
Forget hand-tuning for each language: this recipe achieves state-of-the-art phone recognition across 100+ languages, revealing the surprising power of scaling data and SSL representations.
Style-controllable speech synthesis just got a major upgrade: ParaSpeechCLAP lets you dial in nuanced speaker traits and situational contexts far beyond what existing models can handle.
Finally, a way to represent the messy, collaborative syntax of real spoken language in treebanks.
Now you can turn a single image into a navigable 3D world complete with spatial audio, opening the door to richer immersive experiences.
Evolving interpretable composite features via Genetic Programming beats black-box deep learning at music tagging, revealing synergistic interactions and transformations that boost performance.
Scale expert know-how in tool-intensive industrial workflows with a voice-guided system that cuts process time and boosts repeatability.
VR telepresence in Wizard-of-Oz studies doesn't just feel more immersive; it fundamentally changes the interaction dynamics, fostering stronger social connections and more natural conversational flow than traditional GUI-based interfaces.
Grounding audio language models with acoustic feature representations unlocks more accurate and explainable deepfake detection, even with smaller models.
Achieve competitive speech enhancement with a highly compact (85-parameter) probabilistic model that continuously adapts to the user and environment, suggesting a path toward truly personalized and adaptive hearing aids.
LALMs leak speaker identity by memorizing the link between voice and text, not just the content of speech.
Cinematic speech data unlocks more realistic and controllable voice generation from natural language descriptions.
Skip expensive human ratings: this hierarchical multimodal model accurately predicts human perception of AI-dubbed content quality using only audio, video, and text inputs.
VAANI's open-sourced dataset offers unprecedented coverage of India's linguistic landscape, finally giving researchers the data needed to build truly inclusive speech models.
Ditch the grid: BiFormer3D uses a spatial-encoding Transformer to reconstruct personalized 3D audio from sparse measurements, outperforming prior art without relying on frequency-domain hacks or minimum-phase assumptions.
LALMs still struggle to truly "hear" music, as revealed by a new expert-curated benchmark that exposes their reliance on non-musical shortcuts.
LLMs can achieve state-of-the-art audio-visual segmentation without any training by using a multi-agent system that explicitly reasons about expression difficulty and validates segmentation results.
Despite progress, accurately transcribing music with multiple instruments, complex polyphony, and diverse timbres remains a significant hurdle for AI.
LALMs struggle more with "hearing" the evidence than with reasoning about it, and EvA's evidence-first fusion architecture proves it.
Balancing the diversity of real and AI-generated speech data is the key to building deepfake detectors that actually generalize.
Turns out your fancy speech recognition model might stumble after a workout: performance degrades significantly on post-exercise speech, and the best model varies depending on whether you fine-tune it.