Speech & Audio
Applications: Speech recognition, text-to-speech, audio generation, music AI, and spoken language understanding.
Recent Papers
The paper introduces Moonshine v2, an ergodic streaming encoder ASR model designed for latency-critical speech applications, particularly on resource-constrained edge devices. It addresses the latency issues of full-attention Transformer encoders by employing sliding-window self-attention, enabling bounded, low-latency inference while maintaining strong local context. Experiments demonstrate that Moonshine v2 achieves state-of-the-art word error rates on standard benchmarks, matching the accuracy of models six times larger while running significantly faster.
Introduces an ergodic streaming encoder ASR model, Moonshine v2, that uses sliding-window self-attention to achieve low-latency and high accuracy for on-device speech recognition.
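The core mechanism here, sliding-window self-attention, can be pictured as a banded attention mask; the window size, single-head formulation, and tensor shapes below are illustrative assumptions rather than Moonshine v2's actual configuration.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where position i may only attend to positions within
    `window` steps of i (True = attention allowed)."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

def windowed_self_attention(x: torch.Tensor, window: int = 16) -> torch.Tensor:
    """Single-head scaled dot-product attention restricted to a local window.
    x: (batch, seq_len, dim)."""
    b, t, d = x.shape
    scores = (x @ x.transpose(1, 2)) / d ** 0.5           # (b, t, t)
    mask = sliding_window_mask(t, window).to(x.device)    # (t, t), broadcast over batch
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ x

y = windowed_self_attention(torch.randn(2, 128, 64))      # bounded context -> bounded latency
```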
The paper introduces WavBench, a new benchmark for end-to-end spoken dialogue models that evaluates reasoning, colloquialism, and paralinguistics, addressing limitations of existing text-centric benchmarks. WavBench comprises three subsets: Pro (reasoning), Basic (colloquialism), and Acoustic (paralinguistics), designed to assess complex problem-solving, natural language fluency, and nuanced understanding/generation of acoustic cues. Evaluation of five state-of-the-art models using WavBench reveals critical insights into model performance across these dimensions, highlighting areas for improvement in building more robust spoken dialogue agents.
Introduces WavBench, a novel benchmark dataset and evaluation toolkit designed to comprehensively assess reasoning, colloquialism, and paralinguistic capabilities in end-to-end spoken dialogue models.
The paper introduces Cross-Modal Robustness Transfer (CMRT) to improve the robustness of End-to-End Speech Translation (E2E-ST) models against morphological variations. CMRT leverages adversarial training in the text modality to transfer robustness to the speech modality, eliminating the need for computationally expensive adversarial speech data generation. Experiments across four language pairs show that CMRT improves adversarial robustness by over 3 BLEU points compared to baseline E2E-ST models.
Introduces Cross-Modal Robustness Transfer (CMRT), a novel framework for enhancing E2E-ST model robustness by transferring adversarial robustness from text to speech.
The paper investigates modality arbitration in Audio-LLMs, revealing a strong bias towards text over audio when the two modalities conflict, even when audio quality is superior. Using the ALME benchmark, the authors demonstrate that Gemini 2.0 Flash exhibits significantly higher text dominance in audio-text conflicts compared to text-text conflicts. They propose that this text dominance arises from an asymmetry in arbitration accessibility rather than information content, and provide evidence through interventions like forced transcription and fine-tuning ablations.
Reveals and analyzes a significant text dominance bias in audio-LLMs during modality arbitration, attributing it to differences in representational accessibility rather than information content.
The paper investigates why speech recognition models fail to transcribe U.S. street names, finding a 44% error rate across 15 models from major vendors and disproportionately larger routing-distance errors for speakers whose primary language is not English. It highlights the gap between benchmark performance and real-world reliability, particularly for high-stakes tasks involving named entities. The authors then demonstrate that fine-tuning with a small, synthetically generated dataset of diverse pronunciations improves street-name transcription accuracy by nearly 60% for non-English primary speakers.
Demonstrates that speech recognition models exhibit significant transcription errors on street names, particularly impacting non-English speakers, and mitigates this issue through synthetic data augmentation.
The paper introduces DreamID-Omni, a unified framework for human-centric audio-video generation, addressing tasks like reference-based generation, video editing, and audio-driven animation within a single model. It tackles the challenge of disentangling character identities and voice timbres by employing a Dual-Level Disentanglement strategy and a Symmetric Conditional Diffusion Transformer. Experimental results demonstrate state-of-the-art performance in video, audio, and audio-visual consistency, surpassing even proprietary commercial models.
Introduces a unified framework, DreamID-Omni, that achieves state-of-the-art performance on a range of human-centric audio-video generation tasks by disentangling identity and timbre control.
The paper introduces WaveFormer, a transformer architecture tailored for biomedical signal classification, addressing the limitations of standard transformers in capturing multi-scale frequency patterns in long sequences. WaveFormer incorporates wavelet decomposition at two stages: embedding construction, via a multi-channel DWT, and positional encoding, via Dynamic Wavelet Positional Encoding (DyWPE). Experiments across eight datasets for human activity recognition and brain signal analysis demonstrate WaveFormer's competitive performance by effectively integrating frequency-domain information.
Introduces a novel transformer architecture, WaveFormer, that integrates wavelet decomposition into both the embedding and positional encoding stages to improve biomedical signal classification.
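A minimal sketch of wavelet-based embedding construction using PyWavelets follows; the wavelet family, decomposition level, and per-band statistics are assumptions for illustration, not WaveFormer's exact design.

```python
import numpy as np
import pywt

def dwt_embedding(signal: np.ndarray, wavelet: str = "db4", level: int = 3) -> np.ndarray:
    """Decompose each channel of a (channels, time) signal with a multi-level DWT
    and concatenate per-band summary statistics into a feature vector."""
    feats = []
    for channel in signal:
        coeffs = pywt.wavedec(channel, wavelet, level=level)  # [cA_L, cD_L, ..., cD_1]
        for band in coeffs:
            feats.extend([band.mean(), band.std()])
    return np.asarray(feats, dtype=np.float32)

x = np.random.randn(3, 1024)   # e.g. a 3-channel biomedical recording
emb = dwt_embedding(x)         # fixed-length, frequency-aware embedding
```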
The paper introduces A²V-SLP, an alignment-aware variational framework for sign language production that learns disentangled latent distributions for each articulator. This approach uses a disentangled VAE to encode sign pose sequences and extract articulator-specific mean and variance vectors, which then serve as distributional supervision for a non-autoregressive Transformer that predicts latent means and log-variances from text embeddings. By employing stochastic sampling and a gloss attention mechanism, A²V-SLP achieves state-of-the-art back-translation performance and enhances motion realism in gloss-free sign language production.
Introduces an alignment-aware variational framework (A²V-SLP) that learns disentangled latent distributions for sign language production, improving back-translation performance and motion realism.
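The distributional supervision described above comes down to predicting a mean and log-variance per articulator and sampling with the reparameterization trick; the sketch below shows that step only, with invented tensor shapes.

```python
import torch

def sample_articulator_latents(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Reparameterized sampling z = mu + sigma * eps, applied independently per
    articulator. mu, logvar: (batch, num_articulators, latent_dim)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

# e.g. latents for hands, face, and body predicted from text embeddings
mu = torch.zeros(4, 3, 64)
logvar = torch.full((4, 3, 64), -1.0)
z = sample_articulator_latents(mu, logvar)   # stochastic sampling adds motion variety
```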
The paper introduces Trans-Chunk BiMamba (TC-BiMamba), a novel architecture for unified streaming and non-streaming automatic speech recognition (ASR) that addresses the limitations of existing BiMamba-based streaming methods which are restricted to fixed chunk sizes. TC-BiMamba employs a trans-chunk mechanism to train bidirectional sequences offline with dynamic chunk sizes, enabling a single model to handle both offline and streaming decoding with varying latency requirements. Experiments demonstrate that TC-BiMamba achieves a 1.3x training speedup, reduces memory consumption by 50%, and improves ASR performance compared to chunk-wise processing, while also outperforming U2++ and matching LC-BiMamba with a smaller model size.
Introduces the Trans-Chunk BiMamba (TC-BiMamba) architecture, enabling efficient dynamic chunk size training for unified streaming and non-streaming ASR.
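One way to picture dynamic chunk sizes is to re-sample the chunk length every training step so a single model is exposed to many latency settings; the sketch below shows only that generic chunking idea, not TC-BiMamba's actual trans-chunk mechanism over bidirectional state-space sequences.

```python
import random
import torch

def dynamic_chunks(x: torch.Tensor, min_chunk: int = 8, max_chunk: int = 64):
    """Split a (batch, time, dim) sequence into chunks whose size is sampled anew
    each call, so one model sees many latency settings during training."""
    chunk = random.randint(min_chunk, max_chunk)
    return torch.split(x, chunk, dim=1), chunk

x = torch.randn(2, 200, 80)              # e.g. a batch of log-mel frames
chunks, chunk_size = dynamic_chunks(x)
# The forward pass runs left-to-right over chunks; backward context stays within a
# chunk, so the same weights serve both streaming (small chunks) and offline decoding.
```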
This paper introduces the concept of "musical metamerism," analogous to visual metamerism, where dissimilar audio waveforms produce similar auditory sensations. The authors present a method for generating musical metamers from audio recordings using joint time-frequency scattering (JTFS) implemented in the Kymatio Python library. The method's key advantage is its lack of reliance on manual preprocessing steps like transcription or source separation.
Introduces a novel method for generating musical metamers using joint time-frequency scattering, eliminating the need for manual audio preprocessing.
The paper introduces SLD-L2S, a novel lip-to-speech (L2S) framework based on a hierarchical subspace latent diffusion model that directly maps visual lip movements to the continuous latent space of a pre-trained neural audio codec, bypassing intermediate representations. The method employs a hierarchical architecture with parallel subspaces and a diffusion convolution block (DiCB) to enhance interactions within and between subspaces. By using reparameterized flow matching, the framework incorporates speech language model (SLM) and semantic losses during training, leading to state-of-the-art generation quality on benchmark datasets.
Introduces a hierarchical subspace latent diffusion model (SLD-L2S) for lip-to-speech synthesis that directly maps visual lip movements to the continuous latent space of a pre-trained neural audio codec, enabling the incorporation of SLM and semantic losses via reparameterized flow matching.
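Flow matching of the kind mentioned reduces to regressing a velocity field along a straight path between noise and data; the training objective below is the generic conditional form with a placeholder model signature, not the SLD-L2S architecture itself.

```python
import torch

def flow_matching_loss(model, x1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching on the linear path x_t = (1 - t) * x0 + t * x1,
    whose target velocity is (x1 - x0).
    x1: clean codec latents (batch, frames, dim); cond: visual lip features."""
    x0 = torch.randn_like(x1)                            # noise sample
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)   # per-example time in [0, 1)
    xt = (1 - t) * x0 + t * x1
    v_pred = model(xt, t.squeeze(-1).squeeze(-1), cond)  # placeholder call signature
    return torch.mean((v_pred - (x1 - x0)) ** 2)
```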
This paper investigates the impact of explicit frequency-domain feature modeling on HRTF magnitude upsampling from sparse measurements, comparing various architectures including MLPs, CNNs, dilated CNNs, and attention-based models. The authors find that explicitly modeling spectral dependencies consistently improves reconstruction accuracy, especially under severe sparsity. They propose a frequency-domain Conformer-based architecture to capture both local spectral continuity and long-range frequency correlations, achieving state-of-the-art performance on the SONICOM and HUTUBS datasets.
Demonstrates the importance of explicit frequency-domain feature modeling for HRTF magnitude upsampling and introduces a Conformer-based architecture that leverages both local and long-range spectral dependencies.
The paper introduces audio-interleaved reasoning for Large Audio Language Models (LALMs) to overcome the information bottleneck of one-time audio encoding. The authors propose a two-stage training framework involving supervised fine-tuning for salient audio segment localization and reinforcement learning to encourage re-listening. The resulting LALM, Echo, demonstrates improved performance on audio comprehension benchmarks, showcasing the benefits of dynamic audio re-listening during reasoning.
Introduces and validates audio-interleaved reasoning, enabling LALMs to actively re-listen to audio during the reasoning process, thereby improving audio comprehension.
This paper investigates the internal representations of high-level musical concepts within audio diffusion models using activation patching, revealing that a small subset of attention layers controls distinct semantic concepts. They then use Contrastive Activation Addition and Sparse Autoencoders in these key layers to achieve more precise control over audio generation. The authors demonstrate the ability to manipulate specific musical elements like tempo and mood by steering activations in the identified layers.
Demonstrates precise control over generated audio by identifying and steering activations in specific attention layers of audio diffusion models.
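Activation steering of this kind can be implemented with a forward hook that shifts a chosen layer's output along a concept direction; the layer index, direction vector, and scale below are placeholders, not values from the paper.

```python
import torch

def add_steering_hook(layer: torch.nn.Module, direction: torch.Tensor, scale: float = 4.0):
    """Register a forward hook that adds a scaled concept vector (e.g. one derived
    via Contrastive Activation Addition) to the layer's output."""
    def hook(_module, _inputs, output):
        return output + scale * direction.to(output.device, output.dtype)
    return layer.register_forward_hook(hook)

# Hypothetical usage: handle = add_steering_hook(model.attn_layers[7], tempo_direction)
# ... run generation ...
# handle.remove()   # restores the unsteered model
```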
The authors introduce CAT (Causal Audio Tokenizer with Transformer), a fully Transformer-based architecture for end-to-end learning of discrete audio tokenizers, jointly optimizing the encoder, quantizer, and decoder. They then scale CAT to create MOSS-Audio-Tokenizer, a 1.6B parameter model pre-trained on 3M hours of audio data. Results demonstrate that MOSS-Audio-Tokenizer achieves state-of-the-art audio reconstruction across speech, sound, and music, and enables competitive ASR and autoregressive TTS performance.
Introduces and validates a scalable, fully end-to-end Transformer architecture (CAT) for discrete audio tokenization that outperforms existing codecs and enables downstream audio tasks.
The paper introduces Voxtral Realtime, a novel automatic speech recognition (ASR) model designed for native streaming with sub-second latency. Unlike chunking-based approaches, Voxtral Realtime is trained end-to-end for streaming with explicit audio-text alignment, leveraging the Delayed Streams Modeling framework. The model incorporates a new causal audio encoder and Ada RMS-Norm for improved delay conditioning, and achieves performance comparable to Whisper at a 480ms delay after large-scale pretraining across 13 languages.
Presents Voxtral Realtime, a natively streaming ASR model that matches offline transcription quality at sub-second latency through end-to-end training and explicit audio-text stream alignment.
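Ada RMS-Norm is only named above; one plausible reading, sketched below as an assumption rather than Voxtral Realtime's actual formulation, is an RMS normalization whose gain is predicted from a delay-conditioning embedding.

```python
import torch
import torch.nn as nn

class AdaRMSNorm(nn.Module):
    """RMS normalization whose per-channel scale is predicted from a conditioning
    vector (e.g. an embedding of the target streaming delay)."""
    def __init__(self, dim: int, cond_dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.to_scale = nn.Linear(cond_dim, dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        scale = 1.0 + self.to_scale(cond).unsqueeze(1)   # (batch, 1, dim)
        return x * rms * scale

norm = AdaRMSNorm(dim=512, cond_dim=64)
y = norm(torch.randn(2, 100, 512), torch.randn(2, 64))
```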
The paper introduces Stemphonic, a diffusion/flow-based framework for generating a variable set of synchronized music stems in a single inference pass, addressing the limitations of existing parallel or sequential stem generation methods. Stemphonic achieves this by treating each stem as a batch element during training, grouping synchronized stems, and applying a shared noise latent to each group, enabling efficient multi-stem generation conditioned on stem-specific text inputs. Experiments on open-source stem evaluation sets demonstrate that Stemphonic generates higher-quality outputs and accelerates full mix generation by 25-50%.
Introduces a novel diffusion/flow-based architecture, Stemphonic, that generates a variable number of synchronized music stems in a single pass, improving both generation speed and output quality compared to existing methods.
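The shared-noise grouping can be sketched as sampling one noise latent per song and repeating it across that song's stems, which then travel through the batch dimension together; the shapes below are illustrative only.

```python
import torch

def grouped_noise(num_groups: int, stems_per_group: int, shape) -> torch.Tensor:
    """Sample one noise latent per group (e.g. per song) and share it across the
    group's stems, which are laid out as ordinary batch elements."""
    base = torch.randn(num_groups, *shape)                  # one latent per song
    return base.repeat_interleave(stems_per_group, dim=0)   # (groups * stems, ...)

noise = grouped_noise(num_groups=2, stems_per_group=4, shape=(64, 256))
# Stems of the same song start from identical noise, so their generations stay synchronized.
```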
The paper introduces NarraScore, a hierarchical framework for generating soundtracks for long-form videos by leveraging emotion as a compressed representation of narrative logic. It uses frozen Vision-Language Models (VLMs) to extract Valence-Arousal trajectories from video and employs a Dual-Branch Injection strategy, consisting of a Global Semantic Anchor and a Token-Level Affective Adapter, to control musical dynamics. Experiments show that NarraScore achieves state-of-the-art consistency and narrative alignment with minimal computational cost.
Introduces a hierarchical framework, NarraScore, that leverages VLMs and a dual-branch injection strategy to generate narrative-aligned soundtracks for long-form videos.
The paper introduces MOVA, an open-source Mixture-of-Experts (MoE) model with 32B parameters (18B active) designed for synchronized video and audio generation. MOVA addresses the limitations of cascaded pipelines and closed-source systems by enabling simultaneous generation of high-quality audio-visual content, including lip-synced speech and environment-aware sound effects. The model supports Image-Text to Video-Audio (IT2VA) generation and is released with code for efficient inference, LoRA fine-tuning, and prompt enhancement.
Introduces MOVA, an open-source MoE model for synchronized video and audio generation, facilitating research and development in joint multimodal modeling.
This paper surveys the evolution of Emotion Recognition in Video (ERV) systems, tracing the shift from handcrafted features and task-specific deep learning models to transformer-based vision-language models and multimodal large language models (MLLMs). It analyzes multimodal fusion strategies, dataset characteristics, and evaluation protocols, while highlighting limitations related to robustness, bias, and annotation quality. The review compares task-specific models with foundation model approaches, clarifying their strengths and weaknesses for different application contexts, and outlines future research directions for robust and efficient ERV systems.
Systematically reviews and compares the evolution of emotion recognition in video, contrasting classical approaches with emerging multimodal large language model-based methods, and identifies key challenges for real-world deployment.
The authors introduce Muse, an open-source system for long-form song generation with fine-grained style conditioning, addressing the lack of reproducibility in academic research due to unavailable training data. They release a dataset of 116k fully licensed synthetic songs with lyrics and style descriptions paired with SunoV5-synthesized audio. Muse, a Qwen-based language model finetuned with discrete audio tokens, achieves competitive performance in phoneme error rate, text-music style similarity, and audio aesthetic quality, demonstrating controllable segment-level generation.
Releases Muse, a fully open-source system for long-form song generation, along with a licensed synthetic dataset and training/evaluation pipelines, to enable reproducible research.
LTX-2, a new open-source foundation model, generates high-quality, temporally synchronized audiovisual content by employing an asymmetric dual-stream transformer architecture with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers. The model incorporates temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning, along with a multilingual text encoder and modality-aware classifier-free guidance (modality-CFG) to improve audiovisual alignment and controllability. Evaluations demonstrate that LTX-2 achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, rivaling proprietary models with significantly reduced computational cost and inference time.
Introduces LTX-2, an efficient open-source audiovisual foundation model that achieves state-of-the-art quality with improved alignment and controllability through an asymmetric dual-stream transformer architecture and modality-aware classifier-free guidance.
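Modality-aware classifier-free guidance presumably applies a separate guidance weight to each stream; the combination rule below is a generic sketch under that assumption, not LTX-2's published formula.

```python
import torch

def modality_cfg(eps_uncond: dict, eps_cond: dict, scales: dict) -> dict:
    """Classifier-free guidance applied per modality, each stream with its own scale.
    eps_*: {'video': tensor, 'audio': tensor} model predictions."""
    return {
        m: eps_uncond[m] + scales[m] * (eps_cond[m] - eps_uncond[m])
        for m in eps_cond
    }

guided = modality_cfg(
    {"video": torch.zeros(1, 8), "audio": torch.zeros(1, 4)},
    {"video": torch.ones(1, 8), "audio": torch.ones(1, 4)},
    {"video": 6.0, "audio": 3.0},   # e.g. stronger guidance on the video stream
)
```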
The paper introduces AV-SpeakerBench, a new benchmark designed to evaluate fine-grained audiovisual reasoning capabilities of MLLMs, specifically focusing on understanding human speech in videos. The benchmark consists of 3,212 multiple-choice questions centered on speaker-centric reasoning, requiring models to align who speaks, what is said, and when it occurs. Experiments demonstrate that the Gemini family of models outperforms open-source alternatives, with Gemini 2.5 Pro achieving the highest accuracy, while highlighting the challenges open models face in audiovisual fusion.
Introduces AV-SpeakerBench, a novel benchmark for evaluating speaker-centric audiovisual reasoning in MLLMs, designed to assess fine-grained understanding of human speech in real-world videos.
The paper introduces TwinVoice, a multi-dimensional benchmark designed to evaluate the persona simulation capabilities of Large Language Models (LLMs) across social, interpersonal, and narrative contexts. TwinVoice decomposes persona simulation into six fundamental capabilities, including opinion consistency, memory recall, and syntactic style, providing a granular assessment framework. Experiments using TwinVoice reveal that while LLMs demonstrate moderate accuracy, they significantly underperform in areas like syntactic style and memory recall compared to human baselines, highlighting areas for future research.
Introduces TwinVoice, a novel benchmark for evaluating LLM-based persona simulation across diverse real-world contexts and decomposed into fundamental capabilities.
The paper introduces MMedFD, a new real-world Chinese healthcare ASR dataset designed for multi-turn, full-duplex conversations, addressing the scarcity of open benchmarks for clinical dialogue ASR. The dataset contains 5,805 annotated sessions with synchronized user and mixed-channel audio, along with RTTM/CTM timing and role labels. The authors also present a model-agnostic pipeline for streaming segmentation, speaker attribution, and dialogue memory, and fine-tune Whisper-small on role-concatenated audio, providing a reproducible benchmark for streaming ASR and end-to-end duplex agents.
Introduces MMedFD, the first publicly available real-world Chinese healthcare ASR corpus for multi-turn, full-duplex scenarios, complete with annotations, evaluation metrics, and a baseline pipeline.
The paper introduces FlexSED, an open-vocabulary sound event detection system that addresses the limitations of traditional multi-class SED frameworks by enabling free-text sound queries and improving zero/few-shot learning. FlexSED leverages a pretrained audio SSL model and the CLAP text encoder, incorporating an encoder-decoder architecture and adaptive fusion for continuous training. By using LLMs to refine event query selection for training, FlexSED achieves state-of-the-art performance on AudioSet-Strong and exhibits strong zero-shot and few-shot capabilities.
Introduces an open-vocabulary sound event detection system, FlexSED, which effectively integrates pretrained audio and text encoders with LLM-assisted training to achieve superior performance and generalization.
This paper introduces a LangGraph-based multimodal video understanding agent that leverages a state-machine architecture for intelligent routing and multi-task processing of video, audio, and subtitles. The authors fine-tune the Qwen2.5-VL model on a newly constructed multimodal video understanding dataset, which encompasses perception, temporal reasoning, and higher-level inference tasks. Results show significant performance improvements on video question answering and temporal localization, demonstrating the benefits of multi-task learning and intelligent routing.
Introduces a novel LangGraph-based agent architecture that intelligently routes and processes multimodal video data using a state-machine and multi-task fine-tuned VLM.
This paper introduces a large language model-based digital patient (LLMDP) system that converts de-identified electronic health records into interactive, voice-enabled virtual patients for ophthalmology training. The LLMDP system, built upon a retrieval-augmented framework, allows for free-text dialogue and adaptive feedback. A randomized controlled trial (N=84) demonstrated that students trained with the LLMDP system significantly improved their medical history-taking assessment scores and empathy compared to those using traditional methods, suggesting a scalable and effective approach to medical education.
Demonstrates that a large language model-based digital patient system significantly enhances medical history-taking skills and empathy in ophthalmology trainees compared to traditional training methods.
The authors introduce MERaLiON-AudioLLM, a large language model trained on 62 million multimodal instruction samples (260k hours of audio) to understand Singlish and perform diverse audio-based tasks. This model addresses the gap in region-specific AI capable of understanding colloquial and code-switched language. MERaLiON-AudioLLM demonstrates competitive performance in ASR, spoken question answering, speech translation, and paralinguistic analysis, particularly excelling in local speech recognition compared to existing open-source models.
Presents MERaLiON-AudioLLM, the first general-purpose, multitask audio-based LLM designed to understand Singlish.
This survey reviews the integration of large language models (LLMs) into robotic systems, focusing on locomotion, navigation, manipulation, and voice interaction. It highlights how LLMs translate natural language commands into robot actions, enable semantic planning, and support adaptive execution using frameworks like SayTap, TrustNavGPT, MapGPT, and 3D-LOTUS++. The review also covers training methodologies, benchmark datasets, and deployment architectures for bridging the sim-to-real gap and achieving cross-embodiment generalization in LLM-enhanced robotic systems.
Synthesizes a comprehensive overview of current approaches for integrating LLMs into robotic autonomy, categorizing methods by application area (locomotion, navigation, manipulation, voice) and highlighting key frameworks, datasets, and deployment considerations.
This paper addresses the underexplored NLP tasks of structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations, which are critical for reducing healthcare provider documentation burden. The authors evaluate the performance of both open- and closed-weight LLMs on these tasks using private and newly released open-source datasets (SYNUR and SIMORD). They also propose an agentic pipeline for generating realistic, non-sensitive nurse dictations to facilitate structured extraction of clinical observations.
Introduces SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction, respectively, and evaluates LLM performance on these tasks.
The authors introduce OpenS2S, a fully open-source end-to-end large speech language model (LSLM) for empathetic speech interactions. OpenS2S builds upon the BLSP-Emo model and uses a streaming interleaved decoding architecture for low-latency speech generation. To enable end-to-end training, they develop an automated data construction pipeline that leverages LLMs and controllable TTS systems to synthesize a diverse and scalable empathetic speech dialogue corpus.
Introduces OpenS2S, a fully open-source end-to-end LSLM, along with a novel automated data construction pipeline for generating empathetic speech dialogues, to promote transparent research in empathetic speech systems.
The authors introduce OMAR-RQ, an open-source music audio representation model trained on a large dataset (330k+ hours) using self-supervision via masked token prediction. The model explores different input features and quantization strategies to learn general-purpose music representations. OMAR-RQ achieves state-of-the-art performance among open self-supervised models on a suite of music understanding tasks, including tagging, pitch estimation, and beat tracking.
Introduces OMAR-RQ, a novel open-source self-supervised model for music audio representation trained with multi-feature masked token prediction, demonstrating superior performance compared to existing open models across various MIR tasks.
The paper introduces Ming-Omni, a unified multimodal model that processes images, text, audio, and video using modality-specific encoders and a Mixture-of-Experts (MoE) architecture called Ling with modality-specific routers. This architecture allows Ming-Omni to perform both perception and generation tasks across modalities without task-specific fine-tuning. The model achieves strong performance in speech and image generation through the integration of an advanced audio decoder and Ming-Lite-Uni, and is released as an open-source model matching GPT-4o in modality support.
Introduces a unified multimodal architecture, Ming-Omni, capable of both perception and generation across image, text, audio, and video modalities using a novel MoE routing mechanism.
This paper introduces a novel 5G signal processing method that combines a Transformer architecture for parallel computation with a GRU-based recurrent architecture to enhance accuracy and efficiency under time-varying channel conditions. The Transformer improves computational efficiency, while the GRU layer refines the convolutional features to better capture temporal dynamics. Experimental results demonstrate a 5 dB SNR improvement over traditional Decision Feedback Equalizers (DFE) while reducing computational complexity, even under severe inter-symbol interference and non-linear distortions.
Introduces a hybrid Transformer and GRU architecture for 5G signal processing that significantly improves both accuracy and computational efficiency compared to traditional DFE methods.
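A minimal hybrid of the two components the summary names, a Transformer encoder for parallel context modeling followed by a GRU for temporal dynamics, might look like the sketch below; layer sizes and the output head are arbitrary, and the paper's actual wiring may differ.

```python
import torch
import torch.nn as nn

class TransformerGRUEqualizer(nn.Module):
    """Transformer encoder for parallel context modeling, followed by a GRU that
    tracks time-varying channel dynamics, ending in a per-symbol estimate."""
    def __init__(self, dim: int = 64, heads: int = 4, layers: int = 2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 2)      # real/imaginary parts of the equalized symbol

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, symbols, dim)
        h = self.encoder(x)
        h, _ = self.gru(h)
        return self.head(h)

model = TransformerGRUEqualizer()
out = model(torch.randn(8, 128, 64))
```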
The paper introduces PAST, a new end-to-end framework for speech tokenization that jointly models phonetic information and signal reconstruction without relying on external pre-trained models. PAST leverages supervised phonetic data through auxiliary tasks to directly integrate phonetic domain knowledge into the tokenization process. The framework, including a streamable variant, demonstrates superior performance in phonetic representation, speech reconstruction, and as a speech representation for speech language models compared to existing baseline tokenizers.
Introduces an end-to-end trainable speech tokenizer, PAST, that integrates phonetic information directly via supervised learning, eliminating the need for pre-trained self-supervised models.
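The joint objective amounts to adding a supervised phonetic term to the usual codec reconstruction loss; the CTC-style auxiliary head and the weighting below are illustrative assumptions, not PAST's exact loss recipe.

```python
import torch
import torch.nn.functional as F

def past_style_loss(wav, wav_hat, phone_logits, phone_targets, in_lens, tgt_lens,
                    lambda_phone: float = 0.5) -> torch.Tensor:
    """Waveform reconstruction loss plus an auxiliary phonetic (CTC) loss computed
    from an intermediate representation of the tokenizer.
    phone_logits: (batch, frames, vocab); phone_targets: (batch, max_len)."""
    recon = F.l1_loss(wav_hat, wav)
    log_probs = F.log_softmax(phone_logits, dim=-1).transpose(0, 1)  # (frames, batch, vocab)
    phonetic = F.ctc_loss(log_probs, phone_targets, in_lens, tgt_lens, blank=0)
    return recon + lambda_phone * phonetic
```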
This paper introduces SPEAK PDF, a system that converts PDF documents into audio format with integrated translation and summarization features. The system uses TTS technology for audio conversion, machine translation for multilingual support, and NLP techniques for text summarization. The key result is a tool designed to improve accessibility, promote multilingualism, and enhance productivity by enabling efficient comprehension of PDF documents.
Introduces a system integrating text-to-speech, machine translation, and NLP-based summarization to enhance PDF accessibility and comprehension.
The paper introduces P2Mark, a novel plug-and-play parameter-level watermarking method for neural speech generation (NSG) models designed for open-source scenarios. P2Mark embeds watermarks directly into the model weights using a lightweight adapter during training, enabling pre-release watermark modification and post-release security. Experiments on vocoder and codec models demonstrate that P2Mark achieves comparable performance to existing audio watermarking techniques in terms of accuracy, imperceptibility, and robustness, while providing white-box protection.
Introduces a parameter-level watermarking technique, P2Mark, that embeds watermarks directly into the weights of neural speech generation models, enabling copyright protection in open-source settings.
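One way to read "parameter-level watermarking with a lightweight adapter" is a low-rank weight offset that is trained alongside a detector and then folded into the released weights; the sketch below shows only that merge step, with all names hypothetical and no claim about P2Mark's actual training scheme.

```python
import torch
import torch.nn as nn

class WatermarkAdapter(nn.Module):
    """Low-rank offset delta_W = B @ A added to a host linear layer's weight."""
    def __init__(self, out_features: int, in_features: int, rank: int = 4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))

    @torch.no_grad()
    def merge_into(self, linear: nn.Linear) -> None:
        """Fold the watermark offset into the weights before release."""
        linear.weight.add_(self.B @ self.A)

layer = nn.Linear(256, 256)
adapter = WatermarkAdapter(256, 256)
adapter.merge_into(layer)   # the published checkpoint carries the watermark in its parameters
```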
The paper presents a low-cost, embedded system for determining breathing rates using an IEEE 802.15.4z IR-UWB radar and a custom CNN architecture. The CNN predicts breathing rates directly from UWB channel impulse response (CIR) data, outperforming rule-based and model-based methods with a mean absolute error of 1.73 BPM, further reduced to 0.84 BPM with calibration data. The study demonstrates the feasibility of deploying the quantized CNN on a low-cost nRF52840 SoC with minimal performance degradation, achieving long battery life for continuous remote healthcare monitoring.
Demonstrates a highly efficient and accurate embedded system for breathing rate determination using a CNN-based approach on low-cost IR-UWB hardware.
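A small 1-D CNN regressor over CIR frames of the kind described could look like the sketch below; input dimensions and layer widths are placeholders, not the deployed network.

```python
import torch
import torch.nn as nn

class BreathingRateCNN(nn.Module):
    """Regress breaths per minute directly from a window of UWB channel impulse
    responses shaped (batch, range_bins, frames)."""
    def __init__(self, range_bins: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(range_bins, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1),
        )

    def forward(self, cir: torch.Tensor) -> torch.Tensor:
        return self.net(cir)          # predicted breathing rate in BPM

model = BreathingRateCNN()
bpm = model(torch.randn(4, 32, 600))  # e.g. 30 s of CIR frames at 20 Hz
```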
Audio Flamingo 2 (AF2) is introduced as an Audio-Language Model (ALM) that enhances audio understanding and reasoning by utilizing a custom CLAP model, synthetic Audio QA data, and a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance on over 20 benchmarks with a 3B parameter model, outperforming larger models. The work also introduces LongAudio, a new dataset for training ALMs on long audio segments (30 secs to 5 mins), and demonstrates exceptional performance on the LongAudioBench benchmark after fine-tuning AF2.
Introduces Audio Flamingo 2, an ALM with enhanced audio understanding and reasoning capabilities, and the LongAudio dataset and benchmark for long audio understanding.
This paper explores integrating generative AI into multimodal learning by fusing vision (CNNs), text (NLP), and audio (RNNs) data streams for enhanced human-computer interaction. The research demonstrates that this integrative approach improves interaction accuracy, system responsiveness, and user engagement compared to unimodal systems. The study also proposes adaptive weighting strategies and modular architectures to address challenges like cross-modal data alignment and computational demands.
Demonstrates the feasibility and benefits of a generative AI-driven multimodal learning framework that integrates vision, text, and audio for improved human-computer interaction.
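The adaptive weighting the summary mentions can be sketched as a learned softmax gate over per-modality embeddings; this is a generic fusion pattern, not the paper's specific architecture.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Fuse vision, text, and audio embeddings with input-dependent softmax weights."""
    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        self.gate = nn.Linear(dim * num_modalities, num_modalities)

    def forward(self, embeddings: list) -> torch.Tensor:
        stacked = torch.stack(embeddings, dim=1)                              # (batch, M, dim)
        weights = torch.softmax(self.gate(torch.cat(embeddings, dim=-1)), dim=-1)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)                   # (batch, dim)

fusion = AdaptiveFusion(dim=128)
fused = fusion([torch.randn(2, 128) for _ in range(3)])   # vision, text, audio embeddings
```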
The paper introduces SongGen, a single-stage auto-regressive transformer model for text-to-song generation that addresses limitations of multi-stage approaches. SongGen allows for fine-grained control over musical attributes like lyrics, instrumentation, genre, and timbre, and supports voice cloning via a reference clip. The model is trained with different token pattern strategies for mixed and dual-track output modes, and the authors demonstrate improved generation quality with their approach.
Introduces a novel single-stage auto-regressive transformer architecture, SongGen, for controllable text-to-song generation, enabling flexible output modes and fine-grained control over musical attributes.
The paper introduces MusicGen-Stem, a multi-stem autoregressive music generation model capable of generating and editing individual stems (bass, drums, other) and their mixtures. The authors train specialized compression algorithms for each stem to create parallel token streams and leverage music source separation techniques to train a multi-stream text-to-music language model on a large dataset. The model's conditioning method enables editing of individual stems in existing or generated songs, facilitating iterative composition.
Introduces a novel multi-stem autoregressive music generation model that allows for independent control and editing of individual stems (bass, drums, and other) within a musical composition.

