Speech & Audio
Applications: Speech recognition, text-to-speech, audio generation, music AI, and spoken language understanding.
Recent Papers
The paper introduces Moonshine v2, an ergodic streaming encoder ASR model designed for latency-critical speech applications, particularly on resource-constrained edge devices. It addresses the latency issues of full-attention Transformer encoders by employing sliding-window self-attention, enabling bounded, low-latency inference while maintaining strong local context. Experiments demonstrate that Moonshine v2 achieves state-of-the-art word error rates on standard benchmarks, matching the accuracy of models six times larger while running significantly faster.
Introduces an ergodic streaming encoder ASR model, Moonshine v2, that uses sliding-window self-attention to achieve low-latency and high accuracy for on-device speech recognition.
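The core mechanism here, sliding-window self-attention, can be pictured as a banded attention mask; the window size, single-head formulation, and tensor shapes below are illustrative assumptions rather than Moonshine v2's actual configuration.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where position i may only attend to positions within
    `window` steps of i (True = attention allowed)."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

def windowed_self_attention(x: torch.Tensor, window: int = 16) -> torch.Tensor:
    """Single-head scaled dot-product attention restricted to a local window.
    x: (batch, seq_len, dim)."""
    b, t, d = x.shape
    scores = (x @ x.transpose(1, 2)) / d ** 0.5           # (b, t, t)
    mask = sliding_window_mask(t, window).to(x.device)    # (t, t), broadcast over batch
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ x

y = windowed_self_attention(torch.randn(2, 128, 64))      # bounded context -> bounded latency
```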
The paper introduces WavBench, a new benchmark for end-to-end spoken dialogue models that evaluates reasoning, colloquialism, and paralinguistics, addressing limitations of existing text-centric benchmarks. WavBench comprises three subsets: Pro (reasoning), Basic (colloquialism), and Acoustic (paralinguistics), designed to assess complex problem-solving, natural language fluency, and nuanced understanding/generation of acoustic cues. Evaluation of five state-of-the-art models using WavBench reveals critical insights into model performance across these dimensions, highlighting areas for improvement in building more robust spoken dialogue agents.
Introduces WavBench, a novel benchmark dataset and evaluation toolkit designed to comprehensively assess reasoning, colloquialism, and paralinguistic capabilities in end-to-end spoken dialogue models.
The paper introduces Cross-Modal Robustness Transfer (CMRT) to improve the robustness of End-to-End Speech Translation (E2E-ST) models against morphological variations. CMRT leverages adversarial training in the text modality to transfer robustness to the speech modality, eliminating the need for computationally expensive adversarial speech data generation. Experiments across four language pairs show that CMRT improves adversarial robustness by over 3 BLEU points compared to baseline E2E-ST models.
Introduces Cross-Modal Robustness Transfer (CMRT), a novel framework for enhancing E2E-ST model robustness by transferring adversarial robustness from text to speech.
The paper investigates modality arbitration in Audio-LLMs, revealing a strong bias towards text over audio when the two modalities conflict, even when audio quality is superior. Using the ALME benchmark, the authors demonstrate that Gemini 2.0 Flash exhibits significantly higher text dominance in audio-text conflicts compared to text-text conflicts. They propose that this text dominance arises from an asymmetry in arbitration accessibility rather than information content, and provide evidence through interventions like forced transcription and fine-tuning ablations.
Reveals and analyzes a significant text dominance bias in audio-LLMs during modality arbitration, attributing it to differences in representational accessibility rather than information content.
The paper investigates why speech recognition models fail to transcribe U.S. street names, finding a 44% error rate across 15 models from major vendors and disproportionately larger routing-distance errors for speakers whose primary language is not English. It highlights the gap between benchmark performance and real-world reliability, particularly for high-stakes tasks involving named entities. The authors then demonstrate that fine-tuning with a small, synthetically generated dataset of diverse pronunciations improves street-name transcription accuracy by nearly 60% for non-English primary speakers.
Demonstrates that speech recognition models exhibit significant transcription errors on street names, particularly impacting non-English speakers, and mitigates this issue through synthetic data augmentation.
The paper introduces DreamID-Omni, a unified framework for human-centric audio-video generation, addressing tasks like reference-based generation, video editing, and audio-driven animation within a single model. It tackles the challenge of disentangling character identities and voice timbres by employing a Dual-Level Disentanglement strategy and a Symmetric Conditional Diffusion Transformer. Experimental results demonstrate state-of-the-art performance in video, audio, and audio-visual consistency, surpassing even proprietary commercial models.
Introduces a unified framework, DreamID-Omni, that achieves state-of-the-art performance on a range of human-centric audio-video generation tasks by disentangling identity and timbre control.
The paper introduces WaveFormer, a transformer architecture tailored for biomedical signal classification, addressing the limitations of standard transformers in capturing multi-scale frequency patterns in long sequences. WaveFormer incorporates wavelet decomposition at two stages: embedding construction, via a multi-channel DWT, and positional encoding, via Dynamic Wavelet Positional Encoding (DyWPE). Experiments across eight datasets for human activity recognition and brain signal analysis demonstrate WaveFormer's competitive performance by effectively integrating frequency-domain information.
Introduces a novel transformer architecture, WaveFormer, that integrates wavelet decomposition into both the embedding and positional encoding stages to improve biomedical signal classification.
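A minimal sketch of wavelet-based embedding construction using PyWavelets follows; the wavelet family, decomposition level, and per-band statistics are assumptions for illustration, not WaveFormer's exact design.

```python
import numpy as np
import pywt

def dwt_embedding(signal: np.ndarray, wavelet: str = "db4", level: int = 3) -> np.ndarray:
    """Decompose each channel of a (channels, time) signal with a multi-level DWT
    and concatenate per-band summary statistics into a feature vector."""
    feats = []
    for channel in signal:
        coeffs = pywt.wavedec(channel, wavelet, level=level)  # [cA_L, cD_L, ..., cD_1]
        for band in coeffs:
            feats.extend([band.mean(), band.std()])
    return np.asarray(feats, dtype=np.float32)

x = np.random.randn(3, 1024)   # e.g. a 3-channel biomedical recording
emb = dwt_embedding(x)         # fixed-length, frequency-aware embedding
```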
The paper introduces A²V-SLP, an alignment-aware variational framework for sign language production that learns disentangled latent distributions for each articulator. This approach uses a disentangled VAE to encode sign pose sequences and extract articulator-specific mean and variance vectors, which then serve as distributional supervision for a non-autoregressive Transformer that predicts latent means and log-variances from text embeddings. By employing stochastic sampling and a gloss attention mechanism, A²V-SLP achieves state-of-the-art back-translation performance and enhances motion realism in gloss-free sign language production.
Introduces an alignment-aware variational framework (A²V-SLP) that learns disentangled latent distributions for sign language production, improving back-translation performance and motion realism.
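The distributional supervision described above comes down to predicting a mean and log-variance per articulator and sampling with the reparameterization trick; the sketch below shows that step only, with invented tensor shapes.

```python
import torch

def sample_articulator_latents(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Reparameterized sampling z = mu + sigma * eps, applied independently per
    articulator. mu, logvar: (batch, num_articulators, latent_dim)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

# e.g. latents for hands, face, and body predicted from text embeddings
mu = torch.zeros(4, 3, 64)
logvar = torch.full((4, 3, 64), -1.0)
z = sample_articulator_latents(mu, logvar)   # stochastic sampling adds motion variety
```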
The paper introduces Trans-Chunk BiMamba (TC-BiMamba), a novel architecture for unified streaming and non-streaming automatic speech recognition (ASR) that addresses the limitations of existing BiMamba-based streaming methods which are restricted to fixed chunk sizes. TC-BiMamba employs a trans-chunk mechanism to train bidirectional sequences offline with dynamic chunk sizes, enabling a single model to handle both offline and streaming decoding with varying latency requirements. Experiments demonstrate that TC-BiMamba achieves a 1.3x training speedup, reduces memory consumption by 50%, and improves ASR performance compared to chunk-wise processing, while also outperforming U2++ and matching LC-BiMamba with a smaller model size.
Introduces the Trans-Chunk BiMamba (TC-BiMamba) architecture, enabling efficient dynamic chunk size training for unified streaming and non-streaming ASR.
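One way to picture dynamic chunk sizes is to re-sample the chunk length every training step so a single model is exposed to many latency settings; the sketch below shows only that generic chunking idea, not TC-BiMamba's actual trans-chunk mechanism over bidirectional state-space sequences.

```python
import random
import torch

def dynamic_chunks(x: torch.Tensor, min_chunk: int = 8, max_chunk: int = 64):
    """Split a (batch, time, dim) sequence into chunks whose size is sampled anew
    each call, so one model sees many latency settings during training."""
    chunk = random.randint(min_chunk, max_chunk)
    return torch.split(x, chunk, dim=1), chunk

x = torch.randn(2, 200, 80)              # e.g. a batch of log-mel frames
chunks, chunk_size = dynamic_chunks(x)
# The forward pass runs left-to-right over chunks; backward context stays within a
# chunk, so the same weights serve both streaming (small chunks) and offline decoding.
```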
This paper introduces the concept of "musical metamerism," analogous to visual metamerism, where dissimilar audio waveforms produce similar auditory sensations. The authors present a method for generating musical metamers from audio recordings using joint time-frequency scattering (JTFS) implemented in the Kymatio Python library. The method's key advantage is its lack of reliance on manual preprocessing steps like transcription or source separation.
Introduces a novel method for generating musical metamers using joint time-frequency scattering, eliminating the need for manual audio preprocessing.
The paper introduces SLD-L2S, a novel lip-to-speech (L2S) framework based on a hierarchical subspace latent diffusion model that directly maps visual lip movements to the continuous latent space of a pre-trained neural audio codec, bypassing intermediate representations. The method employs a hierarchical architecture with parallel subspaces and a diffusion convolution block (DiCB) to enhance interactions within and between subspaces. By using reparameterized flow matching, the framework incorporates speech language model (SLM) and semantic losses during training, leading to state-of-the-art generation quality on benchmark datasets.
Introduces a hierarchical subspace latent diffusion model (SLD-L2S) for lip-to-speech synthesis that directly maps visual lip movements to the continuous latent space of a pre-trained neural audio codec, enabling the incorporation of SLM and semantic losses via reparameterized flow matching.
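Flow matching of the kind mentioned reduces to regressing a velocity field along a straight path between noise and data; the training objective below is the generic conditional form with a placeholder model signature, not the SLD-L2S architecture itself.

```python
import torch

def flow_matching_loss(model, x1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching on the linear path x_t = (1 - t) * x0 + t * x1,
    whose target velocity is (x1 - x0).
    x1: clean codec latents (batch, frames, dim); cond: visual lip features."""
    x0 = torch.randn_like(x1)                            # noise sample
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)   # per-example time in [0, 1)
    xt = (1 - t) * x0 + t * x1
    v_pred = model(xt, t.squeeze(-1).squeeze(-1), cond)  # placeholder call signature
    return torch.mean((v_pred - (x1 - x0)) ** 2)
```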
This paper investigates the impact of explicit frequency-domain feature modeling on HRTF magnitude upsampling from sparse measurements, comparing various architectures including MLPs, CNNs, dilated CNNs, and attention-based models. The authors find that explicitly modeling spectral dependencies consistently improves reconstruction accuracy, especially under severe sparsity. They propose a frequency-domain Conformer-based architecture to capture both local spectral continuity and long-range frequency correlations, achieving state-of-the-art performance on the SONICOM and HUTUBS datasets.
Demonstrates the importance of explicit frequency-domain feature modeling for HRTF magnitude upsampling and introduces a Conformer-based architecture that leverages both local and long-range spectral dependencies.
The paper introduces audio-interleaved reasoning for Large Audio Language Models (LALMs) to overcome the information bottleneck of one-time audio encoding. The authors propose a two-stage training framework involving supervised fine-tuning for salient audio segment localization and reinforcement learning to encourage re-listening. The resulting LALM, Echo, demonstrates improved performance on audio comprehension benchmarks, showcasing the benefits of dynamic audio re-listening during reasoning.
Introduces and validates audio-interleaved reasoning, enabling LALMs to actively re-listen to audio during the reasoning process, thereby improving audio comprehension.
This paper investigates the internal representations of high-level musical concepts within audio diffusion models using activation patching, revealing that a small subset of attention layers controls distinct semantic concepts. They then use Contrastive Activation Addition and Sparse Autoencoders in these key layers to achieve more precise control over audio generation. The authors demonstrate the ability to manipulate specific musical elements like tempo and mood by steering activations in the identified layers.
Demonstrates precise control over generated audio by identifying and steering activations in specific attention layers of audio diffusion models.
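Activation steering of this kind can be implemented with a forward hook that shifts a chosen layer's output along a concept direction; the layer index, direction vector, and scale below are placeholders, not values from the paper.

```python
import torch

def add_steering_hook(layer: torch.nn.Module, direction: torch.Tensor, scale: float = 4.0):
    """Register a forward hook that adds a scaled concept vector (e.g. one derived
    via Contrastive Activation Addition) to the layer's output."""
    def hook(_module, _inputs, output):
        return output + scale * direction.to(output.device, output.dtype)
    return layer.register_forward_hook(hook)

# Hypothetical usage: handle = add_steering_hook(model.attn_layers[7], tempo_direction)
# ... run generation ...
# handle.remove()   # restores the unsteered model
```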
The authors introduce CAT (Causal Audio Tokenizer with Transformer), a fully Transformer-based architecture for end-to-end learning of discrete audio tokenizers, jointly optimizing the encoder, quantizer, and decoder. They then scale CAT to create MOSS-Audio-Tokenizer, a 1.6B parameter model pre-trained on 3M hours of audio data. Results demonstrate that MOSS-Audio-Tokenizer achieves state-of-the-art audio reconstruction across speech, sound, and music, and enables competitive ASR and autoregressive TTS performance.
Introduces and validates a scalable, fully end-to-end Transformer architecture (CAT) for discrete audio tokenization that outperforms existing codecs and enables downstream audio tasks.
The paper introduces Voxtral Realtime, a novel automatic speech recognition (ASR) model designed for native streaming with sub-second latency. Unlike chunking-based approaches, Voxtral Realtime is trained end-to-end for streaming with explicit audio-text alignment, leveraging the Delayed Streams Modeling framework. The model incorporates a new causal audio encoder and Ada RMS-Norm for improved delay conditioning, and achieves performance comparable to Whisper at a 480ms delay after large-scale pretraining across 13 languages.
Presents Voxtral Realtime, a natively streaming ASR model that matches offline transcription quality at sub-second latency through end-to-end training and explicit audio-text stream alignment.
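Ada RMS-Norm is only named above; one plausible reading, sketched below as an assumption rather than Voxtral Realtime's actual formulation, is an RMS normalization whose gain is predicted from a delay-conditioning embedding.

```python
import torch
import torch.nn as nn

class AdaRMSNorm(nn.Module):
    """RMS normalization whose per-channel scale is predicted from a conditioning
    vector (e.g. an embedding of the target streaming delay)."""
    def __init__(self, dim: int, cond_dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.to_scale = nn.Linear(cond_dim, dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        scale = 1.0 + self.to_scale(cond).unsqueeze(1)   # (batch, 1, dim)
        return x * rms * scale

norm = AdaRMSNorm(dim=512, cond_dim=64)
y = norm(torch.randn(2, 100, 512), torch.randn(2, 64))
```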
The paper introduces Stemphonic, a diffusion/flow-based framework for generating a variable set of synchronized music stems in a single inference pass, addressing the limitations of existing parallel or sequential stem generation methods. Stemphonic achieves this by treating each stem as a batch element during training, grouping synchronized stems, and applying a shared noise latent to each group, enabling efficient multi-stem generation conditioned on stem-specific text inputs. Experiments on open-source stem evaluation sets demonstrate that Stemphonic generates higher-quality outputs and accelerates full mix generation by 25-50%.
Introduces a novel diffusion/flow-based architecture, Stemphonic, that generates a variable number of synchronized music stems in a single pass, improving both generation speed and output quality compared to existing methods.
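The shared-noise grouping can be sketched as sampling one noise latent per song and repeating it across that song's stems, which then travel through the batch dimension together; the shapes below are illustrative only.

```python
import torch

def grouped_noise(num_groups: int, stems_per_group: int, shape) -> torch.Tensor:
    """Sample one noise latent per group (e.g. per song) and share it across the
    group's stems, which are laid out as ordinary batch elements."""
    base = torch.randn(num_groups, *shape)                  # one latent per song
    return base.repeat_interleave(stems_per_group, dim=0)   # (groups * stems, ...)

noise = grouped_noise(num_groups=2, stems_per_group=4, shape=(64, 256))
# Stems of the same song start from identical noise, so their generations stay synchronized.
```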
The paper introduces NarraScore, a hierarchical framework for generating soundtracks for long-form videos by leveraging emotion as a compressed representation of narrative logic. It uses frozen Vision-Language Models (VLMs) to extract Valence-Arousal trajectories from video and employs a Dual-Branch Injection strategy, consisting of a Global Semantic Anchor and a Token-Level Affective Adapter, to control musical dynamics. Experiments show that NarraScore achieves state-of-the-art consistency and narrative alignment with minimal computational cost.
Introduces a hierarchical framework, NarraScore, that leverages VLMs and a dual-branch injection strategy to generate narrative-aligned soundtracks for long-form videos.
The paper introduces MOVA, an open-source Mixture-of-Experts (MoE) model with 32B parameters (18B active) designed for synchronized video and audio generation. MOVA addresses the limitations of cascaded pipelines and closed-source systems by enabling simultaneous generation of high-quality audio-visual content, including lip-synced speech and environment-aware sound effects. The model supports Image-Text to Video-Audio (IT2VA) generation and is released with code for efficient inference, LoRA fine-tuning, and prompt enhancement.
Introduces MOVA, an open-source MoE model for synchronized video and audio generation, facilitating research and development in joint multimodal modeling.
This paper surveys the evolution of Emotion Recognition in Video (ERV) systems, tracing the shift from handcrafted features and task-specific deep learning models to transformer-based vision-language models and multimodal large language models (MLLMs). It analyzes multimodal fusion strategies, dataset characteristics, and evaluation protocols, while highlighting limitations related to robustness, bias, and annotation quality. The review compares task-specific models with foundation model approaches, clarifying their strengths and weaknesses for different application contexts, and outlines future research directions for robust and efficient ERV systems.
Systematically reviews and compares the evolution of emotion recognition in video, contrasting classical approaches with emerging multimodal large language model-based methods, and identifies key challenges for real-world deployment.
The authors introduce Muse, an open-source system for long-form song generation with fine-grained style conditioning, addressing the lack of reproducibility in academic research due to unavailable training data. They release a dataset of 116k fully licensed synthetic songs with lyrics and style descriptions paired with SunoV5-synthesized audio. Muse, a Qwen-based language model finetuned with discrete audio tokens, achieves competitive performance in phoneme error rate, text-music style similarity, and audio aesthetic quality, demonstrating controllable segment-level generation.
Releases Muse, a fully open-source system for long-form song generation, along with a licensed synthetic dataset and training/evaluation pipelines, to enable reproducible research.
LTX-2, a new open-source foundation model, generates high-quality, temporally synchronized audiovisual content by employing an asymmetric dual-stream transformer architecture with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers. The model incorporates temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning, along with a multilingual text encoder and modality-aware classifier-free guidance (modality-CFG) to improve audiovisual alignment and controllability. Evaluations demonstrate that LTX-2 achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, rivaling proprietary models with significantly reduced computational cost and inference time.
Introduces LTX-2, an efficient open-source audiovisual foundation model that achieves state-of-the-art quality with improved alignment and controllability through an asymmetric dual-stream transformer architecture and modality-aware classifier-free guidance.
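Modality-aware classifier-free guidance presumably applies a separate guidance weight to each stream; the combination rule below is a generic sketch under that assumption, not LTX-2's published formula.

```python
import torch

def modality_cfg(eps_uncond: dict, eps_cond: dict, scales: dict) -> dict:
    """Classifier-free guidance applied per modality, each stream with its own scale.
    eps_*: {'video': tensor, 'audio': tensor} model predictions."""
    return {
        m: eps_uncond[m] + scales[m] * (eps_cond[m] - eps_uncond[m])
        for m in eps_cond
    }

guided = modality_cfg(
    {"video": torch.zeros(1, 8), "audio": torch.zeros(1, 4)},
    {"video": torch.ones(1, 8), "audio": torch.ones(1, 4)},
    {"video": 6.0, "audio": 3.0},   # e.g. stronger guidance on the video stream
)
```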
The paper introduces AV-SpeakerBench, a new benchmark designed to evaluate fine-grained audiovisual reasoning capabilities of MLLMs, specifically focusing on understanding human speech in videos. The benchmark consists of 3,212 multiple-choice questions centered on speaker-centric reasoning, requiring models to align who speaks, what is said, and when it occurs. Experiments demonstrate that the Gemini family of models outperforms open-source alternatives, with Gemini 2.5 Pro achieving the highest accuracy, while highlighting the challenges open models face in audiovisual fusion.
Introduces AV-SpeakerBench, a novel benchmark for evaluating speaker-centric audiovisual reasoning in MLLMs, designed to assess fine-grained understanding of human speech in real-world videos.
The paper introduces TwinVoice, a multi-dimensional benchmark designed to evaluate the persona simulation capabilities of Large Language Models (LLMs) across social, interpersonal, and narrative contexts. TwinVoice decomposes persona simulation into six fundamental capabilities, including opinion consistency, memory recall, and syntactic style, providing a granular assessment framework. Experiments using TwinVoice reveal that while LLMs demonstrate moderate accuracy, they significantly underperform in areas like syntactic style and memory recall compared to human baselines, highlighting areas for future research.
Introduces TwinVoice, a novel benchmark for evaluating LLM-based persona simulation across diverse real-world contexts and decomposed into fundamental capabilities.
The paper introduces MMedFD, a new real-world Chinese healthcare ASR dataset designed for multi-turn, full-duplex conversations, addressing the scarcity of open benchmarks for clinical dialogue ASR. The dataset contains 5,805 annotated sessions with synchronized user and mixed-channel audio, along with RTTM/CTM timing and role labels. The authors also present a model-agnostic pipeline for streaming segmentation, speaker attribution, and dialogue memory, and fine-tune Whisper-small on role-concatenated audio, providing a reproducible benchmark for streaming ASR and end-to-end duplex agents.
Introduces MMedFD, the first publicly available real-world Chinese healthcare ASR corpus for multi-turn, full-duplex scenarios, complete with annotations, evaluation metrics, and a baseline pipeline.
The paper introduces FlexSED, an open-vocabulary sound event detection system that addresses the limitations of traditional multi-class SED frameworks by enabling free-text sound queries and improving zero/few-shot learning. FlexSED leverages a pretrained audio SSL model and the CLAP text encoder, incorporating an encoder-decoder architecture and adaptive fusion for continuous training. By using LLMs to refine event query selection for training, FlexSED achieves state-of-the-art performance on AudioSet-Strong and exhibits strong zero-shot and few-shot capabilities.
Introduces an open-vocabulary sound event detection system, FlexSED, which effectively integrates pretrained audio and text encoders with LLM-assisted training to achieve superior performance and generalization.
This paper introduces a LangGraph-based multimodal video understanding agent that leverages a state-machine architecture for intelligent routing and multi-task processing of video, audio, and subtitles. The authors fine-tune the Qwen2.5-VL model on a newly constructed multimodal video understanding dataset, which encompasses perception, temporal reasoning, and higher-level inference tasks. Results show significant performance improvements on video question answering and temporal localization, demonstrating the benefits of multi-task learning and intelligent routing.
Introduces a novel LangGraph-based agent architecture that intelligently routes and processes multimodal video data using a state-machine and multi-task fine-tuned VLM.
This paper introduces a large language model-based digital patient (LLMDP) system that converts de-identified electronic health records into interactive, voice-enabled virtual patients for ophthalmology training. The LLMDP system, built upon a retrieval-augmented framework, allows for free-text dialogue and adaptive feedback. A randomized controlled trial (N=84) demonstrated that students trained with the LLMDP system significantly improved their medical history-taking assessment scores and empathy compared to those using traditional methods, suggesting a scalable and effective approach to medical education.
Demonstrates that a large language model-based digital patient system significantly enhances medical history-taking skills and empathy in ophthalmology trainees compared to traditional training methods.
The authors introduce MERaLiON-AudioLLM, a large language model trained on 62 million multimodal instruction samples (260k hours of audio) to understand Singlish and perform diverse audio-based tasks. This model addresses the gap in region-specific AI capable of understanding colloquial and code-switched language. MERaLiON-AudioLLM demonstrates competitive performance in ASR, spoken question answering, speech translation, and paralinguistic analysis, particularly excelling in local speech recognition compared to existing open-source models.
Presents MERaLiON-AudioLLM, the first general-purpose, multitask audio-based LLM designed to understand Singlish.
This survey reviews the integration of large language models (LLMs) into robotic systems, focusing on locomotion, navigation, manipulation, and voice interaction. It highlights how LLMs translate natural language commands into robot actions, enable semantic planning, and support adaptive execution using frameworks like SayTap, TrustNavGPT, MapGPT, and 3D-LOTUS++. The review also covers training methodologies, benchmark datasets, and deployment architectures for bridging the sim-to-real gap and achieving cross-embodiment generalization in LLM-enhanced robotic systems.
Synthesizes a comprehensive overview of current approaches for integrating LLMs into robotic autonomy, categorizing methods by application area (locomotion, navigation, manipulation, voice) and highlighting key frameworks, datasets, and deployment considerations.
This paper addresses the underexplored NLP tasks of structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations, which are critical for reducing healthcare provider documentation burden. The authors evaluate the performance of both open- and closed-weight LLMs on these tasks using private and newly released open-source datasets (SYNUR and SIMORD). They also propose an agentic pipeline for generating realistic, non-sensitive nurse dictations to facilitate structured extraction of clinical observations.
Introduces SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction, respectively, and evaluates LLM performance on these tasks.
The authors introduce OpenS2S, a fully open-source end-to-end large speech language model (LSLM) for empathetic speech interactions. OpenS2S builds upon the BLSP-Emo model and uses a streaming interleaved decoding architecture for low-latency speech generation. To enable end-to-end training, they develop an automated data construction pipeline that leverages LLMs and controllable TTS systems to synthesize a diverse and scalable empathetic speech dialogue corpus.
Introduces OpenS2S, a fully open-source end-to-end LSLM, along with a novel automated data construction pipeline for generating empathetic speech dialogues, to promote transparent research in empathetic speech systems.
The authors introduce OMAR-RQ, an open-source music audio representation model trained on a large dataset (330k+ hours) using self-supervision via masked token prediction. The model explores different input features and quantization strategies to learn general-purpose music representations. OMAR-RQ achieves state-of-the-art performance among open self-supervised models on a suite of music understanding tasks, including tagging, pitch estimation, and beat tracking.
Introduces OMAR-RQ, a novel open-source self-supervised model for music audio representation trained with multi-feature masked token prediction, demonstrating superior performance compared to existing open models across various MIR tasks.
The paper introduces Ming-Omni, a unified multimodal model that processes images, text, audio, and video using modality-specific encoders and a Mixture-of-Experts (MoE) architecture called Ling with modality-specific routers. This architecture allows Ming-Omni to perform both perception and generation tasks across modalities without task-specific fine-tuning. The model achieves strong performance in speech and image generation through the integration of an advanced audio decoder and Ming-Lite-Uni, and is released as an open-source model matching GPT-4o in modality support.
Introduces a unified multimodal architecture, Ming-Omni, capable of both perception and generation across image, text, audio, and video modalities using a novel MoE routing mechanism.
This paper introduces a novel 5G signal processing method that combines a Transformer architecture for parallel computation with a GRU-based recurrent architecture to enhance accuracy and efficiency under time-varying channel conditions. The Transformer improves computational efficiency, while the GRU layer refines the convolutional features to better capture temporal dynamics. Experimental results demonstrate a 5 dB SNR improvement over traditional Decision Feedback Equalizers (DFE) while reducing computational complexity, even under severe inter-symbol interference and non-linear distortions.
Introduces a hybrid Transformer and GRU architecture for 5G signal processing that significantly improves both accuracy and computational efficiency compared to traditional DFE methods.
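A minimal hybrid of the two components the summary names, a Transformer encoder for parallel context modeling followed by a GRU for temporal dynamics, might look like the sketch below; layer sizes and the output head are arbitrary, and the paper's actual wiring may differ.

```python
import torch
import torch.nn as nn

class TransformerGRUEqualizer(nn.Module):
    """Transformer encoder for parallel context modeling, followed by a GRU that
    tracks time-varying channel dynamics, ending in a per-symbol estimate."""
    def __init__(self, dim: int = 64, heads: int = 4, layers: int = 2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 2)      # real/imaginary parts of the equalized symbol

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, symbols, dim)
        h = self.encoder(x)
        h, _ = self.gru(h)
        return self.head(h)

model = TransformerGRUEqualizer()
out = model(torch.randn(8, 128, 64))
```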
The paper introduces PAST, a new end-to-end framework for speech tokenization that jointly models phonetic information and signal reconstruction without relying on external pre-trained models. PAST leverages supervised phonetic data through auxiliary tasks to directly integrate phonetic domain knowledge into the tokenization process. The framework, including a streamable variant, demonstrates superior performance in phonetic representation, speech reconstruction, and as a speech representation for speech language models compared to existing baseline tokenizers.
Introduces an end-to-end trainable speech tokenizer, PAST, that integrates phonetic information directly via supervised learning, eliminating the need for pre-trained self-supervised models.
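The joint objective amounts to adding a supervised phonetic term to the usual codec reconstruction loss; the CTC-style auxiliary head and the weighting below are illustrative assumptions, not PAST's exact loss recipe.

```python
import torch
import torch.nn.functional as F

def past_style_loss(wav, wav_hat, phone_logits, phone_targets, in_lens, tgt_lens,
                    lambda_phone: float = 0.5) -> torch.Tensor:
    """Waveform reconstruction loss plus an auxiliary phonetic (CTC) loss computed
    from an intermediate representation of the tokenizer.
    phone_logits: (batch, frames, vocab); phone_targets: (batch, max_len)."""
    recon = F.l1_loss(wav_hat, wav)
    log_probs = F.log_softmax(phone_logits, dim=-1).transpose(0, 1)  # (frames, batch, vocab)
    phonetic = F.ctc_loss(log_probs, phone_targets, in_lens, tgt_lens, blank=0)
    return recon + lambda_phone * phonetic
```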
This paper introduces SPEAK PDF, a system that converts PDF documents into audio format with integrated translation and summarization features. The system uses TTS technology for audio conversion, machine translation for multilingual support, and NLP techniques for text summarization. The key result is a tool designed to improve accessibility, promote multilingualism, and enhance productivity by enabling efficient comprehension of PDF documents.
Introduces a system integrating text-to-speech, machine translation, and NLP-based summarization to enhance PDF accessibility and comprehension.
The paper introduces P2Mark, a novel plug-and-play parameter-level watermarking method for neural speech generation (NSG) models designed for open-source scenarios. P2Mark embeds watermarks directly into the model weights using a lightweight adapter during training, enabling pre-release watermark modification and post-release security. Experiments on vocoder and codec models demonstrate that P2Mark achieves comparable performance to existing audio watermarking techniques in terms of accuracy, imperceptibility, and robustness, while providing white-box protection.
Introduces a parameter-level watermarking technique, P2Mark, that embeds watermarks directly into the weights of neural speech generation models, enabling copyright protection in open-source settings.
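One way to read "parameter-level watermarking with a lightweight adapter" is a low-rank weight offset that is trained alongside a detector and then folded into the released weights; the sketch below shows only that merge step, with all names hypothetical and no claim about P2Mark's actual training scheme.

```python
import torch
import torch.nn as nn

class WatermarkAdapter(nn.Module):
    """Low-rank offset delta_W = B @ A added to a host linear layer's weight."""
    def __init__(self, out_features: int, in_features: int, rank: int = 4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))

    @torch.no_grad()
    def merge_into(self, linear: nn.Linear) -> None:
        """Fold the watermark offset into the weights before release."""
        linear.weight.add_(self.B @ self.A)

layer = nn.Linear(256, 256)
adapter = WatermarkAdapter(256, 256)
adapter.merge_into(layer)   # the published checkpoint carries the watermark in its parameters
```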
The paper presents a low-cost, embedded system for determining breathing rates using an IEEE 802.15.4z IR-UWB radar and a custom CNN architecture. The CNN predicts breathing rates directly from UWB channel impulse response (CIR) data, outperforming rule-based and model-based methods with a mean absolute error of 1.73 BPM, further reduced to 0.84 BPM with calibration data. The study demonstrates the feasibility of deploying the quantized CNN on a low-cost nRF52840 SoC with minimal performance degradation, achieving long battery life for continuous remote healthcare monitoring.
Demonstrates a highly efficient and accurate embedded system for breathing rate determination using a CNN-based approach on low-cost IR-UWB hardware.
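A small 1-D CNN regressor over CIR frames of the kind described could look like the sketch below; input dimensions and layer widths are placeholders, not the deployed network.

```python
import torch
import torch.nn as nn

class BreathingRateCNN(nn.Module):
    """Regress breaths per minute directly from a window of UWB channel impulse
    responses shaped (batch, range_bins, frames)."""
    def __init__(self, range_bins: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(range_bins, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1),
        )

    def forward(self, cir: torch.Tensor) -> torch.Tensor:
        return self.net(cir)          # predicted breathing rate in BPM

model = BreathingRateCNN()
bpm = model(torch.randn(4, 32, 600))  # e.g. 30 s of CIR frames at 20 Hz
```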
Audio Flamingo 2 (AF2) is introduced as an Audio-Language Model (ALM) that enhances audio understanding and reasoning by utilizing a custom CLAP model, synthetic Audio QA data, and a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance on over 20 benchmarks with a 3B parameter model, outperforming larger models. The work also introduces LongAudio, a new dataset for training ALMs on long audio segments (30 secs to 5 mins), and demonstrates exceptional performance on the LongAudioBench benchmark after fine-tuning AF2.
Introduces Audio Flamingo 2, an ALM with enhanced audio understanding and reasoning capabilities, and the LongAudio dataset and benchmark for long audio understanding.
This paper explores integrating generative AI into multimodal learning by fusing vision (CNNs), text (NLP), and audio (RNNs) data streams for enhanced human-computer interaction. The research demonstrates that this integrative approach improves interaction accuracy, system responsiveness, and user engagement compared to unimodal systems. The study also proposes adaptive weighting strategies and modular architectures to address challenges like cross-modal data alignment and computational demands.
Demonstrates the feasibility and benefits of a generative AI-driven multimodal learning framework that integrates vision, text, and audio for improved human-computer interaction.
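The adaptive weighting the summary mentions can be sketched as a learned softmax gate over per-modality embeddings; this is a generic fusion pattern, not the paper's specific architecture.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Fuse vision, text, and audio embeddings with input-dependent softmax weights."""
    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        self.gate = nn.Linear(dim * num_modalities, num_modalities)

    def forward(self, embeddings: list) -> torch.Tensor:
        stacked = torch.stack(embeddings, dim=1)                              # (batch, M, dim)
        weights = torch.softmax(self.gate(torch.cat(embeddings, dim=-1)), dim=-1)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)                   # (batch, dim)

fusion = AdaptiveFusion(dim=128)
fused = fusion([torch.randn(2, 128) for _ in range(3)])   # vision, text, audio embeddings
```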
The paper introduces SongGen, a single-stage auto-regressive transformer model for text-to-song generation that addresses limitations of multi-stage approaches. SongGen allows for fine-grained control over musical attributes like lyrics, instrumentation, genre, and timbre, and supports voice cloning via a reference clip. The model is trained with different token pattern strategies for mixed and dual-track output modes, and the authors demonstrate improved generation quality with their approach.
Introduces a novel single-stage auto-regressive transformer architecture, SongGen, for controllable text-to-song generation, enabling flexible output modes and fine-grained control over musical attributes.
The paper introduces MusicGen-Stem, a multi-stem autoregressive music generation model capable of generating and editing individual stems (bass, drums, other) and their mixtures. The authors train specialized compression algorithms for each stem to create parallel token streams and leverage music source separation techniques to train a multi-stream text-to-music language model on a large dataset. The model's conditioning method enables editing of individual stems in existing or generated songs, facilitating iterative composition.
Introduces a novel multi-stem autoregressive music generation model that allows for independent control and editing of individual stems (bass, drums, and other) within a musical composition.

