100 papers published across 5 labs.
Cepstral smoothing can significantly reduce musical noise artifacts in blind source separation of speech mixtures.
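The teaser doesn't include the paper's algorithm; as a minimal sketch of the general technique it names, the snippet below smooths a per-frame spectral gain mask in the cepstral domain, attenuating the fast spectral fluctuations that cause musical noise (function name and parameter values are illustrative, not the paper's implementation):

```python
import numpy as np

def cepstral_smooth_gain(gain, n_keep=20):
    """Smooth a spectral gain mask for one STFT frame in the cepstral domain.

    gain: (n_freq,) nonnegative gains; n_keep: number of low-quefrency
    cepstral coefficients to keep. High-quefrency coefficients encode the
    rapid, randomly fluctuating spectral peaks ("musical" tones).
    """
    log_gain = np.log(np.maximum(gain, 1e-8))
    cepstrum = np.fft.irfft(log_gain, n=2 * (len(gain) - 1))
    cepstrum[n_keep:-n_keep] = 0.0          # discard high-quefrency detail
    smoothed_log = np.fft.rfft(cepstrum)[: len(gain)].real
    return np.exp(smoothed_log)

# Toy usage: a noisy, spiky mask becomes a smooth gain curve.
rng = np.random.default_rng(0)
mask = np.clip(rng.random(257), 0.05, 1.0)
smooth = cepstral_smooth_gain(mask)
```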
The optimal spectrogram configuration for audio and speech analysis hinges on a nuanced interplay between front-end feature representation and back-end classifier architecture, varying significantly across tasks.
Quantifying the divergence between real and synthetic phoneme distributions via Kullback-Leibler divergence can pinpoint the most vulnerable phonemes for detecting audio deepfakes.
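As a hedged illustration of the stated approach (the phoneme inventory and counts below are made up), per-phoneme contributions to KL(real || synthetic) can be ranked to flag which phonemes diverge most between real and generated speech:

```python
import numpy as np
from scipy.stats import entropy

def phoneme_kl(real_counts, synth_counts, eps=1e-8):
    """KL(real || synthetic) over a shared phoneme inventory.

    real_counts / synth_counts: dicts mapping phoneme -> count.
    Returns the total divergence plus each phoneme's contribution,
    so the most divergent phonemes can be ranked.
    """
    phonemes = sorted(set(real_counts) | set(synth_counts))
    p = np.array([real_counts.get(ph, 0) for ph in phonemes], float) + eps
    q = np.array([synth_counts.get(ph, 0) for ph in phonemes], float) + eps
    p, q = p / p.sum(), q / q.sum()
    per_phoneme = p * np.log(p / q)          # summands of the KL divergence
    return entropy(p, q), dict(zip(phonemes, per_phoneme))

kl, contrib = phoneme_kl({"AA": 120, "IY": 80, "S": 40},
                         {"AA": 100, "IY": 60, "S": 90})
worst = max(contrib, key=contrib.get)        # largest positive contribution
```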
Achieve controllable and scalable speech generation with MOSS-TTS, enabling zero-shot voice cloning and long-form synthesis.
LLMs can extract consistent, multidimensional semantic information directly from the phonological structure of language, revealing a non-arbitrary relationship between sound and meaning.
Spotify's GLIDE model proves that generative LLMs can drive significant gains in podcast discovery and non-habitual listening in a real-world, production environment.
Counterintuitively, better speech recognition unlocks accurate Alzheimer's detection from simple text analysis, outperforming more complex acoustic models.
Stop struggling with the stability-plasticity dilemma in multilingual Speech-LLMs: Zipper-LoRA dynamically disentangles LoRA updates to boost low-resource ASR without sacrificing cross-lingual transfer.
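Zipper-LoRA's internals aren't given in the teaser; purely as a rough illustration of "disentangling LoRA updates" (all names, shapes, and the gating scheme below are hypothetical), one could route each input through a shared cross-lingual adapter and a language-specific adapter, mixed by a per-language gate:

```python
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    """Frozen base linear layer plus two low-rank adapters: one shared
    (cross-lingual) and one language-specific, mixed by a learned gate.
    Illustrative sketch only, not the paper's architecture."""

    def __init__(self, dim, rank=8, n_langs=4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)   # keep pretrained weights fixed
        self.base.bias.requires_grad_(False)
        self.shared_a = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.shared_b = nn.Parameter(torch.zeros(dim, rank))
        self.lang_a = nn.Parameter(torch.randn(n_langs, rank, dim) * 0.01)
        self.lang_b = nn.Parameter(torch.zeros(n_langs, dim, rank))
        self.gate = nn.Parameter(torch.zeros(n_langs))  # per-language mix

    def forward(self, x, lang_id):
        shared = (x @ self.shared_a.T) @ self.shared_b.T
        lang = (x @ self.lang_a[lang_id].T) @ self.lang_b[lang_id].T
        g = torch.sigmoid(self.gate[lang_id])
        return self.base(x) + g * lang + (1 - g) * shared

layer = DualLoRALinear(dim=64)
y = layer(torch.randn(2, 64), lang_id=1)
```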
Achieve single-pass alignment of multi-talker speech – a feat previously impossible – by modeling overlaps as shuffles.
Finally, a unified framework lets you control both facial appearance and voice timbre for personalized audio-video generation across multiple identities.
Interactive avatars can now exhibit more emotionally appropriate and contextually aware facial behaviors thanks to a novel architecture that disentangles audio-driven lip movements from user-driven non-lip facial expressions.
Adversarial training can effectively disentangle session-specific noise from task-relevant speech features in brain-computer interfaces, leading to more robust decoding across recording sessions.
Grounding LALM reasoning in diverse, reliability-weighted acoustic evidence blows away the competition in Audio Question Answering, proving that verifiable chains beat black boxes.
Sound source localization gets a reliability upgrade: conformal prediction delivers uncertainty estimates, even when you don't know how many speakers are talking.
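For readers unfamiliar with conformal prediction, here is a minimal split-conformal sketch applied to direction-of-arrival estimates (the angular score and synthetic data are illustrative assumptions, not the paper's setup): calibrate a radius on held-out data so that "true azimuth within prediction ± r" holds with the desired coverage.

```python
import numpy as np

def angular_error(pred, true):
    """Absolute angular difference in degrees, wrapped to [0, 180]."""
    d = np.abs(pred - true) % 360.0
    return np.minimum(d, 360.0 - d)

def conformal_doa_radius(cal_pred, cal_true, alpha=0.1):
    """Split conformal calibration for DOA estimates.

    cal_pred / cal_true: predicted and ground-truth azimuths (degrees)
    on a held-out calibration set. Returns radius r such that the true
    angle lies within pred +/- r with ~(1 - alpha) coverage.
    """
    scores = angular_error(cal_pred, cal_true)       # nonconformity scores
    n = len(scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n           # finite-sample quantile
    return np.quantile(scores, min(q, 1.0), method="higher")

rng = np.random.default_rng(1)
true = rng.uniform(0, 360, 500)
pred = (true + rng.normal(0, 5, 500)) % 360
r = conformal_doa_radius(pred, true, alpha=0.1)      # ~90% coverage radius
```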
Mimicking human cognition, FLAIR lets dialogue models "think while listening," boosting performance without adding latency.
Pre-training on nasal vs. oral context lets a simple model beat large pre-trained speech models at detecting speech disorders in noisy, real-world settings.
Even when visual data is missing or noisy, EgoAdapt accurately determines who is talking to the camera wearer by adaptively integrating head orientation, lip movement, and robust audio features.
Acoustic and phonetic NACs encode accent in fundamentally different ways, with implications for how we interpret and manipulate these representations.
Control the emotional tone of generated speech without any training by directly manipulating specific neurons within large audio-language models.
Imagine seeing your tongue move in real-time based on the sounds you make – AURORA brings that closer to reality.
Audio backdoor attacks leave a tell: triggers are surprisingly stable to destructive noise but fragile to meaning-preserving changes.
By explicitly modeling cardiac pathology, this ECG reconstruction method achieves a 76% reduction in error compared to existing techniques, promising more accurate diagnoses from portable devices.
Oral exams, previously impossible to scale, can now be delivered for pennies using voice AI, but controlling LLM behavior requires architectural guardrails, not just clever prompts.
Jointly training audio watermarking and source separation unlocks robust multi-stream watermarking, enabling independent tracking of individual audio components within a mix.
Ditch the separate models: CAST-TTS uses a single cross-attention mechanism to control TTS timbre from both speech and text, rivaling specialized models in quality.
Forget one-hot encodings: conditioning timbre VAEs on continuous perceptual features unlocks more compact and controllable latent spaces.
By forcing a model to reconstruct aggressively masked EEG spectrograms, SpecMoE learns intricate neural patterns across both high- and low-frequency domains, leading to state-of-the-art cross-species EEG decoding.
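SpecMoE's exact recipe isn't reproduced here; below is a generic masked-reconstruction training step for EEG spectrograms in the spirit the teaser describes, with aggressive random masking and the loss computed only on masked bins (the stand-in model and shapes are assumptions):

```python
import torch
import torch.nn as nn

def masked_recon_step(model, spec, mask_ratio=0.75):
    """One masked-reconstruction step on a batch of EEG spectrograms.

    spec: (batch, freq, time) log-power spectrograms. A random mask
    hides most time-frequency bins; the model must reconstruct them,
    and the MSE loss is taken on the masked bins only.
    """
    mask = torch.rand_like(spec) < mask_ratio        # True = hidden
    corrupted = spec.masked_fill(mask, 0.0)
    recon = model(corrupted)
    loss = ((recon - spec)[mask] ** 2).mean()        # MSE on masked bins
    return loss

# Toy stand-in for the encoder-decoder (SpecMoE itself is not shown).
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 64, 32 * 64),
                      nn.Unflatten(1, (32, 64)))
loss = masked_recon_step(model, torch.randn(8, 32, 64))
loss.backward()
```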
PyPhonPlan offers a new open-source toolkit to simulate speech dynamics with neurally-grounded representations, enabling researchers to model interactive speech production and perception loops.
ASR-assisted transcription doesn't automatically improve accuracy in corpus creation, and its effectiveness hinges on factors like workflow design and transcriber expertise.
Unlock timbre-aware generative AI with a new dataset linking semantic descriptors to electric guitar sounds, enabling nuanced control over audio synthesis.
Unfolding the EM algorithm into a neural network yields a speaker localization method that's more robust and accurate than traditional Batch-EM, especially in challenging acoustic conditions.
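As a sketch of what "unfolding EM into a network" means in general (the likelihood model and learnable temperatures below are illustrative, not the paper's design), each EM iteration becomes a layer with its own trainable parameters, so a fixed number of E/M steps can be optimized end to end:

```python
import torch
import torch.nn as nn

class UnfoldedEM(nn.Module):
    """EM for mixture weights over candidate source directions,
    unrolled into K 'layers' with a learnable temperature per layer."""

    def __init__(self, n_iters=5):
        super().__init__()
        self.temps = nn.Parameter(torch.ones(n_iters))  # one per iteration

    def forward(self, log_lik):
        """log_lik: (frames, n_candidates) per-frame log-likelihood of
        each candidate direction. Returns estimated mixture weights."""
        frames, n_cand = log_lik.shape
        w = torch.full((n_cand,), 1.0 / n_cand)
        for t in self.temps:
            # E-step: responsibilities, sharpened by a learned temperature.
            post = torch.softmax(t * (log_lik + torch.log(w)), dim=-1)
            # M-step: re-estimate direction weights from responsibilities.
            w = post.mean(dim=0).clamp_min(1e-8)
        return w

model = UnfoldedEM()
weights = model(torch.randn(100, 36))   # 36 candidate azimuths
```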
A shared encoder for targeted sound detection leaps past prior art, achieving a new state-of-the-art F1 score of 83.15% on URBAN-SED while simplifying the model architecture.
Stealthier over-the-air adversarial attacks on speech recognition are possible, but require careful balancing of audibility and effectiveness.
SER models, often assumed to generalize well to synthesized speech, actually fail miserably, revealing their reliance on spurious correlations rather than genuine emotional understanding.
Ditch DOA estimation: this new target speaker extraction method uses HRTFs to preserve spatial audio cues and boost speech quality.
A new spoken user simulator, SpokenUS, trained on a large-scale dataset, finally captures the messiness of real human conversation, including barge-ins and disfluencies, to better train dialogue agents.
Current Omni-modal LLMs can ace perception tasks but still fail at basic social interactions like knowing when and how to jump into a conversation.
A new smartphone protocol enables large-scale, privacy-preserving collection of prosodic speech data in the wild, opening doors to studying the subtle emotional nuances in everyday communication.
SpeechLLMs can be made significantly faster and more accurate at question answering by explicitly training their attention mechanisms to focus on relevant evidence.
OmniSONAR halves cross-lingual search error on FLORES and cuts it 15-fold on BIBLE, proving that truly universal sentence embeddings across thousands of languages and modalities are now within reach.
Get competitive multilingual ASR performance with 6x smaller models and 200x less training cost by using balanced fine-tuning and implicit language learning.
Robots can now use real-time environmental sounds to guide manipulation tasks, thanks to a new framework that overcomes the "Blind Execution Interval" of traditional vision-language-action models.
Speaker diarization in movies and TV shows just got a whole lot better, thanks to a new multimodal framework that uses visual cues, speech, and subtitles to handle the chaos of open-world video.
Forget painstakingly aligning audio and video – this diffusion model learns to generate them jointly, opening the door to more realistic and immersive multimodal experiences.
Forget static domain priors: the best way to rate AI-generated audio quality depends on *which* aspect of quality you're measuring.
An agentic framework slashes entity recognition errors in ASR by up to 46% by intelligently combining multiple ASR hypotheses and constrained LLM correction.
Recovering synthesizer parameters directly from audio is now possible with Instrumental, a system that combines a differentiable synthesizer with evolutionary optimization, opening new avenues for timbral analysis and manipulation.
By using text as an anchor, this model achieves state-of-the-art emotional mimicry intensity estimation, even when visual and acoustic data are noisy or missing.
A 97% accurate Romansh idiom classifier unlocks idiom-aware NLP tools for a low-resource language.
Speech enhancement doesn't always improve audio deepfake detection; in fact, algorithms that *reduce* perceptual speech quality can paradoxically lead to better spoof detection in noisy environments.
A new 320-hour corpus of French speech reveals how pronunciation has changed over six decades, including the surprising finding that voice pitch evolution doesn't differ by gender.
Efficient attention mechanisms like RetNet and LightNet can speed up Speech Emotion Recognition by an order of magnitude, but at the cost of some accuracy compared to standard self-attention.
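The order-of-magnitude speedup comes from replacing the quadratic softmax attention with a kernelized linear form. The sketch below contrasts the two (using the elu(x)+1 feature map of Katharopoulos et al. as a generic example; RetNet and LightNet differ in detail but share the O(N) structure):

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    """Standard attention: O(N^2) in sequence length N."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention: O(N) in sequence length."""
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = k.transpose(-2, -1) @ v            # (d, d) summary, never (N, N)
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps
    return (q @ kv) / z

x = torch.randn(1, 1000, 64)                # 1000 frames of SER features
out = linear_attention(x, x, x)             # cost grows linearly with frames
```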
By shifting the learning objective from direct spectral mapping to filter estimation based on inter-frame correlations, IF-CorrNet achieves state-of-the-art monaural speech dereverberation performance, particularly in real-world environments where generalization is critical.
MLLMs still can't handle time-sensitive multimodal reasoning, often failing to integrate auditory and visual cues effectively in dynamic environments like a 4D escape room.
Forget simply bolting on an LLM: this work reveals the surprisingly intricate dance between acoustic models and LLMs needed to unlock state-of-the-art speech recognition.
Finally, realistic and diverse listener reactions to speech can be automatically generated, moving beyond simple retrieval or LLM-driven approaches.
For live music performances, this work achieves zero-latency automatic music mixing using deep learning, a feat previously unachieved due to the challenges of acoustic bleed and synchronization constraints.
Current reward models for spoken dialogue systems are missing crucial paralinguistic and natural speech elements, but this new model closes the gap by operating directly on speech and outperforming existing audio LLMs.
Forget expensive ECG hardware: this dataset and benchmark show you can reconstruct clinically useful chest-lead ECGs from cheap vibrational sensors, but watch out for "hallucinated" heartbeats.
Achieve human-like full-duplex voice interactions with SoulX-Duplug, a plug-and-play module that slashes latency and improves turn management by acting as a semantic VAD.
Standardized evaluation of nonverbal vocalizations in TTS is now possible with NV-Bench, a new benchmark that treats NVs as communicative acts, not just acoustic artifacts.
Ditch hand-tuned beamformer combinations: a neural network with cross-attention learns spectrally coherent weights for improved target source extraction in noisy audio mixtures.
Overcome the scarcity of labeled data in dysarthric speech quality assessment with a novel data augmentation framework that leverages unlabeled data and outperforms state-of-the-art methods.
Ditch the text prompts: AC-Foley uses reference audio to synthesize video sound effects with unprecedented control, enabling timbre transfer and zero-shot generation.
Speech LLMs can now better understand your emotions: a new RL approach boosts paralinguistic understanding by 8-12% over state-of-the-art models.
Personalizing ASR for atypical speech gets a boost: pre-training on multi-speaker atypical data before speaker-specific fine-tuning significantly improves performance.
Rivaling English's GigaSpeech in scale, TAGARELA unlocks the potential for state-of-the-art Portuguese speech models with its nearly 9,000 hours of podcast audio.
Prompt engineering can significantly enhance ChatGPT's ability to provide balanced feedback and emotional support in ESL speaking practice, though culturally responsive teaching remains a challenge.
A new pipeline turns noisy, inconsistent open-source data into a 500-hour, high-quality Vietnamese ASR dataset, finally giving researchers a solid base for building better speech recognition.
Unleashing realistic 3D talking heads on *any* face scan, FreeTalk breaks free from template meshes and rigid topologies, even capturing nuanced emotional expressions.
Existing target speech extraction models falter when speech overlap varies, exhibiting suppression or residual interference, but VorTEX maintains high separation fidelity across a wide range of overlap ratios.
A sequential CNN-RNN architecture achieves 84% accuracy in classifying eight Nepali music genres, substantially outperforming classical machine learning methods and other deep learning architectures on a newly constructed dataset.
A new synthetic whispered speech corpus, WhispSynth, closes the data gap in text-to-whisper research by achieving naturalness scores on par with real recordings.
Current audio-language models are culturally tone-deaf: they can't even detect Persian poetry meter, despite crushing English speech tasks.
Text-derived "nudges" can steer the reasoning of speech-based AI models, boosting accuracy by up to 4.4% without any training.
Achieve accent normalization with interpretable and controllable accent strength by selectively reusing self-supervised speech tokens via masked discrete diffusion.
Persian poets exhibit distinct phonetic signatures that transcend meter and individual style, evolving across centuries with shifts in genre and literary context.
A flute-playing robot achieves automated fingering and register-dependent embouchure assistance without requiring human embouchure control, opening new avenues for musical instrument automation.
Injecting nonverbal cues like laughter and sighs into speech synthesis is now more expressive and natural, thanks to a novel training strategy that overcomes data scarcity.
Accented speech reveals perceptual biases in speech synthesis evaluation: listeners rate speakers with matching accents as more natural.
Achieve more natural and synchronized video dubbing by conditioning a discrete flow matching TTS model on facial expressions and cross-modal alignment.
Control speaking rate on the fly in your TTS system with VoXtream2, which hits 4x real-time speeds and 74ms latency.
Time-pooled dimension reshaping unlocks more efficient scaling of speaker verification models, achieving state-of-the-art accuracy on VoxCeleb1 at a fraction of the computational cost.
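The paper's exact operator isn't given in the teaser; one plausible reading, sketched below with illustrative shapes, pools adjacent frames and folds them into the channel axis, so later layers process shorter, wider sequences at roughly constant compute:

```python
import torch

def time_pool_reshape(x, pool=2):
    """Halve the time axis and fold the pooled frames into channels:
    (batch, time, dim) -> (batch, time // pool, dim * pool).
    Illustrative reading of 'time-pooled dimension reshaping'."""
    b, t, d = x.shape
    x = x[:, : t - t % pool]                 # drop ragged tail frames
    return x.reshape(b, -1, d * pool)

feats = torch.randn(4, 200, 256)             # 200 frames of speaker features
out = time_pool_reshape(feats)               # -> (4, 100, 512)
```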
Injecting user mood into music recommendation boosts perceived quality, proving that personalized listening experiences can be significantly improved by considering emotional state.
Despite the intuition that noisy environments should make models rely more on visual cues, AVSR models stubbornly cling to audio, even when it's heavily degraded.
Synthesize speech with unprecedented emotional control: a new causal training method lets you edit prosody "counterfactually" to express different emotions in the same utterance.
You can reliably decode frustration from facial muscle activity, even when people aren't speaking aloud.
LLMs are enabling silent speech interfaces to finally approach the word error rate threshold needed for real-world use by mapping fragmented physiological gestures into structured semantic latent spaces.
Ditch the heuristics: Hikari achieves state-of-the-art simultaneous speech translation by learning READ/WRITE decisions directly through a probabilistic WAIT token.
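Hikari's model isn't shown here, but the READ/WRITE protocol a WAIT token induces is simple to sketch: at each step the decoder either emits WAIT (read one more source frame) or a target token (write). The loop below uses a toy wait-k-style policy as a stand-in for the learned model; everything except the protocol itself is an assumption.

```python
WAIT = "<wait>"

def simultaneous_decode(step_fn, source, max_len=50):
    """Generic READ/WRITE loop driven by a WAIT token.

    step_fn(source_read, target_written) returns the next token:
    WAIT means READ (consume one more source item); any other token
    is a WRITE; None ends decoding.
    """
    read, written = 0, []
    while len(written) < max_len:
        tok = step_fn(source[:read], written)
        if tok is None:
            break
        if tok == WAIT:
            read = min(read + 1, len(source))   # READ one more source frame
        else:
            written.append(tok)                 # WRITE a target token
    return written

# Toy wait-k-style stand-in (k=2) for the learned policy.
src = list("hello")
def toy_policy(src_read, tgt_written):
    if len(src_read) < len(src) and len(src_read) < len(tgt_written) + 2:
        return WAIT
    if len(tgt_written) < len(src_read):
        return src_read[len(tgt_written)].upper()
    return None

out = simultaneous_decode(toy_policy, src)      # -> ['H','E','L','L','O']
```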
Forget rigid decision trees: a dynamically orchestrated agent slashes multimodal query processing costs by 67% while boosting speed and reducing rework.
Achieve state-of-the-art emotion recognition by fusing visual and audio cues with a bi-directional cross-attention mechanism, outperforming unimodal approaches.
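A generic version of bi-directional cross-attention fusion is easy to write down (the block structure, pooling, and class count below are assumptions, not the paper's exact architecture): audio queries video, video queries audio, and the two attended streams are pooled and concatenated for classification.

```python
import torch
import torch.nn as nn

class BiCrossAttentionFusion(nn.Module):
    """Fuse audio and visual sequences with two cross-attention passes."""

    def __init__(self, dim=256, heads=4, n_classes=7):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, audio, video):
        # audio: (B, Ta, dim), video: (B, Tv, dim)
        a_att, _ = self.a2v(audio, video, video)   # audio queries video
        v_att, _ = self.v2a(video, audio, audio)   # video queries audio
        pooled = torch.cat([a_att.mean(1), v_att.mean(1)], dim=-1)
        return self.head(pooled)

model = BiCrossAttentionFusion()
logits = model(torch.randn(2, 100, 256), torch.randn(2, 50, 256))
```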
LLMs can't tell when to shut up in multi-party conversations, but fine-tuning with reasoning traces can teach them some manners.
Achieve a 62.7% BLEU score boost in speech emotion captioning by offloading only the trickiest parts of the problem to the cloud.
Fine-tuning LALMs on just the right layers, guided by layer-wise analysis, unlocks better paralinguistic understanding than naively fine-tuning everything.
Forget MOS: a new preference-based metric, AnimeScore, finally cracks the code for automatically evaluating "anime-like" speech with 90.8% AUC.
By explicitly modeling and adapting to the reliability of audio and visual signals at different interaction stages, SAGE achieves more stable emotion estimation under cross-modal noise and occlusion.
DINOv2 visual features and Wav2Vec 2.0 audio features can be effectively fused in a two-stage model to achieve state-of-the-art facial expression recognition in challenging, unconstrained video conditions.
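A minimal two-stage version of this pipeline is sketched below: frozen pretrained encoders extract clip-level features, then a lightweight classifier is trained on their concatenation. The specific checkpoints, pooling, and class count are illustrative assumptions, not necessarily the paper's choices.

```python
import torch
from transformers import Wav2Vec2Model

# Stage 1: frozen pretrained encoders (checkpoints are illustrative).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

@torch.no_grad()
def extract(frames, waveform):
    """frames: (T, 3, 224, 224) face crops; waveform: (1, samples) at 16 kHz."""
    vis = dino(frames).mean(0)                        # (384,) clip-level visual
    aud = w2v(waveform).last_hidden_state.mean(1)[0]  # (768,) clip-level audio
    return torch.cat([vis, aud])                      # (1152,) fused feature

# Stage 2: a lightweight classifier trained on the fused features.
clf = torch.nn.Linear(384 + 768, 7)                   # e.g. 7 expression classes
feat = extract(torch.randn(8, 3, 224, 224), torch.randn(1, 16000))
logits = clf(feat)
```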
Expert-corrected phonetic transcriptions can approach the performance of MFCCs for vocal tract reconstruction from speech, suggesting phonetic information is a viable alternative to acoustic features.
You can reconstruct vocal tract shapes from clean speech almost as well as from noisy MRI recordings, opening the door to more practical articulatory analysis.
SEMamba++ significantly improves speech restoration by cleverly integrating frequency-domain inductive biases into a state-space model, outperforming existing methods while maintaining efficiency.