36 papers published across 5 labs.
Unlock the potential of full-duplex speech language models with Sommelier, a new open-source pipeline that tackles the messy reality of multi-speaker conversations.
Foundation models trained on audio, general time series, and brain signals can be distilled into a single, powerful encoder for scientific time series, unlocking performance gains on par with task-specific training.
Multi-corpus training can actually *hurt* spoofing detection, unless you strip out dataset-specific biases with this clever domain-invariant feature extraction trick.
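The "trick" here is in the family of domain-adversarial training: a shared encoder feeds both the spoofing classifier and an adversarial corpus classifier, with a gradient-reversal layer pushing the encoder toward corpus-invariant features. A minimal PyTorch sketch of that general recipe, not the paper's exact architecture; dimensions and head names are illustrative.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on backward,
    so the encoder learns to FOOL the corpus classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DomainInvariantSpoofNet(nn.Module):
    """Hypothetical dimensions and heads, for illustration only."""
    def __init__(self, in_dim=80, feat_dim=128, n_corpora=4, lam=1.0):
        super().__init__()
        self.lam = lam
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.spoof_head = nn.Linear(feat_dim, 2)           # bona fide vs. spoof
        self.corpus_head = nn.Linear(feat_dim, n_corpora)  # adversary

    def forward(self, x):
        z = self.encoder(x)
        # The corpus head trains normally, but the reversed gradient pushes
        # the encoder toward features that carry no dataset fingerprint.
        return self.spoof_head(z), self.corpus_head(GradReverse.apply(z, self.lam))
```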
Unsupervised phoneme discovery from self-supervised speech models is surprisingly viable, but language-specific challenges remain a significant hurdle.
Text-only pre-training leaves different LLMs with surprisingly different levels of latent auditory knowledge, which directly impacts their effectiveness as backbones for audio language models.
Synthesizing realistic room acoustics from a single recording is now possible, thanks to a novel flow-matching approach that captures the uncertainty inherent in acoustic environments.
Ditch one-hot vectors: representing facial action units as natural language unlocks more realistic and nuanced facial expression synthesis, especially when dealing with conflicting muscle movements.
You can predict viewers' engagement with and attraction to a video lecture just by analyzing the speaker's face and voice, no audience data needed.
Current OmniLLMs stumble when processing real-world, long-form audio-visual content, achieving only ~35-65% accuracy on a new benchmark designed to test long-term memory and fine-grained understanding.
Foundation models for EEG can now be 377x more efficient and handle 12x longer sequences, thanks to a novel Mamba-based architecture that also cracks the code for handling variable electrode setups.
LALMs still struggle to get the joke, with a new benchmark showing they can't reliably recognize, locate, or understand audio puns.
Humanoid robots can now generate more empathetic and instruction-aware gestures thanks to a new diffusion framework conditioned on affective estimation and pedagogical reasoning.
Training a DNN to recover a reverberant signal from an even *more* reverberant version surprisingly yields a model that also dereverberates the original signal.
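One way to read this result: the training pairs can be built self-supervised, Noisier2Noise-style, by convolving already-reverberant recordings with extra room impulse responses and asking the network to undo only the added reverb. A sketch of that pair construction, assuming a bank of RIRs is available; the paper's exact recipe may differ.

```python
import numpy as np
from scipy.signal import fftconvolve

def make_training_pair(reverberant, rir):
    """Target = the reverberant recording we already have; input = the
    same recording passed through an EXTRA room impulse response,
    i.e. an even more reverberant version."""
    more_reverberant = fftconvolve(reverberant, rir)[: len(reverberant)]
    more_reverberant /= np.abs(more_reverberant).max() + 1e-8  # loudness guard
    return more_reverberant, reverberant

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)                    # stand-in reverberant clip
rir = np.exp(-np.linspace(0, 8, 4000)) * rng.standard_normal(4000)
noisy_in, target = make_training_pair(x, rir)
# At inference, the trained DNN is applied to the ORIGINAL reverberant
# signal, where (per the paper) it reduces reverberation as well.
```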
Achieve state-of-the-art joint audio-video generation with fewer resources by fixing key flaws in cross-modal context handling within dual-stream transformers.
SLMs are shockingly vulnerable: combining adversarial audio and text unlocks 1.5x to 10x higher jailbreak rates than attacking either modality alone.
Unlock scalable cardio-sleep insights by repurposing ubiquitous single-lead ECG data for accurate sleep phenotyping, rivalling resource-intensive polysomnography.
Achieve controllable and scalable speech generation with MOSS-TTS, enabling zero-shot voice cloning and long-form synthesis.
LLMs can extract consistent, multidimensional semantic information directly from the phonological structure of language, revealing a non-arbitrary relationship between sound and meaning.
Spotify's GLIDE model proves that generative LLMs can drive significant gains in podcast discovery and non-habitual listening in a real-world, production environment.
Counterintuitively, better speech recognition unlocks highly accurate Alzheimer's detection from simple text analysis, outperforming more complex acoustic models.
Stop struggling with the stability-plasticity dilemma in multilingual Speech-LLMs: Zipper-LoRA dynamically disentangles LoRA updates to boost low-resource ASR without sacrificing cross-lingual transfer.
Achieve single-pass alignment of multi-talker speech – a feat previously impossible – by modeling overlaps as shuffles.
Finally, a unified framework lets you control both facial appearance and voice timbre for personalized audio-video generation across multiple identities.
Interactive avatars can now exhibit more emotionally appropriate and contextually aware facial behaviors thanks to a novel architecture that disentangles audio-driven lip movements from user-driven non-lip facial expressions.
Adversarial training can effectively disentangle session-specific noise from task-relevant speech features in brain-computer interfaces, leading to more robust decoding across recording sessions.
Grounding LALM reasoning in diverse, reliability-weighted acoustic evidence blows away the competition in Audio Question Answering, proving that verifiable chains beat black boxes.
Sound source localization gets a reliability upgrade: conformal prediction delivers uncertainty estimates, even when you don't know how many speakers are talking.
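Split conformal prediction is technique-agnostic, which is what makes this work: score a held-out calibration set, take a quantile, and you get prediction intervals with distribution-free coverage. A minimal sketch for direction-of-arrival estimates; the angular score and coverage level here are placeholders, not the paper's setup.

```python
import numpy as np

def angular_error(pred_deg, true_deg):
    """Smallest absolute angle (degrees) between predicted and true DoA."""
    d = np.abs(pred_deg - true_deg) % 360.0
    return np.minimum(d, 360.0 - d)

def conformal_radius(cal_pred, cal_true, alpha=0.1):
    """Split conformal: the finite-sample-corrected (1 - alpha) quantile of
    calibration errors gives a radius whose interval around any new estimate
    covers the true DoA with probability >= 1 - alpha (exchangeable data)."""
    scores = angular_error(np.asarray(cal_pred), np.asarray(cal_true))
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

# Prediction set for a new estimate theta_hat:
# [theta_hat - radius, theta_hat + radius] (mod 360).
```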
Mimicking human cognition, FLAIR lets dialogue models "think while listening," boosting performance without adding latency.
Pre-training on nasal vs. oral context lets a simple model beat large pre-trained speech models at detecting speech disorders in noisy, real-world settings.
Even when visual data is missing or noisy, EgoAdapt accurately determines who is talking to the camera wearer by adaptively integrating head orientation, lip movement, and robust audio features.
Acoustic and phonetic NACs encode accent in fundamentally different ways, with implications for how we interpret and manipulate these representations.
Control the emotional tone of generated speech without any training by directly manipulating specific neurons within large audio-language models.
Imagine seeing your tongue move in real-time based on the sounds you make – AURORA brings that closer to reality.
Audio backdoor attacks leave a tell: triggers are surprisingly stable to destructive noise but fragile to meaning-preserving changes.
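That asymmetry suggests a simple probe: hit a suspect input with destructive noise and with a meaning-preserving transform, and check whether the prediction survives each. A hedged sketch of such a test, not the paper's detector; `model`, the transforms, and the 0.8 threshold are all hypothetical placeholders.

```python
import numpy as np

def looks_like_trigger(model, audio, n_trials=10, snr_db=5.0, seed=0):
    """Flags a possible backdoor trigger: the prediction survives heavy
    additive noise (destructive) yet flips under a crude resampling
    speed/pitch change (meaning-preserving for most benign content).
    `model` is assumed to map a waveform to a class label."""
    rng = np.random.default_rng(seed)
    base = model(audio)

    # Destructive perturbation: strong white noise at the target SNR.
    noise_std = np.sqrt(np.mean(audio ** 2) / 10 ** (snr_db / 10))
    survives_noise = np.mean([
        model(audio + noise_std * rng.standard_normal(len(audio))) == base
        for _ in range(n_trials)
    ])

    # Meaning-preserving perturbation: naive 0.9x resample.
    idx = np.clip((np.arange(len(audio)) * 0.9).astype(int), 0, len(audio) - 1)
    survives_resample = model(audio[idx]) == base

    # Triggers, per the paper's observation: noise-stable, transform-fragile.
    return survives_noise > 0.8 and not survives_resample
```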
By explicitly modeling cardiac pathology, this ECG reconstruction method achieves a 76% reduction in error compared to existing techniques, promising more accurate diagnoses from portable devices.
Oral exams, previously impossible to scale, can now be delivered for pennies using voice AI, but controlling LLM behavior requires architectural guardrails, not just clever prompts.