42 papers published across 2 labs.
Achieve superior video generation quality and temporal coherence without expensive retraining by intelligently scaling and steering diffusion models at test time.
Fine-tuning a chord generation model on a new genre requires only a surprisingly small amount of old-genre data to prevent catastrophic forgetting, but objective metrics don't always capture subjective stylistic preferences.
Standard multimodal fusion can hurt performance in emotion recognition, but this new approach knows when to drop modalities, leading to state-of-the-art results.
Human crowdsourcing struggles to reliably identify audiovisual deepfakes, especially when both audio and video are manipulated, suggesting current detection methods may overestimate human capabilities.
Your innocent Spotify playlists are leaking surprisingly accurate predictions about your age, habits, and even personality traits, thanks to a new AI attack.
Turns out you only need to tweak a few key audio tokens to jailbreak audio language models, opening the door to faster, more targeted attacks.
E-graphs can help AI learn the unwritten rules of jazz harmony, mirroring how human musicians internalize complex musical patterns.
Unlock scalable, high-quality singing voice synthesis by directly generating structured musical scores from audio, outperforming existing systems on multiple datasets.
Audio diffusion models can be trained more efficiently by dynamically adjusting optimization strategies based on the evolving balance between semantic acquisition and fine-detail refinement during training.
LLMs can now evaluate audio as well as humans, without task-specific training, thanks to a new instruction-driven framework.
Audio-native LLMs still lag behind cascaded architectures in key audio tasks, suggesting that the multimodal promise of LLMs isn't quite ready for prime time in the sound domain.
Bio-inspired signal processing lets you hear subtle underwater sounds better than ever, achieving 98.41% accuracy in classifying targets even in noisy conditions.
Unlock near-oracle speech enhancement performance from compact microphone arrays by virtually expanding their spatial coverage with a novel neural network.
Aesthetic quality unlocks better generalization in AI-generated music popularity prediction, beating models trained solely on engagement metrics.
Even with domain adaptation, your keystrokes are still vulnerable to acoustic side-channel attacks across diverse keyboards, users, and noisy environments.
Automating stage lighting control across diverse venues is now possible without expert demonstrations, thanks to a novel imitation learning approach that decomposes global color distributions into individual light controls.
Make your ASR models 25% more accent-robust with this surprisingly simple contrastive loss trick.
Sound event detection gets a reality check: a new framework tackles the messy, unpredictable world of unseen sounds, not just the curated ones.
Stem retrieval accuracy leaps forward by 70% thanks to a new architecture that finally respects the phase of music.
Guaranteeing stable beamforming in dynamic acoustic environments is now possible with a novel adaptive diagonal loading method that strictly bounds White Noise Gain.
Forget federated learning: bioacoustic classifiers can be unified across 661 species by simply averaging independently trained task vectors, unlocking a collaborative, privacy-preserving paradigm.
Modern speech models struggle to generalize to noisy, domain-specific African speech, highlighting a critical gap for localized voice AI.
Stop wrestling with disparate tools and languages for music performance analysis: Cosmodoit offers a unified Python pipeline for efficient, large-scale feature extraction.
Fusing speech and environmental sound representations with a novel matching head and cross-attention network significantly boosts deepfake audio detection, surpassing previous benchmarks.
Dramatically extend the battery life of bioacoustic sensors by embedding a highly accurate CNN classifier directly on a microcontroller, enabling selective recording of target species.
Classical speech codecs still outperform neural codecs in noisy environments, but speech enhancement can close the gap.
Even without retraining, a simple dual-system approach can significantly boost the performance of self-supervised talking head forgery detectors by refining the ordering of uncertain samples.
Open-sourcing a 0.1B-scale speech-native omni model lets you directly inspect the complete interaction loop and reveals critical design choices for building effective small multimodal models.
Neck-Learn's hybrid architecture, combining gradient-boosted trees and CNN-based multiple instance learning, unlocks improved ambulatory detection of vocal hyperfunction by preserving crucial temporal dynamics in voice data.
Strong differential privacy can cause speech classifiers to collapse into near-useless single-class predictors, but a two-stage training process involving distillation can stabilize training.
Transfer learning from a large, pre-trained speech synthesis model unlocks high-quality Tibetan TTS, even with limited Tibetan-specific data.
Existing deepfake detectors crumble when faced with realistic, multi-region speech inpainting, leaving a gaping vulnerability that this work begins to address.
Stop letting speaker identity drown out semantic similarity: this new embedding method lets you independently control the influence of different speech attributes when comparing utterances.
Despite the promise of multimodal context, current audio-language models struggle to leverage clinical information for dysarthric speech recognition, even degrading performance in some cases.
Adversarial attacks on speech models leave tell-tale geometric fingerprints in early representation layers that can be detected without transcripts.
Synthetic data closes the Indic ASR gap where commercial and open-source systems fail, boosting entity recognition by up to 22x.
Forget separate structure and fidelity models – Khala shows you can generate high-quality music with text-vocal alignment using a single acoustic-token hierarchy.
Expressive piano performance rendering is improving, but RenCon 2025 reveals we're still far from replicating human musicality.
Margin loss fine-tuning of ECAPA-TDNNs slashes the error rate in spoken language identification by over 50%, highlighting the power of discriminative representation learning.
Current audio-visual models nail unimodal quality but still struggle to make music and dance move together rhythmically, highlighting a key gap TMD-Bench is designed to address.
Audio-visual models can be significantly improved by delaying perceptual commitment, correcting intermediate fusion states only when they have sufficient cross-layer and cross-modal support.
Speaker embeddings leak script information, especially when projecting Western voices into Indic scripts, but LASE fixes this with a language-adversarial training objective.