100 papers published across 8 labs.
G-STAR tackles long-form, multi-speaker ASR by giving Speech-LLMs time-aware speaker tracking, enabling robust identity linking across chunks.
Iteratively refining target speaker extraction *without* retraining a model unlocks significant performance gains, offering a flexible and efficient approach to speech separation.
Uncover the hidden vulnerabilities of your voice anti-spoofing model with a new tool that quantifies the probability of failure against unseen speech synthesis attacks.
Skip the training: SimulU achieves state-of-the-art simultaneous speech translation by cleverly exploiting pre-trained models, opening the door to truly plug-and-play multilingual communication.
Achieve near-perfect audio steganography even under heavy MP3 compression by optimizing latent reconstruction and diffusion inversion errors.
Forget paired video-music training data: V2M-Zero aligns video and music by matching the *timing* of changes within each modality, not the content itself.
You can now automatically isolate coughs from audio with 96% precision using just the first three layers of a pre-trained XLS-R model, paving the way for smartphone-based TB screening.
LLM-based ASR can be sped up by 4.4x with minimal accuracy loss by using a CTC encoder to speculatively generate draft transcriptions.
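For intuition, here is a minimal sketch of the draft-and-verify idea behind CTC-based speculative decoding (not the paper's code; `ctc_draft` and `llm_next_token` are hypothetical placeholders, and real implementations verify all draft tokens in a single batched LLM forward pass, which is where the speedup comes from):

```python
# Illustrative sketch of CTC-draft speculative decoding (not the paper's implementation).
# `ctc_draft` and `llm_next_token` are hypothetical callables: a fast CTC decoder that
# proposes a continuation, and the LLM's greedy next-token prediction used for verification.

def speculative_decode(audio, ctc_draft, llm_next_token, max_len=200):
    """Accept the longest draft prefix that the LLM would also have produced."""
    output = []
    while len(output) < max_len:
        draft = ctc_draft(audio, prefix=output)           # cheap draft continuation
        accepted = 0
        for tok in draft:
            # In practice all draft tokens are checked in one batched LLM pass;
            # a per-token call is shown here only to make the accept/reject logic explicit.
            if llm_next_token(audio, output) == tok:
                output.append(tok)
                accepted += 1
            else:
                break
        if accepted == 0:                                  # draft rejected: take one LLM step
            tok = llm_next_token(audio, output)
            if tok is None:                                # end of sequence
                break
            output.append(tok)
    return output
```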
LoRA fine-tuning can significantly boost the voice cloning capabilities of LLM-based TTS systems, but only if the training data is acoustically diverse enough.
Geospatial context is a surprisingly effective prior for audio tagging, especially when sounds are acoustically similar, leading to improved performance over audio-only methods.
Speech quality assessment is skewed: male listeners consistently give higher scores than female listeners, and standard MOS models learn and perpetuate this bias.
Explicitly aligning audio and video streams in a multimodal Transformer boosts emotion recognition, showing that ignoring frame-rate differences hurts performance.
Human-preference aligned audio generation from video is now possible, as V2A-DPO surpasses previous methods by directly optimizing for semantic consistency, temporal alignment, and perceptual quality.
LLMs can spot fake words in speech by recognizing common editing patterns, but this reliance on learned biases hinders generalization to new manipulation techniques.
Ditch slow, multi-step sampling for target speaker extraction: AlphaFlowTSE achieves faster, one-step generation with improved speaker similarity and real-world generalization.
Speech tokenizers, despite being crucial for multimodal LLMs, primarily capture phonetic information, creating a mismatch with text-derived semantics that hinders downstream performance.
Wearable sensors and speech AI can now unobtrusively reveal the hidden communication dynamics driving hospital caregiver workload and stress.
Speech deepfake detection gets a reasoning upgrade: HIR-SDD uses chain-of-thought prompting with Large Audio Language Models to not only detect fakes but also explain *why* it thinks they're fake.
Adapting ASR models to Huntington's Disease speech not only improves accuracy, but also reveals how biomarker-based supervision can reshape error patterns in ways that reflect disease severity.
Encoder-only multi-talker ASR can now rival LLM-based systems in accuracy while drastically reducing computational cost, thanks to a novel distillation approach and talker-count routing.
A single LLM can now handle both non-streaming and streaming ASR, opening the door to more flexible and efficient speech recognition systems.
You can slash ASR error rates in low-resource languages by over 60% with a simple continued pretraining recipe.
Imagine an XR experience where you can selectively isolate and enhance individual sound sources in real-time, making chaotic audio environments crystal clear.
A fully open-source speech understanding model, OSUM-Pangu, proves that competitive performance is achievable on non-CUDA hardware, challenging the dominance of GPU-centric ecosystems.
A single system now rivals or beats specialized models across ASR, voice activity detection, language ID, and punctuation, setting a new bar for industrial-grade speech processing.
Fair-Gate disentangles speaker identity and sex in voice biometrics, boosting fairness without sacrificing accuracy by explicitly routing features through identity-specific and sex-specific pathways.
Speech-aware LLMs are surprisingly bad at speaker verification, but a simple embedding injection trick closes the gap with dedicated systems while preserving the LLM's language abilities.
A nose-mounted microphone and vibration sensor combo unlocks robust, low-audibility speech interfaces for always-on AI interaction, even in noisy environments.
Current Large Audio Language Models (LALMs) struggle with basic audio understanding tasks like noise localization and cross-lingual speech, with some performing worse than random chance, despite excelling at speech recognition.
A Goldilocks zone exists for neural audio codec quantization depth, where intermediate levels strike the best balance between suppressing adversarial noise and preserving speech content for robust ASR.
Tired of LLM judges hallucinating when evaluating long, detailed speech captions? EmoSURA offers a more reliable, audio-grounded alternative by verifying atomic perceptual units.
LALMs struggle to handle multiple concurrent audio inputs, but a simple input permutation strategy can significantly boost their multi-audio understanding without retraining.
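As a rough illustration of what such a permutation strategy can look like (not necessarily the paper's method; `score_answer` is a hypothetical wrapper around an LALM):

```python
# Minimal sketch of permutation-averaged inference over multiple audio inputs.
# `score_answer` is a hypothetical callable returning the model's score for a
# candidate answer given the audio clips in a fixed presentation order.
from itertools import permutations

def permutation_averaged_score(audios, question, answer, score_answer):
    """Average the answer score over all orderings of the audio inputs, reducing
    sensitivity to the (arbitrary) order in which clips are presented.
    For many clips, a random subset of permutations can be sampled instead."""
    perms = list(permutations(audios))
    scores = [score_answer(list(p), question, answer) for p in perms]
    return sum(scores) / len(scores)
```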
Controllable emotion style transfer in speech is now possible without needing paired data, opening new avenues for data augmentation and expressive AI.
Statistical regularities in phoneme frequency distributions, previously thought to arise from optimization, may instead be natural consequences of diachronic sound change.
Spatial audio cues and directional priors can be jointly learned end-to-end to significantly boost keyword spotting accuracy in noisy environments, outperforming traditional cascaded approaches.
Unlock realistic acoustic simulations with a text prompt: fine-tuning a text-to-audio model generates plausible room impulse responses, even with limited paired data.
Modern speech enhancement algorithms may not improve ASR performance in realistic noisy environments, challenging assumptions about their effectiveness in real-world applications.
Finally, a single model that can generate both your face and voice, convincingly controlled by text prompts and reference clips.
Forget slow, iterative distributed signal estimation: dMWF achieves optimal multichannel Wiener filtering in wireless acoustic sensor networks without iteration, even when nodes observe different sources.
You can predict the best moment to offer emotional support just by listening to someone's voice, no text needed.
Double the emotion conversion accuracy in voice conversion models with a simple prefix that jointly controls sequence modulation and acoustic realization.
Unlock full-duplex speech-to-speech dialogue without VAD limitations using chunk-wise micro-turns and special control tokens to steer LLM behavior in a cascaded pipeline.
Text prompts might be inflating your SLLM's performance: spoken prompts reveal a significant performance gap, especially in low-resource languages.
Achieve comparable speech restoration quality with conditional diffusion models using 10x fewer neural network evaluations via a novel iSDE solver.
Transform unstructured audio-visual signals into machine-readable structured knowledge with the Logics-Parsing-Omni model, which enforces strict alignment between high-level semantics and low-level facts.
Forget tweaking knobs: this new Gram-matrix-based audio representation lets you *retrieve* the perfect, editable audio effect preset, outperforming standard methods.
A meticulously curated, bidirectional English-German corpus of parliamentary proceedings now offers researchers a goldmine for dissecting the nuances of translation, interpreting, and language variation through an information-theoretic lens.
By cleverly "self-rephrasing" LLM outputs, this work coaxes reasoning LLMs to handle audio inputs without sacrificing their chain-of-thought abilities.
Forget confidence scores: a modality-aware early exit strategy for spoken language models slashes decoding costs without sacrificing accuracy or perceptual quality, revealing that speech tokens require specialized handling compared to text.
Forget coarse-grained audio-visual tasks: RA-SSU offers frame-level sound source understanding with two new datasets and a transformer-based benchmark.
Forget black-box audio synthesis: this differentiable engine sound model gives you interpretable knobs to control physical parameters like valve dynamics and exhaust resonances.
Contrastive Decoding's power-up for audio language models hinges on fixing specific error types, like uncertainty and audio absence, but don't expect it to magically fix flawed reasoning.
Get state-of-the-art spoken QA performance by adding lightweight speech modules to frozen VL models and training on synthetically generated speech data, sidestepping the need for massive multimodal datasets.
Studio-quality speech enhancement without hallucination is now possible, thanks to a clever combination of dry-target finetuning and flow-matching.
VR agents that "listen" to your tone, not just your words, elicit significantly better user experiences.
By open-sourcing a fully reproducible, optimized Band-Split RNN for music separation, this paper reveals the surprisingly large gap between published results and what a faithful reimplementation can achieve, even with significant effort.
Forget wavelets: transformers with Koopman operator-derived features unlock superior ECG classification, especially in complex multi-class scenarios.
Mamba's superior sequence modeling lets you generate longer, more realistic dance sequences than clunky Transformers ever could.
Text-to-audio diffusion just got a whole lot faster: SoundWeaver slashes latency by up to 3x without retraining, simply by cleverly reusing similar audio samples.
Adversarial training and synthetic data can significantly boost multilingual speaker verification performance, even with limited training data.
A modular statistical transformation pipeline boosts audio deepfake detection accuracy by 10.7% in cross-domain scenarios, without needing labeled target data.
LoopLens reveals a stark divide in how musicians with and without domain expertise approach creative search for music loops, highlighting the need for vocabulary-independent discovery tools.
Open-source TTS gets a serious upgrade with Fish Audio S2, offering instruction-following control via natural language and production-ready streaming performance.
Speech LLMs, though lagging in accuracy, capture the nuances of human emotion perception better than traditional supervised methods, a finding revealed by the new VoxEmo benchmark.
Paralinguistic speech tasks aren't as language-agnostic as we thought: cross-lingual transfer patterns reveal systematic language dependencies.
Emirati Arabic finally gets a dedicated, sociolinguistically rich speech corpus, opening doors for better ASR/TTS in this low-resource language.
LALMs can now better capture the nuances of human emotion, moving beyond single-label predictions with a new ambiguity-aware training framework that aligns model outputs with the full spectrum of human perception.
Turns out your always-on speech dialogue model is leaking speaker identity like a sieve, but a simple feature-domain anonymization technique can boost privacy by 3.5x with minimal impact on performance.
Language models can beat FLAC for lossless audio compression at 8-bit and 16-bit, but their advantage shrinks at 24-bit, revealing a challenge for high-fidelity audio.
A new benchmark, PathBench, finally allows for standardized comparison of pathological speech assessment methods, revealing that the proposed Dual-ASR Articulatory Precision (DArtP) metric outperforms existing reference-free approaches.
Spectrograms beat MFCCs for South Asian sound classification, unlocking more accurate analysis of complex, overlapping urban soundscapes.
Ditch slow, sequential decoding: NLE achieves 27x speedup over autoregressive ASR by using a non-autoregressive, LLM-based transcript editing approach.
MLLMs can now reliably interpret electromagnetic signals even in noisy environments, thanks to a new training framework and benchmark designed specifically for this challenging domain.
A dual-branch Transformer with safe cross-attention overcomes missing visual cues in emotion recognition by dynamically relying on audio, achieving state-of-the-art results on Aff-Wild2.
Silence timeouts are out: DualTurn learns natural turn-taking from unlabeled dual-channel audio, outperforming larger models and anticipating turns more accurately.
Unlock AV speech recognition for any language, even with zero labeled video data, by training on synthetically generated talking-head videos.
Even when overall accuracy seems balanced, audio deepfake detection models can exhibit significant gender bias, masked by standard metrics like EER.
Speech models can now be quantized to INT4 with near-lossless performance thanks to a new evolution strategy-based calibration method tailored for audio activations.
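The paper's calibration procedure isn't reproduced here; the toy sketch below only shows the general idea of searching an INT4 quantization scale with a simple (1+1) evolution strategy on held-out activations:

```python
# Toy sketch: calibrating an INT4 quantization scale with a (1+1) evolution strategy
# (illustrative only; not the paper's algorithm, granularity, or hyperparameters).
import numpy as np

def quantize_int4(x, scale):
    q = np.clip(np.round(x / scale), -8, 7)        # signed 4-bit range [-8, 7]
    return q * scale

def es_calibrate_scale(activations, iters=200, sigma=0.05, seed=0):
    """Search a per-tensor scale minimizing MSE between activations and their
    INT4 reconstruction, using simple log-space Gaussian mutations."""
    rng = np.random.default_rng(seed)
    scale = np.abs(activations).max() / 7.0         # naive max-based starting point
    best_err = np.mean((activations - quantize_int4(activations, scale)) ** 2)
    for _ in range(iters):
        candidate = scale * np.exp(sigma * rng.standard_normal())   # mutate the scale
        err = np.mean((activations - quantize_int4(activations, candidate)) ** 2)
        if err < best_err:                           # keep the mutation only if it helps
            scale, best_err = candidate, err
    return scale

# Example on synthetic "activations":
acts = np.random.default_rng(1).standard_normal(10_000) * 3.0
print(es_calibrate_scale(acts))
```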
Now a single speech foundation model can generate diverse utterance-level representations, like semantics and speaker identity, opening new possibilities for multimodal and multilingual applications.
Multi-view Echo data can be used to train ECG encoders that are 18x smaller yet outperform larger models at predicting cardiac morphology.
Range-Null Space Decomposition offers a surprisingly effective and scalable approach to neural vocoders, outperforming existing methods while using a lightweight network structure.
Foley-Flow achieves state-of-the-art video-to-audio generation by aligning audio-visual representations with masked modeling, enabling rhythmic synchronization that was previously lacking.
A new benchmark reveals how existing audio-visual segmentation models crumble when faced with the dynamic, ever-changing audio and visual environments of the real world.
Uncover deepfakes by exploiting the tell-tale audio-visual inconsistencies embedded within generative models' cross-attention mechanisms.
By explicitly modeling speech, SAVE leapfrogs existing audio-visual methods for video-text retrieval, achieving substantial gains over the state-of-the-art.
Self-supervised and visually grounded models are closing the gap in explaining how infants learn language from raw acoustic and visual input, challenging the need for strong linguistic priors.
Achieve zero-shot voice conversion competitive with methods requiring more data or training, using a simple, invertible linear method to disentangle speech content from speaker timbre.
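The paper's exact linear method isn't spelled out in this summary; as one hedged illustration of what an invertible linear disentanglement can look like, a whiten-then-recolor transform on frame-level features is sketched below:

```python
# Illustrative sketch of an invertible linear "whiten then re-color" transform on
# frame-level speech features (a stand-in example; the paper's method may differ).
import numpy as np

def coloring_transform(features):
    """Return the mean and a Cholesky factor of the feature covariance (invertible)."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return mu, np.linalg.cholesky(cov)

def convert(source_feats, src_stats, tgt_stats):
    """Whiten with the source speaker's statistics, re-color with the target's."""
    mu_s, L_s = src_stats
    mu_t, L_t = tgt_stats
    whitened = np.linalg.solve(L_s, (source_feats - mu_s).T).T   # remove source timbre
    return whitened @ L_t.T + mu_t                               # impose target timbre
```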
Unlock whisper-to-normal speech conversion with a clever trick: synthesize whispered speech from readily available normal speech data to massively augment training.
Finally, realistic 3D avatars can maintain natural eye contact and spatial awareness during conversations, moving beyond disembodied "talking heads."
You can now poison a zero-shot TTS model to prevent it from generating speech for specific target speakers, but scaling this defense to a large number of speakers remains a challenge.
Unleashing the power of multi-view lip reading, this new framework lets you extract a target speaker's voice even from challenging, non-frontal video angles.
SLMs still lag behind omni language models in multi-turn conversational style control, as revealed by the new StyleBench benchmark.
Forget expensive, noisy recordings: this procedural engine sound dataset offers 19 hours of clean, annotated audio for training better automotive sound AI.
You can protect patient privacy and still detect Parkinson's from speech, but only if you choose the right anonymization method.
Achieve near-perfect accuracy in real-time malicious speech detection without sacrificing transcription speed, using a lightweight model built on Whisper.
Forget full fine-tuning: Low-rank adapters let you adapt speech enhancement models to new acoustic environments on-device, updating less than 1% of parameters for significant quality gains.
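As a generic illustration of the adapter idea (not the paper's architecture, placement, or ranks), a LoRA wrapper around a frozen linear layer might look like this:

```python
# Generic LoRA wrapper around a frozen linear layer (illustrative sketch only;
# layer sizes and rank below are arbitrary, not taken from the paper).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the pre-trained weights
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)   # trainable
        self.up = nn.Linear(rank, base.out_features, bias=False)    # trainable
        nn.init.zeros_(self.up.weight)         # adapter starts as an exact no-op
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * self.up(self.down(x))

# Only the low-rank adapters are updated; for this layer size the trainable
# fraction is already under 1%, and it shrinks further across a full model.
layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```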
Forget massive multilingual models: fine-tuning on just 5 hours of speech data from a related language slashes ASR error rates for an endangered language, rivaling the performance of Whisper-Small.
Zero-shot multilingual TTS models stumble when synthesizing Kashmiri, but a script-aware, flow-based adaptation strategy unlocks intelligible speech.
Achieve accent-specific speech synthesis without any accented training data by cleverly combining phonological rules with multilingual TTS.
Control the accent of your TTS output without needing any accented training data, by transferring accent characteristics from other languages.