65 papers published across 6 labs.
Local democracy's "public" input is heavily skewed towards older, whiter, more male, more liberal homeowners, and even removing remote access doesn't fix it.
Multimodal models can now achieve state-of-the-art performance in real-world tasks like document understanding and audio-video comprehension with significantly reduced inference latency thanks to novel token-reduction techniques.
Ditch the slow sampling: DriftSE achieves state-of-the-art speech enhancement in a single step, outperforming diffusion models with a novel equilibrium-based approach.
Segmenting music into meaningful chunks and predicting chords with a sequence-to-sequence model boosts recognition accuracy, especially for those pesky, rare non-triad chords that plague existing systems.
ASR systems can now be more trustworthy: this work shows how to train them to abstain from transcribing uncertain segments, leading to more reliable outputs.
Audio-Language models are cheating on benchmarks, acing tests even when they barely listen.
Shrinking massive audio foundation models by up to 61x is now possible without significant performance loss, thanks to a novel self-supervised distillation approach that works directly on embeddings.
Disentangling high-level cross-modal reasoning from low-level modality-specific refinement in talking head generation yields superior lip-sync accuracy, video quality, and audio quality compared to entangled approaches.
Finally, a TTS system that lets you control the *exact* timing and pauses of individual words, opening the door to applications like perfectly paced guided reading and accessible code narration.
Achieve state-of-the-art periodic signal denoising with a single, lightweight dilated CNN that generalizes across frequencies via resampling.
Counterintuitively, scaling up LLM decoders in speech recognition doesn't guarantee fairness; audio encoder design matters more, as Whisper's pathological hallucinations on Indian-accented speech and repetition loops under masking demonstrate.
SOTA audio QA models are getting punked by trivia questions a toddler could answer, revealing a stark gap between current capabilities and true audio understanding.
Pinpointing exactly *when* misinformation occurs in videos is now possible, thanks to two new datasets and a strong baseline for misinformation span detection.
Surprisingly, how speech degrades due to diseases like Parkinson's and ALS follows consistent patterns across languages, offering a universal fingerprint for these conditions.
Forget English – this study reveals which TTS systems truly resonate with native speakers across ten diverse Indian languages, pinpointing specific perceptual dimensions that drive preference.
Demystifying state-of-the-art speaker diarization just got easier: this tutorial breaks down the DiariZen pipeline block-by-block, complete with code, tensor shapes, and visualizations.
Basso continuo, a centuries-old improvised accompaniment, isn't just about following the rules – AI can now identify individual players by their unique stylistic fingerprints.
Current spoken dialogue systems struggle with the nuances of human conversation, but a new benchmark offers a path to more natural interactions by focusing on handling interruptions and overlapping speech.
Turns out where you look in Wav2vec 2.0's representations *really* matters: intelligibility lives in the layers, while articulation problems hide in the time steps.
Forget text-only pre-training: training on music *first* can dramatically accelerate language learning in small language models.
LLMs can judge speech recognition quality with near-human accuracy, blowing away traditional metrics like Word Error Rate.
Current audio-language models are surprisingly bad at controlling and interpreting subtle vocal cues, failing in nearly half of situational dialogue scenarios.
Current omnimodal models may excel in perceptual tasks but fundamentally misunderstand music theory, exposing critical reasoning flaws.
Speaker verification systems can be made significantly more robust to whispered speech by using a simple encoder-decoder architecture and a joint training objective.
Stuttered-speech research is missing the mark: a new study reveals a significant mismatch between current research priorities and the actual needs of people who stutter.
Finally, a practical OMR system can handle complex polyphonic music, like piano scores, by intelligently decoding visual symbols into editable scores.
Speakers expressing the same content with different emotions exhibit surprisingly consistent spatial-temporal correlations in their local facial animations, unlocking a new approach to speech-preserving facial expression manipulation.
A 3D-printable acoustic metamaterial can scramble your voiceprint at the physical layer, protecting your identity even when microphones are compromised.
Ditch your old music source separation (MSS) evaluation metrics: MERT-based embeddings correlate far better with human perception.
Treating sound scene and sound event deepfake detection as separate tasks dramatically improves performance, paving the way for more robust audio forensics.
LALMs struggle to ground their responses in audio, exhibiting surprising failures in temporal reasoning and music understanding that HalluAudio exposes.
Bridging the offline-streaming gap in ASR is now more achievable: a single RNN-Transducer model can deliver high accuracy in both settings, thanks to a novel consistency regularization technique.
Current ASR systems stumble significantly when faced with the nuances of real-world Indian speech, as revealed by a new benchmark exposing geographic performance disparities and the impact of audio quality, speaking rate, and device type.
Finally, you can puppeteer both the sights and sounds of AI-generated characters, controlling their identity, voice, pose, and scene with unprecedented precision.
Current human-robot interaction feels clunky because we lack the right development tools, so this work introduces a VR-based platform designed from the ground up to enable fluid error correction in Wizard-of-Oz robotic systems.
GaborNet, a Gabor filter-based front-end for raw audio processing, significantly boosts audio spoof detection accuracy in RawNet2 and RawGAT-ST architectures.
SpeechLLMs' hallucinations betray themselves in their attention patterns, offering a new way to detect these errors without needing expensive human-labeled data.
Smiling during traumatic recollection not only occurs in moments of distress but actively enhances emotional recovery and narrative coherence.
Music theory meets math: combinatorial geometry provides a surprisingly elegant framework for understanding and generating musical structures, from classical harmonies to 12-tone systems.
Autoregressive generative models, previously unsuitable for real-time target speaker extraction, can now achieve offline-level performance in streaming scenarios thanks to a novel chunk-wise splicing technique.
Forget complex event sequences: tokenizing music by uniform temporal beats unlocks better musical quality and structural coherence in generated music.
Seoul Korean pitch accent classification achieves state-of-the-art results by learning F0 contour representations with deep supervised contrastive learning, despite the inherent variability in real-world speech.
Finally, anime avatars can convincingly express a full range of emotions without losing their unique vocal identity.
Ditch the clunky pipeline: a single LLM can now handle all your audio front-end needs, slashing latency and boosting accuracy in full-duplex speech interactions.
Achieve state-of-the-art TTS with significantly fewer parameters by explicitly modeling temporal dynamics in a cascaded architecture that implicitly handles phonetic planning.
Finally, a dataset large and diverse enough to train robust models for Quranic speech research.
FreezeEmpath achieves superior empathetic dialogue capabilities without the need for costly finetuning, relying instead on frozen LLMs and existing data.
Time-frequency feature extraction via fractional Fourier transform unlocks surprisingly high-quality music generation from LSTMs.
Bias against certain speaker groups is embedded in self-supervised speech models from the very first layers, complicating efforts to achieve fairness in speech recognition tasks.
Phoneme recognition accuracy in low-resource languages hinges more on data availability than phonological complexity, revealing critical insights for ASR model development.
LLM-based ASR can be shrunk to 2.3B parameters and still beat larger models in real-world scenarios by carefully delineating encoder and LLM roles and using a multi-stage training approach.
CanonSLR achieves unprecedented robustness in sign language recognition by effectively bridging the gap between frontal and non-frontal viewpoints.
Open-source TTS models can beat commercial systems in specific languages, but current instruction-following TTS still struggles with complex instructions like nuanced paralinguistic controls.
SPARC features unlock more accurate and interpretable sEMG-based silent speech modeling compared to traditional phoneme representations.
LLMs can learn musicality without human annotation by aligning them to automatically generated preference datasets derived from rule-based musical constraints.
By explicitly verifying the visual existence of spoken references before segmentation, APRVOS substantially improves robustness in noisy audio-conditioned Ref-VOS, outperforming standard pipelines.
Forget supervised fine-tuning: RL alone can unlock high-quality chain-of-thought reasoning in audio-language models, even starting from a model with no prior CoT capability.
Multimodal LLMs aren't just for generation: they can dramatically improve audio-text retrieval robustness, especially when handling complex, real-world queries and acoustically similar distractors.
Control the groove: a latent-space Fourier transform lets you remix and blend musical styles by directly manipulating the frequency components of musical structure.
Bridging the gap between audio reconstruction and language modeling objectives yields neural audio codecs that are both more acoustically faithful and linguistically predictable.
Hebbian learning, often relegated to theory, can actually boost accuracy and stability in incremental audio classification tasks by selectively tuning network kernels.
Flash-SemiCRF slashes memory requirements for segment-level inference, making it feasible for genomic sequences over 100,000 positions.