50 papers published across 5 labs.
WER hides the real story: new metrics reveal how language model rescoring in ASR impacts grammatical correctness and semantic accuracy.
Current ASR metrics, even those leveraging embeddings, fail to align with human perception of transcription quality, as revealed by a new human-annotated dataset.
Achieve near-native Indic TTS from a non-Indic base model at zero commercial-training-data cost by cleverly combining phoneme space unification, LoRA adaptation, and voice-prompt recovery.
Commercial TTS systems nailing WER scores can still butcher Indic accents, especially retroflex articulation, and this new benchmark exposes exactly where they fail.
Speaker embeddings leak script information, especially when projecting Western voices into Indic scripts, but LASE fixes this with a language-adversarial training objective.
Ditch the static image: this method generates realistic talking avatars by learning from *videos* of the subject in completely different scenes.
Visual cues become crucial for speech recognition when audio quality tanks in this challenging new benchmark derived from real-world conversations.
Successfully converting accents requires balancing accent modification with speaker identity preservation, a challenge that this survey unpacks by tracing the evolution of techniques from DSP to neural methods.
Stuttering isn't random: you can predict severe blocks and sound repetitions from just 3 seconds of audio with a tiny model that runs on your phone.
LLMs can guide phoneme editing to create synthetic accented speech from just a handful of examples, substantially improving ASR accuracy where training data is scarce.
Integrating visual cues into a long-context ASR system slashes word error rate by 16% in multi-talker conversational recordings, proving the power of AV fusion.
Unbury speech from cinematic sound effects by teaching the model to "listen" for how words are formed.
Current deepfake detectors can be fooled by semantically inconsistent real audio and video, highlighting a critical blind spot in their ability to assess realistic manipulations.
Unlocking the full spectrum of animal sounds, previously discarded by standard audio models, can significantly improve bioacoustic classification.
AI sign language translation tools, despite their promise, may actually reinforce ableism by prioritizing technical standardization over the cultural and linguistic nuances of Deaf communication.
General American English ASR performance doesn't guarantee similar accuracy across other English accents, as revealed by a new multi-accent call center dataset.
Thai voice cloning just leapfrogged human performance on short-duration speech, thanks to a new model that directly handles code-switching and numerals.
Achieve significantly better room acoustics analysis by extending wavelet denoising to low frequencies.
Forget static emotion labels – EmoTransCap lets you generate speech captions that actually track how emotions evolve in a conversation.
Discover hidden biases in your speech datasets: this toolkit uses non-speech audio to reveal spurious correlations that inflate performance metrics.
Audio deepfake detectors trained on diffusion-reconstructed "hard" examples generalize far better to unseen attacks, slashing error rates compared to standard training.
Finally, voice anonymization offers a smooth, tunable knob to balance privacy and prosody, instead of forcing you to pick just one.
Depression leaves a detectable fingerprint in the way our vocal system revisits acoustic states during conversation, revealing new avenues for digital biomarkers.
Semantic priors in neural speech codecs hit a wall: their benefits plateau beyond 6 kbps, revealing a fundamental limit to improving intelligibility at higher bitrates.
Widely used emotion embedding similarity metrics for speech generation are more sensitive to speaker and linguistic features than actual emotion, rendering them unreliable for evaluating emotional expressiveness.
Adversarial training doesn't have to hurt speaker verification: by explicitly modeling language, you can disentangle speaker and language characteristics without sacrificing speaker discriminability.
Transferring phonetic knowledge from one language to another can dramatically improve automatic phonetic transcription, even enabling the recognition of entirely new phonetic features.
Reinforcement learning with verifiable rewards (RLVR), the dominant training paradigm for audio language models, may be turning them into unfeeling "answering machines" that excel on benchmarks but fail the vibe check.
Skip the bulky bidirectional teacher: this new method trains a fast, causal audio-video generator directly, slashing sampling steps while maintaining top-tier quality.
Speech emotion recognition (SER) aspires to power voice-activated healthcare, but that goal is undermined by datasets that bear little resemblance to real-world emotional expression.
Emotion recognition can be significantly improved by adapting to individual expressive traits, with ML-SAN outperforming static models in capturing nuanced emotional expressions.
Overcome the scarcity of paired data in speech-preserving facial expression manipulation by personalizing visual-language model prompts with individual visual information and correlating changes in visual and semantic features.
Explicitly modeling the dependency between dialogue context and current utterance as an "interpretation cue" significantly boosts conversational multimodal understanding.
WhisperPipe achieves 3-5x lower latency than existing streaming ASR solutions while consuming significantly less memory, making it a game-changer for real-time applications.
Semantic-level uncertainty estimation methods significantly enhance the reliability of audio-aware language models, outperforming traditional approaches in critical reasoning tasks.
Want to sound cute? Korean speakers systematically raise their F1 formant when using "aegyo" speech, effectively mimicking a smaller vocal tract.
SymphonyGen's 3D hierarchical approach to music generation lets you steer the overall structure of a symphony without sacrificing the richness and detail of the orchestration.
Angular similarity in supervised contrastive learning can match the performance of cosine similarity for deepfake audio detection, but with significantly less reliance on computationally expensive negative sampling.
Synthetically generated data from multi-model ensemble distillation can significantly boost the intelligibility of cross-lingual voice cloning systems for scientific speech without sacrificing speaker similarity.
Under-resourced languages can be accurately differentiated using rhythm alone, but combining rhythmic and spectral features unlocks even higher classification accuracy.
Speaker recognition accuracy improves dramatically when leveraging a U-Net-based fusion of noisy and enhanced speech, coupled with a novel training strategy.
Forget predictable AI tools – this performance co-creates music through entangled feedback loops between humans and AI instruments, blurring the lines of agency.
Multimodal models can now achieve state-of-the-art performance in real-world tasks like document understanding and audio-video comprehension with significantly reduced inference latency thanks to novel token-reduction techniques.
Ditch the slow sampling: DriftSE achieves state-of-the-art speech enhancement in a single step, outperforming diffusion models with a novel equilibrium-based approach.
Segmenting music into meaningful chunks and predicting chords with a sequence-to-sequence model boosts recognition accuracy, especially for those pesky, rare non-triad chords that plague existing systems.
ASR systems can now be more trustworthy: this work shows how to train them to abstain from transcribing uncertain segments, leading to more reliable outputs.
Audio-language models are cheating on benchmarks, acing tests even when they barely listen.
Shrinking massive audio foundation models by up to 61x is now possible without significant performance loss, thanks to a novel self-supervised distillation approach that works directly on embeddings.
Disentangling high-level cross-modal reasoning from low-level modality-specific refinement in talking head generation yields superior lip-sync accuracy, video quality, and audio quality compared to entangled approaches.