Search papers, labs, and topics across Lattice.
42 papers published across 2 labs.
Forget tedious manual editing: CutClaw's multi-agent system can automatically transform hours of raw footage into engaging, rhythm-aligned short videos.
Real-time vocal denoising is now possible with deep learning, achieving significant SNR improvements at under 10ms latency.
Northern Kurdish finally gets its due with FLEURS-Kobani, a new benchmark dataset that exposes the challenges and opportunities for ASR and speech translation in this under-resourced language.
Global speech slowing, a common strategy for improving intelligibility, is outperformed by targeted, data-driven speech rate adjustments that listeners don't even consciously notice.
LLMs can classify dialects with surprising accuracy when given linguistic hints, suggesting a new way to leverage their knowledge for low-resource language tasks.
Thiomi slashes Swahili ASR error rates by 61% and unlocks nine more African languages for multimodal AI, proving community-driven data collection can leapfrog existing benchmarks.
LLMs can achieve state-of-the-art multilingual speech recognition by smartly handling noisy phoneme inputs, even with severe data imbalance across languages.
The first publicly available dataset for Syrian Arabic Sign Language (SyArSL) opens the door for machine translation research to improve accessibility for a historically underserved community.
Current multimodal dialogue models struggle to capture the nuanced expressiveness of human interaction, but a new dataset and benchmark reveal exactly where they fall short.
Turn monaural video into immersive binaural audio with SIREN, a visually-guided framework that learns spatial audio cues without task-specific annotations.
Forget "spread" voicings: skewness is the key to clarity in piano chords, offering a fresh perspective on psychoacoustic principles.
Ditching mel-spectrograms unlocks surprisingly better text-to-speech, as LongCat-AudioDiT proves that waveform latent diffusion can beat the state-of-the-art in zero-shot voice cloning.
By disentangling speakers earlier in the process, SR-CorrNet avoids the information bottleneck that plagues existing speech separation models, leading to improved performance in challenging acoustic environments.
State-of-the-art Large Audio Language Models are surprisingly vulnerable to hallucination attacks, with success rates as high as 95%, revealing a critical reliability gap masked by standard benchmarks.
Arabic mispronunciation detection just got a whole lot better: F1-scores jumped by 0.28 thanks to novel architectures and a new dataset of authentic mispronunciations.
SONAR can "see" road damage and surface material even when cameras and LiDAR are blinded by rain or fog.
VLMs can unlock insights from troves of historical documents previously inaccessible due to OCR limitations, achieving state-of-the-art transcription and speaker tagging of Italian parliamentary speeches.
You can slash 7-14% of parameters from your SLAM-ASR system by pruning the Whisper encoder and using LoRA, even outperforming the original model in some cases.
Voice control, previously insufficient for block-based programming, can now enable children with motor disabilities to effectively use Scratch, thanks to a novel multi-stage speech recognition pipeline.
Forget disjointed workflows: AutoCut's unified token space for video, audio, and text slashes ad production costs while boosting consistency.
Diffusion models can now reliably fill in the gaps in real-world spatial audio data, boosting the performance of microphone arrays.
Unlock a complete picture of vocal tract articulation from speech using MRI data, surpassing the limitations of traditional sensor-based methods.
Forget hand-tuning for each language: this recipe achieves state-of-the-art phone recognition across 100+ languages, revealing the surprising power of scaling data and SSL representations.
Style-controllable speech synthesis just got a major upgrade: ParaSpeechCLAP lets you dial in nuanced speaker traits and situational contexts far beyond what existing models can handle.
Finally, a way to represent the messy, collaborative syntax of real spoken language in treebanks.
Now you can turn a single image into a navigable 3D world complete with spatial audio, opening the door to richer immersive experiences.
Evolving interpretable composite features via Genetic Programming beats black-box deep learning at music tagging, revealing synergistic interactions and transformations that boost performance.
Scale expert know-how in tool-intensive industrial workflows with a voice-guided system that cuts process time and boosts repeatability.
VR telepresence in Wizard-of-Oz studies doesn't just feel more immersive; it fundamentally changes the interaction dynamics, fostering stronger social connections and more natural conversational flow than traditional GUI-based interfaces.
Grounding audio language models with acoustic feature representations unlocks more accurate and explainable deepfake detection, even with smaller models.
Achieve competitive speech enhancement with a highly compact (85-parameter) probabilistic model that continuously adapts to the user and environment, suggesting a path toward truly personalized and adaptive hearing aids.
LALMs leak speaker identity by memorizing the link between voice and text, not just the content of speech.
Cinematic speech data unlocks more realistic and controllable voice generation from natural language descriptions.
Skip expensive human ratings: this hierarchical multimodal model accurately predicts human perception of AI-dubbed content quality using only audio, video, and text inputs.
VAANI's open-sourced dataset offers unprecedented coverage of India's linguistic landscape, finally giving researchers the data needed to build truly inclusive speech models.
Ditch the grid: BiFormer3D uses a spatial-encoding Transformer to reconstruct personalized 3D audio from sparse measurements, outperforming prior art without relying on frequency-domain hacks or minimum-phase assumptions.
LALMs still struggle to truly "hear" music, as revealed by a new expert-curated benchmark that exposes their reliance on non-musical shortcuts.
LLMs can achieve state-of-the-art audio-visual segmentation without any training by using a multi-agent system that explicitly reasons about expression difficulty and validates segmentation results.
Despite progress, accurately transcribing music with multiple instruments, complex polyphony, and diverse timbres remains a significant hurdle for AI.
LALMs struggle more with "hearing" the evidence than with reasoning about it, and EvA's evidence-first fusion architecture proves it.
Balancing the diversity of real and AI-generated speech data is the key to building deepfake detectors that actually generalize.
Turns out your fancy speech recognition model might stumble after a workout: performance degrades significantly on post-exercise speech, and the best model varies depending on whether you fine-tune it.