The University of Tokyo
SMC-ITA, an inference-time alignment method for video-to-audio generation, reduces audio-video desynchronization by 55.67%.
Medical-specific vision-language models underutilize visual information on Japanese medical licensing exams: they often perform well even when the images are removed, which points to a gap in their multimodal reasoning.