Yanfeng Wang

Training a multimodal agent from scratch beats retrofitting existing LMMs with search tools, especially when you compress long interaction histories into visual summaries.

Yikun Liu, Yu-An Liu, Le Tian +4

Multimodal Models Recommendation & Information Retrieval Tool Use & Agents

Apr 14, 2026

1w ago·also SJTU

SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding

ALMs can now pinpoint sounds in time with far greater accuracy, thanks to a new training method that stops them from hallucinating timestamps.

Luoyi Sun, Xiao Zhou +6

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

1w ago

A Sanity Check on Composed Image Retrieval

Current Composed Image Retrieval benchmarks are misleading, as a new evaluation reveals that models struggle with query ambiguity and interactive scenarios.

Yikun Liu, Jiangchao Yao, Weidi Xie +1

Computer Vision Eval Frameworks & Benchmarks Multimodal Models +1

Apr 13, 2026

1w ago·also AI Laboratory, Department of Radiology, SJTU

Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach

Injecting rare disease knowledge into data synthesis and applying semi-supervised RL on pseudo-labels improves medical reasoning in LLMs, outperforming existing methods by up to 5.93% on rare disease tasks.

Haolin Li, Shuyang Jiang, Ruipeng Zhang +2

Reasoning & Chain-of-Thought RLHF & Preference Learning Scientific Discovery & Drug Design

1w ago·also WeChat AI

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

MLLMs can achieve near-identical performance on long-form visual tasks with just 2.5% of the original visual tokens by mimicking human visual attention.

Haicheng Wang, Yu-An Liu, Yikun Liu +7

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Multimodal Models

1w ago·also NJU

GenTac: Generative Modeling and Forecasting of Soccer Tactics

Soccer tactics, previously viewed as too stochastic for accurate modeling, can now be realistically simulated with a diffusion model that captures nuanced team styles and predicts future outcomes.

Jiayuan Rao, Tianlin Gui, Haoning Wu +2

Computer Vision World Models & Planning

Apr 7, 2026

2w ago·also SJTU

Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs

Omni-LLMs struggle to identify the same objects across different modalities, but a new dataset and training strategies can significantly improve their cross-modal reasoning.

Hongcheng Liu, Zhe Chen, Pingjie Wang +3

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought