D plane. For video or event sequences, a temporal position embedding $E_{\text{temp}}$ is additionally incorporated to capture the sequential order of the frames. To keep modalities distinguishable within the unified feature space, tokens from auxiliary modalities, such as event streams, are marked with a unique modality type embedding $E_{\text{mod}}$. Furthermore, to handle temporal inputs with multiple frames efficiently, we introduce a lightweight Time Adapter. This adapter, composed of a multi-layer perceptron, fuses and compresses features from multiple frame tokens, significantly improving computational efficiency while preserving key dynamic information. Through this series of operations, any form of visual input is standardized into an information-rich visual token sequence $F_{\text{vis}}$.
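The Time Adapter idea above can be sketched in a few lines. The following is a minimal, hypothetical numpy implementation assuming the adapter concatenates the per-frame token features and passes them through a two-layer MLP that compresses them back to a single token of the original dimension; the class name, shapes, and hidden size are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class TimeAdapter:
    """Hypothetical sketch: fuse T per-frame tokens (dim D) into one token."""

    def __init__(self, num_frames, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        # MLP weights: (T * D) -> hidden -> D
        self.w1 = rng.standard_normal((num_frames * dim, hidden)) * 0.02
        self.b1 = np.zeros(hidden)
        self.w2 = rng.standard_normal((hidden, dim)) * 0.02
        self.b2 = np.zeros(dim)

    def __call__(self, frames):
        # frames: (N, T, D) features for N token positions over T frames
        n, t, d = frames.shape
        x = frames.reshape(n, t * d)  # concatenate along the time axis
        # Two-layer MLP compresses T frames into a single D-dim token
        return relu(x @ self.w1 + self.b1) @ self.w2 + self.b2  # (N, D)

adapter = TimeAdapter(num_frames=4, dim=8, hidden=16)
tokens = np.ones((3, 4, 8))   # 3 token positions, 4 frames, dim 8
fused = adapter(tokens)
print(fused.shape)            # (3, 8): four frames fused per position
```

Compressing the frame axis before the Transformer backbone is what yields the efficiency gain: the sequence length no longer grows with the number of frames.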
Forget training separate models for each pedestrian attribute dataset: a single Transformer can now handle RGB images, video sequences, and even event streams with accuracy comparable to specialized methods.
LLMs can navigate massive chemical spaces and enforce toxicity filters in drug discovery, but only if you constrain them with a dual-layer architecture that combines free-form reasoning with structured execution.