D plane. For video or event sequences, a temporal position embedding $E_{\text{temp}}$ is additionally incorporated to capture the sequential order of the frames. To keep modalities distinguishable within the unified feature space, tokens from auxiliary modalities, such as event streams, are marked with a unique modality-type embedding $E_{\text{mod}}$. Furthermore, to handle temporal inputs with multiple frames efficiently, we introduce a lightweight Time Adapter. This adapter, composed of a multi-layer perceptron, fuses and compresses features from multiple frame tokens, significantly improving computational efficiency while preserving key dynamic information. Through this series of operations, any form of visual input is standardized into an information-rich visual token sequence $F_{\text{vis}}$.
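The Time Adapter described above can be sketched as a small MLP that fuses the tokens of $T$ frames into a single compressed token set. The sketch below is a minimal NumPy illustration under assumed shapes; all names, dimensions, and the concatenate-then-compress fusion strategy are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def time_adapter(frame_tokens, w1, b1, w2, b2):
    """Hypothetical Time Adapter sketch: fuse T frames of visual tokens
    into one compressed token sequence with a two-layer MLP.

    frame_tokens: array of shape (T, N, D) -- T frames, N tokens per
    frame, D channels per token.  Returns an (N, D) array, i.e. the
    multi-frame input compressed back to a single frame's token count.
    """
    T, N, D = frame_tokens.shape
    # Stack each spatial token's features across time: (N, T*D)
    fused = frame_tokens.transpose(1, 0, 2).reshape(N, T * D)
    # MLP compresses T*D channels back to D, keeping dynamic information
    h = np.maximum(fused @ w1 + b1, 0.0)  # ReLU hidden layer
    return h @ w2 + b2                    # (N, D) compressed tokens

# Toy dimensions (illustrative assumptions only)
T, N, D, H = 4, 16, 32, 64
rng = np.random.default_rng(0)
tokens = rng.standard_normal((T, N, D))
w1 = rng.standard_normal((T * D, H)) * 0.02
b1 = np.zeros(H)
w2 = rng.standard_normal((H, D)) * 0.02
b2 = np.zeros(D)

out = time_adapter(tokens, w1, b1, w2, b2)
print(out.shape)  # (16, 32): four frames fused into one token set
```

The payoff is that downstream attention layers see N tokens instead of T×N, which is where the claimed computational saving comes from.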
Forget simplistic synthetic data: ChartVerse generates complex charts and reliable reasoning data from scratch, enabling an 8B model to outperform its 30B teacher in chart reasoning.