Ziyong Feng

LLaVA-OV-2's codec-stream tokenization lets it outperform existing video-language models, especially on tasks requiring fine-grained temporal understanding of high-frequency motion.

Xiang An, Yin Xie, Feilong Tang +26

Computer Vision · Eval Frameworks & Benchmarks · Multimodal Models

Apr 16, 2026

Shuo Tan +7 · Apr 16, 2026

UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Forget generic retrieval signals – UniDoc-RL uses reinforcement learning to teach LVLMs how to actively perceive and reason about visual information, yielding a 17.7% performance boost.

Shuo Tan, Zelong Sun, Tiancheng Gu +5

Multimodal Models · Recommendation & Information Retrieval · Tool Use & Agents

Feb 9, 2026

Feb 9, 2026 · also Imperial

OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

Turns out, skipping the boring parts of a video (like static backgrounds) makes your vision AI both faster and smarter, beating state-of-the-art models with less data.

Feilong Tang, Xiang An, Yunyao Yan +15

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Multimodal Models



Papers (4)