Long Chen

SPG-Layout generates physically plausible 3D scene layouts in non-Manhattan environments, outperforming existing methods.

Xianhui Meng, Zirui Song, Yuchen Zhang +8

Computer Vision Natural Language Processing World Models & Planning

Zhenqi He +5·3w ago·also HKUST, Xiaomi EV

FlowCIR: Semantic Transport via Flow Matching for Zero-Shot Composed Image Retrieval

FlowCIR slashes training resource requirements by 90% while boosting robustness against negation in zero-shot image retrieval tasks.

Zhenqi He, Ziqi Jiang, Yuanpei Liu +3

Multimodal Models Recommendation & Information Retrieval

Jun 29, 2026

Kien T. Pham +2·3w ago·also Xiaomi EV

AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

AVTok achieves superior audio-video synchronization and reconstruction, setting a new standard for unified multimodal generation.

Kien T. Pham, I Chieh Chen, Long Chen

Multimodal Models Speech & Audio

Jun 25, 2026

Jun 25, 2026·also Xiaomi EV

LISA: Likelihood Score Alignment for Visual-condition Controllable Generation

LISA accelerates training and enhances output quality in visual-condition generation by aligning side network features with likelihood scores, all without extra inference costs.

Yanghao Wang, Hongxu Chen, Jiazhen Liu +2

Computer Vision Multimodal Models Training Efficiency & Optimization

Jun 24, 2026

Xiaomi EV·Jun 24, 2026·also Amazon Science, Tsinghua AI, Cisco Systems, UMN

Is GraphRAG Needed? From Basic RAG to Graph-/Agentic Solutions with Context Optimization

Advanced RAG methods like GraphRAG and Agentic RAG can reduce token usage by up to 53%, but they don't always enhance generation quality as expected.

Long Chen, Ryan Razkenari, Yuxuan Zhou +4

Eval Frameworks & Benchmarks Recommendation & Information Retrieval Tool Use & Agents

Jun 22, 2026

Jun 22, 2026·also Xiaohongshu, Xiaomi EV

SPAR: Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Multimodal Models

SPAR bridges the gap between semantic perception and pixel-level generation, improving the quality of visual outputs without external supervision.

Hongxiang Li, Hongxu Chen, Xiaoshuang Huang +4

Computer Vision Multimodal Models

Jun 8, 2026

Tsinghua AI·Jun 8, 2026·also CAS, USTC, Xiaomi EV

CP4D: Compositional Physics-aware 4D Scene Generation

CP4D achieves photorealistic 4D scene generation by seamlessly integrating static environments with dynamic objects, outperforming existing methods in visual fidelity and physical consistency.

Hanxin Zhu, Long Chen, Zhibo Chen

Computer Vision World Models & Planning

May 31, 2026

May 31, 2026·also Tsinghua AI, Ant Group, HKUST, PKU +1

OneVLA: A Unified Framework for Embodied Tasks

OneVLA unifies navigation and manipulation tasks in a single framework, letting robots interpret commands and interact with their environments.

Yingbo Tang, Lei Zhou, Shuyi Zhang +4

Multimodal Models Robotics & Embodied AI

May 27, 2026

May 27, 2026·also Soochow, Xiaomi EV

SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs

Fine-grained 3D object grounding gets a boost: SSR3D-LLM uses latent spatial reasoning steps to iteratively refine candidate rankings, outperforming single-pointer methods and setting a new standard for unified 3D-LLMs.

Ziyi Liu, Weijie Shi, Long Chen +2

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

May 6, 2026

May 6, 2026·also Huawei, Xiaomi EV

Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation

Decoupling radial and angular dynamics in vision-language model adaptation unlocks significant gains in few-shot performance, outperforming existing flow matching methods.

Hongxu Chen, Yanghao Wang, Bowei Zhu +3

Computer Vision Multimodal Models Training Efficiency & Optimization

Apr 20, 2026

Apr 20, 2026·also BAAI, Rimbot, Xiaomi EV

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

Endowing VLMs with intrinsic 3D geometric awareness and physical interaction cues via XEmbodied substantially boosts performance on spatial reasoning and embodied tasks, surpassing existing 2D image-text pretrained models.

Kangan Qian, ChuChu Xie, Yang Zhong +11

Computer Vision Multimodal Models Robotics & Embodied AI

Jinghui Lu +51·Apr 20, 2026·also CAS, HKU, NJU +2

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Latent reasoning can beat explicit Chain-of-Thought – but only if you force it to learn causal dynamics via a visual world model, not just language.

Jinghui Lu, Jiayi Guan, Zhijian Huang +49

Multimodal Models Reasoning & Chain-of-Thought World Models & Planning

Apr 19, 2026

Apr 19, 2026·also BIT, UMacau, Xiaomi EV

Think before Go: Hierarchical Reasoning for Image-goal Navigation

Image-goal navigation gets a boost from hierarchical reasoning, using vision-language models for high-level planning and online RL for low-level execution, significantly reducing wandering and improving success in complex environments.

Shaoqing Xu, Fang Li, Lin Zhao +2

Computer Vision Multimodal Models Robotics & Embodied AI

Apr 14, 2026

Apr 14, 2026·also UMacau, Xiaomi EV

Spatial-Spectral Adaptive Fidelity and Noise Prior Reduction Guided Hyperspectral Image Denoising

Achieve state-of-the-art hyperspectral image denoising by adaptively balancing data fidelity and noise priors, outperforming existing methods that overemphasize image priors.

Xuelin Xie, Xiliang Lu, Zhengshan Wang +1

Computer Vision

Apr 2, 2026

Apr 2, 2026·also Xiaomi EV

SteerFlow: Steering Rectified Flows for Faithful Inversion-Based Image Editing

Achieve high-fidelity image editing without sacrificing source fidelity by straightening the latent trajectory and adaptively blending source and target velocities.

T. Dao, Zhen Wang, Kien T. Pham +1

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Yongkang Li +14·Apr 2, 2026·also Xiaomi EV

UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving

Autonomous driving models no longer need to compromise between spatial perception and semantic reasoning: UniDriveVLA's expert decoupling unlocks state-of-the-art performance across a range of driving tasks.

Yongkang Li, Lijun Zhou, Sixu Yan +12

Multimodal Models Robotics & Embodied AI World Models & Planning

Apr 1, 2026

Sicheng Zuo +8·Apr 1, 2026·also Xiaomi EV

DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

Ditch language descriptions: this new driving model leverages dense 3D geometry for superior autonomous driving performance and cross-camera generalization.

Sicheng Zuo, Zixun Xie, Wenzhao Zheng +6

Multimodal Models Robotics & Embodied AI World Models & Planning

Papers (18)