Caiming Xiong

Research focus

Tool Use & Agents (3), Multimodal Models (2), Eval Frameworks & Benchmarks (2), Computer Vision (1)

Frequent co-authors

Silvio Savarese (3), Honglu Zhou (2), Ran Xu (2), Juan Carlos Niebles (2)

Papers (5)

Jul 13, 2026

also Stanford HAI, Salesforce AI

Evidence-Backed Video Question Answering

Video LLMs can significantly improve their QA performance by integrating spatio-temporal evidence, bridging the gap between accuracy and visual perception.

Shijie Wang, Honglu Zhou +13

Computer Vision, Multimodal Models

Jun 11, 2026

also PKU, Tencent AI, USTC

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

LLM agents struggle in dynamic environments, but EvoMem boosts their performance by capturing the evolution of memory, leading to better adaptability.

Jundong Xu, Qingchuan Li +21

Eval Frameworks & Benchmarks, Tool Use & Agents, World Models & Planning

Apr 23, 2026

also UC Santa Cruz

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

VLAA-GUI's innovative framework allows autonomous agents to not only verify their success but also adaptively recover from failures, achieving human-level performance in GUI tasks.

Q. Han, Haoqin Tu, Zijun Wang +11

Eval Frameworks & Benchmarks, Tool Use & Agents

Mar 5, 2026

NVIDIA · also Salesforce AI

Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial

Forget slow, end-to-end models: building real-time voice agents hinges on a cascaded streaming pipeline, as demonstrated by a new tutorial achieving sub-second latency.

Jielin Qiu, Zixiang Chen, Liangwei Yang +11

Open-Source Models & Weights, Speech & Audio, Tool Use & Agents

Jan 15, 2026

also Stanford HAI, UNC

Future Optical Flow Prediction Improves Robot Control & Video Generation

A unified Vision-Language Model and Diffusion architecture unlocks surprisingly effective optical flow forecasting from noisy web data, enabling language-conditioned robot control and video generation.

Kanchana Ranasinghe, Honglu Zhou, Yu Fang +7

Multimodal Models, Robotics & Embodied AI