Ping Nie

Generators can dramatically improve their performance on long-tailed visual requests by leveraging a teach-then-search co-training approach, overcoming a critical knowledge boundary.

Haozhe Wang, Weijia Feng, Jinpeng Yu +7

Eval Frameworks & Benchmarks · Multimodal Models

Jun 12, 2026

Also SJTU, Texas A&M, UCSD, UofT +1

Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

Dr-DCI achieves 73.3% accuracy in agentic search tasks while efficiently scaling from 100K to 10M documents, outperforming traditional methods.

Yi Lu, Zhuofeng Li, Ping Nie +6

Recommendation & Information Retrieval · Tool Use & Agents

Apr 9, 2026

CMU ML · also Tsinghua AI, NJU, Waterloo

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Today's best AI agents can only complete 33% of common online tasks like booking appointments or filling out job applications, revealing a significant gap between current capabilities and real-world utility.

Yuxuan Zhang, Yubo Wang, Yipeng Zhu +19

Eval Frameworks & Benchmarks · Natural Language Processing · Tool Use & Agents

Apr 6, 2026

Watch Before You Answer: Learning from Visually Grounded Post-Training

Current video understanding benchmarks and post-training datasets are riddled with linguistic biases, meaning VLMs might be acing tests without actually "watching" the video.

Eunjeong Hwang, Huaisong Zhang, Penghui Du +7

Computer Vision · Eval Frameworks & Benchmarks · Multimodal Models

Mar 29, 2026

Also Texas A&M, University of Tehran, Waterloo

ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

Image generation models ace photorealistic art but still choke on screenshots and infographics, highlighting a critical gap in real-world applicability.

Samin Mahdizadeh Sani, Max Ku, Nima Jamali +20

Computer Vision · Eval Frameworks & Benchmarks · Multimodal Models

Mar 17, 2026

Also NJU, Waterloo

SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

A Qwen3-8B model, trained with a new SFT+RLAIF recipe on a challenging new benchmark, SWE-QA-Pro, beats GPT-4o in repository-level code understanding.

Songcheng Cai, Z. Lyu, Yuansheng Ni +14

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Tool Use & Agents

Feb 9, 2026

VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

MLLMs may ace your visual question answering, but VisPhyWorld reveals they're still struggling to actually *simulate* physics.

Jiarong Liang, Max W.F. Ku, Ka-Hei Hui +2

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought