HKU | OPPO | SJTU | USTC | Mar 2, 2026 | arXiv:2603.01493

PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval

Tianyi Xu, Rong Shan, Junjie Wu, Jiadeng Huang, Teng Wang, Wenteng Chen, Minxin Tu, Quantao Dou, Zhaoxiang Wang, Changwang Zhang, Jianghao Lin

AI Summary

The paper introduces PhotoBench, a new benchmark for personalized photo retrieval designed to evaluate multi-source, intent-driven reasoning using real-world personal photo albums. PhotoBench uses a multi-source profiling framework integrating visual semantics, spatio-temporal metadata, social identity, and temporal events to synthesize complex queries reflecting users' life trajectories. Experiments using PhotoBench reveal limitations in existing models, specifically a modality gap where unified embeddings fail on non-visual constraints and a source fusion paradox where agentic systems struggle with tool orchestration.

Key Contribution

Personal photo retrieval isn't just about visual similarity; PhotoBench reveals that current models fail to leverage the rich context of our lives—time, place, people—needed to truly understand our search intent.

Abstract

Personal photo albums are not merely collections of static images but living, ecological archives defined by temporal continuity, social entanglement, and rich metadata, which makes personalized photo retrieval non-trivial. However, existing retrieval benchmarks rely heavily on context-isolated web snapshots and fail to capture the multi-source reasoning required to resolve authentic, intent-driven user queries. To bridge this gap, we introduce PhotoBench, the first benchmark constructed from authentic personal albums. It is designed to shift the paradigm from visual matching to personalized, multi-source, intent-driven reasoning. Based on a rigorous multi-source profiling framework, which integrates visual semantics, spatio-temporal metadata, social identity, and temporal events for each image, we synthesize complex intent-driven queries rooted in users' life trajectories. Extensive evaluation on PhotoBench exposes two critical limitations: the modality gap, where unified embedding models collapse on non-visual constraints, and the source fusion paradox, where agentic systems orchestrate tools poorly. These findings indicate that the next frontier in personal multimodal retrieval lies beyond unified embeddings and requires robust agentic reasoning systems capable of precise constraint satisfaction and multi-source fusion. PhotoBench is available.
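To make the idea of multi-source, constraint-aware retrieval concrete, here is a minimal sketch. It is not the PhotoBench implementation: the `PhotoProfile` fields, the `retrieve` function, and the toy two-dimensional embeddings are all hypothetical names and values chosen for illustration. The sketch treats non-visual constraints (person, place, year) as hard filters and uses visual similarity only to rank the photos that survive them.

```python
from dataclasses import dataclass, field
from datetime import datetime
from math import sqrt
from typing import List, Optional, Set


@dataclass
class PhotoProfile:
    """Hypothetical per-image profile; field names are illustrative assumptions."""
    photo_id: str
    embedding: List[float]                         # visual semantics (toy vector)
    taken_at: datetime                             # temporal metadata
    place: str                                     # spatial metadata
    people: Set[str] = field(default_factory=set)  # social identity
    event: Optional[str] = None                    # temporal event label


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(album, query_embedding, *, person=None, place=None, year=None, top_k=5):
    """Filter on non-visual constraints first, then rank survivors visually."""
    candidates = [
        p for p in album
        if (person is None or person in p.people)
        and (place is None or p.place == place)
        and (year is None or p.taken_at.year == year)
    ]
    ranked = sorted(
        candidates,
        key=lambda p: cosine(p.embedding, query_embedding),
        reverse=True,
    )
    return [p.photo_id for p in ranked[:top_k]]


# Toy album: photo "d" matches the query visually but violates the constraints.
album = [
    PhotoProfile("a", [1.0, 0.0], datetime(2024, 7, 1), "Kyoto", {"Mia"}, "trip"),
    PhotoProfile("b", [0.9, 0.1], datetime(2023, 7, 1), "Kyoto", {"Mia"}),
    PhotoProfile("c", [0.0, 1.0], datetime(2024, 8, 1), "Kyoto", {"Mia"}),
    PhotoProfile("d", [1.0, 0.0], datetime(2024, 7, 2), "Osaka", {"Leo"}),
]
result = retrieve(album, [1.0, 0.0], person="Mia", place="Kyoto", year=2024)
print(result)  # ['a', 'c']
```

In this toy setup, ranking by embedding similarity alone would score photo "d" as highly as photo "a", even though it shows a different person in a different city. That is a simplified picture of why visual similarity alone cannot enforce non-visual constraints.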

Computer Vision | Eval Frameworks & Benchmarks | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
