University of Science | Aug 31, 2025 | arXiv:2509.00751

EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions

Dinh-Khoi Vo, Van-Loc Nguyen, Minh-Triet Tran, Trung-Nghia Le

AI Summary

This paper addresses event-based image retrieval from free-form captions with a multi-stage retrieval framework that combines dense article retrieval using Qwen3, event-aware language model reranking with Qwen3-Reranker, and image scoring using Qwen2-VL. Outputs from multiple configurations are fused with Reciprocal Rank Fusion (RRF). The approach pairs language-based reasoning with multimodal retrieval to capture the latent event semantics, context, and real-world knowledge expressed in complex captions. The system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge.

Key Contribution

EVENT-Retriever demonstrates that combining language-based reasoning with multimodal retrieval is effective for complex, real-world image understanding tasks, achieving the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge.

Abstract

Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, temporal context, or contain long, complex narratives. To tackle these issues, we introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. We leverage Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further enhance performance and robustness, we fuse outputs from multiple configurations using Reciprocal Rank Fusion (RRF). Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating the effectiveness of combining language-based reasoning and multimodal retrieval for complex, real-world image understanding. The code is available at https://github.com/vdkhoi20/EVENT-Retriever.
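The fusion step mentioned in the abstract can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the authors' implementation: the function name `reciprocal_rank_fusion`, the smoothing constant `k = 60` (a commonly used default), and the toy image IDs are all assumptions made for illustration. Each item receives a score equal to the sum of `1 / (k + rank)` over every ranked list in which it appears, and items are then sorted by that score.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse several ranked lists into one using Reciprocal Rank Fusion.

    `rankings` holds one ranked list per retrieval configuration, best first.
    `k` is a smoothing constant; 60 is an assumed default, not a value
    taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top item contributes 1 / (k + 1).
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order (stable sort).
    return sorted(scores, key=lambda item: scores[item], reverse=True)


# Hypothetical outputs from three retrieval configurations.
runs = [
    ["img_a", "img_b", "img_c"],
    ["img_b", "img_a", "img_d"],
    ["img_b", "img_c", "img_a"],
]
fused = reciprocal_rank_fusion(runs)
print(fused)
```

Because RRF uses only rank positions, it needs no score calibration across configurations, which makes it a convenient way to combine runs whose raw scores are not directly comparable.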

Computer Vision | Multimodal Models | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 34

Year: 2025

Venue: ACM Multimedia
