Jun 9, 2026 | arXiv:2606.10431

Vision-Assisted Foundation Model for Solving Multi-Task Vehicle Routing Problems

Shuangchun Gui, Zhiguang Cao, Wen Song, Yew-Soon Ong

AI Summary

This paper introduces a vision-assisted foundation model (VaFM) that integrates patch-level semantics from images with graph-based representations to solve multi-task vehicle routing problems (VRPs) involving diverse customer constraints. It targets three challenges: existing VRP images lack constraint representations, individual patches have a fixed receptive field, and pixel distributions are imbalanced across constraints. Evaluated on 16 VRP variants, VaFM outperforms state-of-the-art methods, particularly on variants with complex constraints.

Key Contribution

VaFM fuses patch-level visual semantics with a graph-based multi-task VRP solver and outperforms state-of-the-art methods, especially on variants with complex constraints.

Abstract

Multi-task vehicle routing problems (VRPs) play a critical role in enhancing efficiency across various industries and service sectors. These problems consist of multiple variants that optimize routing costs while meeting diverse customer constraints. Existing multi-task VRP solvers rely solely on a graph-based modality, which limits their ability to address variants with multiple constraints. As a format for representing complex semantics, the vision modality shows great potential for encoding diverse VRP constraints. This motivates us to learn patch-level semantics from images and then integrate them into a graph-based model to solve various VRP variants simultaneously. However, directly applying this approach to multi-task VRPs presents three challenges: 1) existing VRP images lack constraint representations, which are essential for multi-task VRPs, 2) the fixed receptive field of individual patches cannot effectively accommodate varying requirements across tasks, and 3) an imbalanced pixel distribution among constraints may cause the model to overlook constraints with fewer pixels. In this paper, we propose a vision-assisted foundation model (VaFM) to address these challenges. In the vision modality, input images tailored to all constraints are encoded by a convolutional neural network. The resulting patch embeddings are fused with graph-based nodes to generate solutions, with an auxiliary task designed to address the pixel-imbalance issue. The performance of VaFM is evaluated across 16 different VRP variants. The experimental results demonstrate the superiority of VaFM over state-of-the-art methods, especially for variants with complex constraints.
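The abstract does not specify how the constraint-aware images are built or how patches are matched to graph nodes, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes customer coordinates in the unit square, one image channel per constraint, a fixed square patch grid, and inverse-frequency weighting as a stand-in for handling pixel imbalance. The names `rasterize`, `patch_index`, `inverse_frequency_weights`, and the constraint labels `capacity` and `time_window` are all hypothetical.

```python
def rasterize(nodes, size=8):
    """Mark each node's pixel in one channel per constraint it carries.

    nodes: list of dicts with 'xy' in [0, 1]^2 and a set of constraint names.
    Returns {constraint_name: set of (row, col) pixels}.
    """
    channels = {}
    for node in nodes:
        x, y = node["xy"]
        # Clamp so that a coordinate of exactly 1.0 stays inside the grid.
        pixel = (min(int(y * size), size - 1), min(int(x * size), size - 1))
        for name in node["constraints"]:
            channels.setdefault(name, set()).add(pixel)
    return channels


def patch_index(xy, size=8, patch=4):
    """Flat index of the patch containing a node, used to look up the
    patch embedding that would be fused with that node's graph embedding."""
    x, y = xy
    row = min(int(y * size), size - 1) // patch
    col = min(int(x * size), size - 1) // patch
    return row * (size // patch) + col


def inverse_frequency_weights(channels):
    """Weight each constraint channel by 1 / pixel count, scaled to mean 1,
    so that sparsely drawn constraints are not drowned out (an assumption,
    not the paper's auxiliary task)."""
    raw = {name: 1.0 / len(pixels) for name, pixels in channels.items()}
    mean = sum(raw.values()) / len(raw)
    return {name: w / mean for name, w in raw.items()}


# Hypothetical toy instance: every customer has a capacity demand,
# but only one has a time window.
nodes = [
    {"xy": (0.10, 0.10), "constraints": {"capacity"}},
    {"xy": (0.60, 0.20), "constraints": {"capacity", "time_window"}},
    {"xy": (0.30, 0.80), "constraints": {"capacity"}},
    {"xy": (0.90, 0.90), "constraints": {"capacity"}},
]
channels = rasterize(nodes)
weights = inverse_frequency_weights(channels)
patches = [patch_index(n["xy"]) for n in nodes]
```

In this toy instance the `time_window` channel has a single pixel against four for `capacity`, so it receives the larger weight. This illustrates the third challenge listed in the abstract: a constraint drawn with few pixels contributes little to an unweighted pixel-level objective.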

Computer Vision Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
