UW-Madison | Feb 1, 2026 | arXiv:2602.00993

HERMES: A Holistic End-to-End Risk-Aware Multimodal Embodied System with Vision-Language Models for Long-Tail Autonomous Driving

Weizhe Tang, Junwei You, Jiaxi Liu, Zhaoyi Wang, Rui Gan, Zilin Huang, Feng Wei, Bin Ran

AI Summary

The paper introduces HERMES, a risk-aware end-to-end autonomous driving framework that integrates vision-language models with explicit long-tail risk cues to improve trajectory planning in complex mixed-traffic scenarios. HERMES uses a foundation-model-assisted annotation pipeline to generate structured Long-Tail Scene Context and Long-Tail Planning Context, which a Tri-Modal Driving Module then fuses with multi-view perception and historical motion cues. Experiments on a real-world long-tail dataset show that HERMES outperforms representative end-to-end and VLM-driven baselines in long-tail mixed-traffic scenarios.

Key Contribution

HERMES explicitly incorporates long-tail risk cues into end-to-end trajectory planning through a multimodal driving framework, with the goal of helping autonomous vehicles handle complex, rare events.

Abstract

End-to-end autonomous driving models increasingly benefit from large vision-language models for semantic understanding, yet ensuring safe and accurate operation under long-tail conditions remains challenging. These challenges are particularly prominent in long-tail mixed-traffic scenarios, where autonomous vehicles must interact with heterogeneous road users, including human-driven vehicles and vulnerable road users, under complex and uncertain conditions. This paper proposes HERMES, a holistic risk-aware end-to-end multimodal driving framework designed to inject explicit long-tail risk cues into trajectory planning. HERMES employs a foundation-model-assisted annotation pipeline to produce structured Long-Tail Scene Context and Long-Tail Planning Context, capturing hazard-centric cues together with maneuver intent and safety preference, and uses these signals to guide end-to-end planning. HERMES further introduces a Tri-Modal Driving Module that fuses multi-view perception, historical motion cues, and semantic guidance for risk-aware, accurate trajectory planning under long-tail scenarios. Experiments on a real-world long-tail dataset demonstrate that HERMES consistently outperforms representative end-to-end and VLM-driven baselines under long-tail mixed-traffic scenarios. Ablation studies verify the complementary contributions of key components.

Multimodal Models | Reasoning & Chain-of-Thought | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 46

Year: 2026

Venue: N/A
