Correspondence | Hunan | Apr 6, 2026 | arXiv:2604.04834

E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

Jiajun Zhai, Haowen Shi, Hao Shi, Shangwei Guo, Kailun Yang, Kaiwei Wang

AI Summary

The paper introduces E-VLA, an event-augmented Vision-Language-Action model designed to improve robotic manipulation robustness in challenging lighting and motion blur conditions where traditional frame-based vision fails. E-VLA directly utilizes motion and structural cues from event streams, avoiding image reconstruction, to maintain semantic perception and action consistency. Experiments on a new real-world RGB-event-action dataset demonstrate that simple event fusion techniques, like overlaying event maps, significantly boost task success rates in dark and blurred environments, achieving up to 90% success in low-light pick-and-place tasks.

Key Contribution

Even simple event-based augmentation can rescue vision-language-action models from complete failure in low-light or high-blur scenarios, boosting task success from 0% to 90% in some cases.

Abstract

Robotic Vision-Language-Action (VLA) models generalize well for open-ended manipulation, but their perception is fragile under sensing-stage degradations such as extreme low light, motion blur, and black clipping. We present E-VLA, an event-augmented VLA framework that improves manipulation robustness when conventional frame-based vision becomes unreliable. Instead of reconstructing images from events, E-VLA directly leverages motion and structural cues in event streams to preserve semantic perception and perception-action consistency under adverse conditions. We build an open-source teleoperation platform with a DAVIS346 event camera and collect a real-world synchronized RGB-event-action manipulation dataset across diverse tasks and illumination settings. We also propose lightweight, pretrained-compatible event integration strategies and study event windowing and fusion for stable deployment. Experiments show that even a simple parameter-free fusion, i.e., overlaying accumulated event maps onto RGB images, could substantially improve robustness in dark and blur-heavy scenes: on Pick-Place at 20 lux, success increases from 0% (image-only) to 60% with overlay fusion and to 90% with our event adapter; under severe motion blur (1000 ms exposure), Pick-Place improves from 0% to 20-25%, and Sorting from 5% to 32.5%. Overall, E-VLA provides systematic evidence that event-driven perception can be effectively integrated into VLA models, pointing toward robust embodied intelligence beyond conventional frame-based imaging. Code and dataset will be available at https://github.com/JJayzee/E-VLA.
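The parameter-free overlay fusion mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(x, y, t, p)` event layout, the fixed time window, and the red/blue polarity colors are all assumptions made for illustration. The only hardware detail used is the DAVIS346 sensor resolution of 346 x 260 pixels.

```python
import numpy as np


def accumulate_events(events, height, width, t_start, t_end):
    """Sum event polarities per pixel over the window [t_start, t_end).

    `events` is an (N, 4) array with columns (x, y, t, p), where p is
    +1 or -1. This layout is an assumption, not the paper's format.
    """
    events = np.asarray(events, dtype=np.float64).reshape(-1, 4)
    in_window = (events[:, 2] >= t_start) & (events[:, 2] < t_end)
    x = events[in_window, 0].astype(np.intp)
    y = events[in_window, 1].astype(np.intp)
    p = events[in_window, 3]
    event_map = np.zeros((height, width), dtype=np.float64)
    # np.add.at handles repeated pixel coordinates correctly.
    np.add.at(event_map, (y, x), p)
    return event_map


def overlay_events(rgb, event_map):
    """Return a copy of `rgb` with event pixels painted over it.

    Net-positive pixels become red and net-negative pixels become blue
    (an assumed color scheme); pixels without events are left unchanged.
    """
    out = rgb.copy()
    out[event_map > 0] = (255, 0, 0)
    out[event_map < 0] = (0, 0, 255)
    return out


# A completely dark frame at DAVIS346 resolution (346 x 260).
rgb = np.zeros((260, 346, 3), dtype=np.uint8)
events = np.array([
    [10, 5, 0.001, 1],    # two positive events at pixel (x=10, y=5)
    [10, 5, 0.002, 1],
    [20, 7, 0.003, -1],   # one negative event at pixel (x=20, y=7)
    [30, 9, 0.050, 1],    # falls outside the 10 ms window below
])
event_map = accumulate_events(events, 260, 346, t_start=0.0, t_end=0.010)
fused = overlay_events(rgb, event_map)
```

Because the fused output is still an ordinary 3-channel image, it could in principle be fed to a pretrained image encoder without architectural changes, which is presumably why the abstract describes this variant as parameter-free. The window length (10 ms here) is an arbitrary choice for the example; the paper studies event windowing separately.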

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0
Influential citations: 0
References: 67
Year: 2026
Venue: N/A
