This paper explores the feasibility of using Vision-Language Models (VLMs) as a "semantic observer layer" for autonomous vehicles, designed to detect context-dependent hazards missed by pixel-level detectors. The authors achieve a ~50x speedup by quantizing Nvidia Cosmos-Reason1-7B to NVFP4 and using FlashAttention2, reaching ~500 ms inference time. The study identifies recall collapse under NF4 quantization as a key deployment challenge and maps performance metrics to safety goals, demonstrating pre-deployment feasibility.
A 50x speedup makes VLMs fast enough to serve as a real-time semantic safety net for self-driving cars, but NF4 quantization can cause critical recall failures.
Semantic anomalies, context-dependent hazards that pixel-level detectors cannot reason about, pose a critical safety risk in autonomous driving. We propose a "semantic observer layer": a quantized vision-language model (VLM) running at 1–2 Hz alongside the primary AV control loop, monitoring for semantic edge cases and triggering fail-safe handoffs when they are detected. Using Nvidia Cosmos-Reason1-7B with NVFP4 quantization and FlashAttention2, we achieve ~500 ms inference, a ~50x speedup over the unoptimized FP16 baseline (no quantization, standard PyTorch attention) on the same hardware, satisfying the observer timing budget. We benchmark accuracy, latency, and quantization behavior in static and video conditions, identify NF4 recall collapse (to 10.6%) as a hard deployment constraint, and present a hazard analysis mapping performance metrics to safety goals. The results establish a pre-deployment feasibility case for the semantic observer architecture on embodied-AI AV platforms.
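The observer timing budget follows directly from the stated monitoring rate: at 1–2 Hz, each cycle allows 500–1000 ms, which the ~500 ms NVFP4 inference just meets. A minimal sketch of that arithmetic, using only numbers from the abstract (the function names are illustrative, not from the paper):

```python
# Hypothetical sketch: does a measured VLM inference latency fit the
# per-cycle budget implied by a given observer monitoring rate?
# All latency/rate figures below come from the abstract.

def observer_budget_ms(rate_hz: float) -> float:
    """Per-cycle time budget in milliseconds for an observer polling at rate_hz."""
    return 1000.0 / rate_hz

def fits_budget(latency_ms: float, rate_hz: float) -> bool:
    """True if one inference fits within one observer cycle."""
    return latency_ms <= observer_budget_ms(rate_hz)

# ~500 ms NVFP4 inference vs. the 2 Hz (500 ms) and 1 Hz (1000 ms) budgets:
print(fits_budget(500.0, 2.0))  # True: exactly meets the 2 Hz budget
print(fits_budget(500.0, 1.0))  # True: comfortable at 1 Hz

# The unoptimized FP16 baseline is ~50x slower (~25 s) and misses both:
print(fits_budget(500.0 * 50, 1.0))  # False
```

Note that meeting the budget with zero slack at 2 Hz leaves no headroom for pre/post-processing, which is why the 1 Hz end of the stated range is the more conservative deployment point.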