This study introduces PaSBench-Video, a benchmark that evaluates the proactive safety warning capabilities of video-capable multimodal large language models (MLLMs) across four domains: driving, healthcare, daily life, and industrial production. The benchmark consists of 740 videos (481 risk and 259 no-risk); the risk videos are annotated with frame-level risk onset and accident boundaries, so that both the timing and the content of a model's warning can be assessed. None of the 13 tested MLLMs exceeded 20.0% on the strictest metric, and higher recall came with a higher false-positive rate. The problem is most acute in driving, where models struggle to distinguish hazardous from routine scenes.
No current MLLM reliably issues timely safety warnings: performance varies sharply across domains, and higher recall comes at the cost of more false positives.
Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Yet current benchmarks do not test this ability: they rely on static inputs, ignore timing precision, and omit false-positive measurement on safe scenes. We present PaSBench-Video, a 740-video benchmark with 481 risk and 259 no-risk videos across four domains: driving, healthcare, daily life, and industrial production. Risk videos are annotated with frame-level risk onset and accident boundaries. A model must observe the video causally and produce a warning that is both temporally calibrated and content-correct. Testing 13 MLLMs, we find that no model exceeds 20.0% on our strictest metric, and recall is tightly coupled with false-positive rate, with Pearson correlation 0.64: higher detection comes only at the cost of triggering warnings on the majority of safe clips. Performance splits sharply by domain: models achieve moderate recall at low false-positive rates in daily life, where risks are inherently anomalous, yet fire indiscriminately in driving, where routine and hazardous scenes look alike. These results indicate that current models rely on scene-level activity cues rather than reasoning about emerging harm.
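To make the evaluation setup concrete, the following is a minimal sketch, not the authors' scoring code. It assumes that a warning counts as timely when it falls between the annotated risk onset and the accident frame, that any warning on a no-risk clip counts as a false positive, and that the reported correlation is a plain Pearson coefficient over per-model (recall, false-positive rate) pairs. The `Clip` record, its field names, and the function names are hypothetical, and the content-correctness check described in the abstract is not modeled here.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Optional, Sequence, Tuple


@dataclass
class Clip:
    # Hypothetical record; field names are illustrative, not the benchmark's schema.
    risk_onset: Optional[int]     # frame where risk first becomes visible (None for no-risk clips)
    accident: Optional[int]       # frame where the accident occurs (None for no-risk clips)
    warning_frame: Optional[int]  # frame at which the model warned (None if it stayed silent)


def is_timely(clip: Clip) -> bool:
    """Assumed rule: a warning is timely if it lands in [risk_onset, accident)."""
    if clip.risk_onset is None or clip.accident is None or clip.warning_frame is None:
        return False
    return clip.risk_onset <= clip.warning_frame < clip.accident


def recall_and_fpr(clips: Sequence[Clip]) -> Tuple[float, float]:
    """Recall over risk clips and false-positive rate over no-risk clips."""
    risk = [c for c in clips if c.risk_onset is not None]
    safe = [c for c in clips if c.risk_onset is None]
    recall = sum(is_timely(c) for c in risk) / len(risk) if risk else 0.0
    fpr = sum(c.warning_frame is not None for c in safe) / len(safe) if safe else 0.0
    return recall, fpr


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Toy data for illustration only; these are not benchmark results.
    clips = [
        Clip(risk_onset=10, accident=50, warning_frame=30),        # timely
        Clip(risk_onset=10, accident=50, warning_frame=5),         # too early
        Clip(risk_onset=10, accident=50, warning_frame=None),      # missed
        Clip(risk_onset=None, accident=None, warning_frame=12),    # false positive
        Clip(risk_onset=None, accident=None, warning_frame=None),  # correctly silent
    ]
    print(recall_and_fpr(clips))
```

Under these assumptions, a model that warns on nearly every clip scores high recall but also a high false-positive rate. Computing `pearson` over the per-model recall and false-positive values is one way to quantify the coupling the abstract reports.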