This paper introduces Humanity's Last Line of Verification (HLL), a benchmark designed to assess whether multimodal agents can effectively perform tasks that typically require human verification, specifically through interactive CAPTCHA challenges. The evaluation of eight advanced multimodal agents reveals significant performance variability across different CAPTCHA types, with agents struggling particularly under realistic conditions and when required to provide action traces for their answers. These findings highlight critical weaknesses in current agents' capabilities, particularly in localization, action calibration, and process consistency, underscoring the challenges of deploying AI in environments designed to prevent automation.
Current multimodal agents fail to consistently pass CAPTCHA tests, revealing fundamental limitations in their ability to replace humans in automated workflows.
Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly substitute for humans in workflows that services deliberately protect against automation? CAPTCHA verification makes this question concrete. It is not merely a visual puzzle, but a human-verification boundary placed before account creation, content access, form submission, and other protected actions. We introduce Humanity's Last Line of Verification (HLL), a controlled benchmark that uses interactive CAPTCHA verification to evaluate whether agents can cross this boundary through grounded, human-like interaction rather than recognition alone. HLL covers diverse CAPTCHA interactions and exposes agents to controlled realism stressors, including cluttered webpages, harder task variants, and trace-conditioned validation of the solving process. We evaluate eight frontier multimodal agents in a closed-loop GUI environment. The results show that current agents remain brittle at this human-substitution boundary: performance varies sharply across verification types, degrades under realistic interface conditions, and drops further when correct answers must be supported by valid action traces. By exposing gaps in localization, action calibration, state tracking, and process consistency, HLL provides a concrete testbed for measuring how close multimodal agents are to acting as human substitutes in protected real-world workflows. Our code is available at https://github.com/XinhaoS0101/HLL