Baidu, BIT | Apr 15, 2026 | arXiv:2604.13954

HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark

AI Summary

The paper introduces the concept of "intrinsic risk" in autonomous agents, where failures arise from latent issues that propagate over long horizons even in benign environments. To evaluate this setting, the authors present HINTBench, a benchmark of 629 agent trajectories annotated for risk detection, risk-step localization, and failure-type identification. Experiments show that while LLMs can detect trajectory-level risk, they struggle with precise risk localization and fine-grained failure diagnosis, highlighting a significant gap in current agent safety capabilities.

Key Contribution

Even strong LLMs struggle to pinpoint the exact moment and cause of failure in risky agent trajectories arising from latent, intrinsic issues, achieving below 35 Strict-F1 on risk-step localization.

Abstract

Existing agent-safety evaluation has focused mainly on externally induced risks. Yet agents may still enter unsafe trajectories under benign conditions. We study this complementary but underexplored setting through the lens of intrinsic risk, where intrinsic failures remain latent, propagate across long-horizon execution, and eventually lead to high-consequence outcomes. To evaluate this setting, we introduce non-attack intrinsic risk auditing and present HINTBench, a benchmark of 629 agent trajectories (523 risky, 106 safe; 33 steps on average) supporting three tasks: risk detection, risk-step localization, and intrinsic failure-type identification. Its annotations are organized under a unified five-constraint taxonomy. Experiments reveal a substantial capability gap: strong LLMs perform well on trajectory-level risk detection, but their performance drops to below 35 Strict-F1 on risk-step localization, while fine-grained failure diagnosis proves even harder. Existing guard models transfer poorly to this setting. These findings establish intrinsic risk auditing as an open challenge for agent safety.
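To make the three tasks concrete, here is a minimal Python sketch of what one annotated trajectory might look like and how a step-localization score could be computed. Everything in it is an assumption for illustration: the abstract does not give the benchmark's schema or define Strict-F1, so the `TrajectoryAnnotation` fields, the placeholder failure-type label, and the set-overlap F1 in `exact_step_f1` are hypothetical stand-ins rather than the authors' format or metric.

```python
from dataclasses import dataclass
from typing import AbstractSet, FrozenSet, Optional


@dataclass(frozen=True)
class TrajectoryAnnotation:
    """Hypothetical record; field names are illustrative, not the benchmark's schema."""
    trajectory_id: str
    is_risky: bool                               # task 1: risk detection
    risk_steps: FrozenSet[int] = frozenset()     # task 2: risk-step localization
    failure_type: Optional[str] = None           # task 3: failure-type identification


def exact_step_f1(predicted: AbstractSet[int], gold: AbstractSet[int]) -> float:
    """Set-overlap F1 over step indices (an assumed metric, not the paper's Strict-F1)."""
    if not predicted and not gold:
        # Nothing to find and nothing predicted: treat as a perfect match.
        return 1.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Usage: one gold step found, one missed, one spurious prediction.
gold = TrajectoryAnnotation("t-001", True, frozenset({12, 13}), "placeholder-type")
score = exact_step_f1({12, 20}, gold.risk_steps)  # precision 0.5, recall 0.5
```

Under this assumed scoring, a model that flags the right trajectory but points at the wrong step gets no localization credit, which is one way detection and localization scores can diverge.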

Eval Frameworks & Benchmarks | Red-Teaming & Adversarial Robustness | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
