Feb 23, 2026 · arXiv:2602.20424

Implicit Intelligence -- Evaluating Agents on What Users Don't Say

Ved Sirdeshmukh, Marc Wetter

AI Summary

The paper introduces "Implicit Intelligence," a benchmark that evaluates whether AI agents can infer unstated user requirements related to accessibility, privacy, safety, and context, rather than only follow explicit instructions. The authors pair it with "Agent-as-a-World" (AaW), a harness in which language models simulate interactive environments defined in YAML, so agents can be tested on requests that look simple but hide complexity. Across 205 scenarios and 16 models, even the best model passes only 48.3% of scenarios, which points to a need for better contextual reasoning in AI agents.

Key Contribution

Despite advances in instruction following, current AI agents still struggle to infer implicit user needs related to safety, privacy, and accessibility, which leaves a critical gap in real-world applicability.

Abstract

Real-world requests to AI agents are fundamentally underspecified. Natural human communication relies on shared context and unstated constraints that speakers expect listeners to infer. Current agentic benchmarks test explicit instruction-following but fail to evaluate whether agents can reason about implicit requirements spanning accessibility needs, privacy boundaries, catastrophic risks, and contextual constraints. We present Implicit Intelligence, an evaluation framework testing whether AI agents can move beyond prompt-following to become genuine goal-fulfillers, paired with Agent-as-a-World (AaW), a harness where interactive worlds are defined in human-readable YAML files and simulated by language models. Our scenarios feature apparent simplicity in user requests, hidden complexity in correct solutions, and discoverability of constraints through environmental exploration. Evaluating 16 frontier and open-weight models across 205 scenarios, we find that even the best-performing model achieves only 48.3% scenario pass rate, revealing substantial room for improvement in bridging the gap between literal instruction-following and human-like contextual reasoning.
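As a rough illustration of the headline metric, the sketch below computes a scenario pass rate in Python. It is not the paper's implementation: the `ScenarioResult` structure, the check names, the two example scenarios, and the rule that a scenario passes only when every check is satisfied are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioResult:
    """Outcome of one scenario run (hypothetical structure, not the paper's schema)."""
    name: str
    # Maps a hypothetical implicit-requirement check to whether the agent met it.
    checks: dict[str, bool] = field(default_factory=dict)

    def passed(self) -> bool:
        # Assumption: a scenario passes only if every check is satisfied.
        return bool(self.checks) and all(self.checks.values())


def scenario_pass_rate(results: list[ScenarioResult]) -> float:
    """Return the percentage of scenarios that passed."""
    if not results:
        raise ValueError("no scenario results to score")
    return 100.0 * sum(r.passed() for r in results) / len(results)


# Two made-up scenarios: the explicit task succeeds in both,
# but the second misses an unstated privacy constraint.
results = [
    ScenarioResult("mute_phone", {"did_task": True, "kept_emergency_contacts": True}),
    ScenarioResult("share_album", {"did_task": True, "excluded_private_photos": False}),
]
print(f"{scenario_pass_rate(results):.1f}%")  # prints "50.0%"
```

For scale, 99 passing scenarios out of 205 would round to the reported 48.3%; the abstract itself gives only the percentage, not the underlying count.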

Constitutional AI & AI Ethics · Eval Frameworks & Benchmarks · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 32

Year: 2026

Venue: N/A
