iAgentBench is a new open-domain question answering benchmark for evaluating the sensemaking capabilities of information-seeking agents, with a focus on tasks that require integrating evidence across sources. The benchmark draws on real-world high-traffic topics and common user intent patterns to generate realistic questions that cannot be answered by single-passage retrieval. Experiments with multiple LLMs show that retrieval improves accuracy but does not by itself reliably resolve the questions, which highlights the importance of evaluating how evidence is used rather than only whether it is retrieved.
Many widely used QA benchmarks can be answered by retrieving a single relevant passage, so iAgentBench offers a more realistic challenge by requiring agents to synthesize information from multiple sources on high-traffic topics.
With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable by retrieving a single relevant passage, making them poorly suited for measuring cross-source sensemaking, such as integrating evidence, tracking causal links, and resolving dependencies across facets of a topic. We present iAgentBench, a dynamic ODQA benchmark that targets these higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct user-like questions whose answers require combining evidence from multiple sources, not just extracting a single snippet. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks and enable fine-grained diagnosis of failures in retrieval versus synthesis. Experiments across multiple LLMs show that retrieval improves accuracy, but retrieval alone does not reliably resolve these questions, underscoring the need to evaluate evidence use, not just evidence access.