Apr 28, 2026 · arXiv:2604.25345

Plausible but Wrong: A case study on Agentic Failures in Astrophysical Workflows

AI Summary

The paper evaluates CMBAgent, an agentic AI system, on astrophysical workflows in One-Shot and Deep Research settings. Providing domain-specific context significantly improves performance in the One-Shot setting, but the agent still exhibits silent failures, generating plausible but incorrect results, particularly on tasks designed to probe its reasoning limits. The most concerning failure mode is the confident generation of incorrect results without self-diagnosis, especially in the Deep Research setting, where the agent produces physically inconsistent posteriors.

Key Contribution

Agentic AI systems can confidently generate plausible but wrong scientific results, even when given domain-specific context, highlighting a critical challenge for their integration into research workflows.

Abstract

Agentic AI systems are increasingly being integrated into scientific workflows, yet their behavior under realistic conditions remains insufficiently understood. We evaluate CMBAgent across two workflow paradigms and eighteen astrophysical tasks. In the One-Shot setting, access to domain-specific context yields an approximately 6x performance improvement (0.85 vs. ~0 without context), with the primary failure mode being silent incorrect computation: syntactically valid code that produces plausible but inaccurate results. In the Deep Research setting, the system frequently exhibits silent failures across stress tests, producing physically inconsistent posteriors without self-diagnosis. Overall, performance is strong on well-specified tasks but degrades on problems designed to probe reasoning limits, often without visible error signals. These findings highlight that the most concerning failure mode in agentic scientific workflows is not overt failure, but confident generation of incorrect results. We release our evaluation framework to facilitate systematic reliability analysis of scientific AI agents.
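As a purely illustrative sketch of what "silent incorrect computation" can look like, the snippet below contrasts a correct Hubble-distance calculation with a version that mixes units. It is not taken from the paper, from CMBAgent, or from the released evaluation framework. The function names, the unit mix-up, and the plausibility bounds are all hypothetical, chosen only to show code that runs without error yet returns a wrong number, and how an explicit sanity check might surface it.

```python
# Illustrative only: a hypothetical example, not code from the paper or CMBAgent.

C_KM_S = 299_792.458    # speed of light in km/s
C_M_S = 299_792_458.0   # speed of light in m/s


def hubble_distance_mpc(h0_km_s_mpc: float) -> float:
    """Hubble distance c / H0 in Mpc, with c in km/s to match H0's units."""
    return C_KM_S / h0_km_s_mpc


def hubble_distance_buggy(h0_km_s_mpc: float) -> float:
    """Same formula with c in m/s: runs cleanly, but is 1000x too large."""
    return C_M_S / h0_km_s_mpc


def check_plausible(d_h_mpc: float, low: float = 3000.0, high: float = 6000.0) -> bool:
    """Hypothetical sanity bound (Mpc), roughly H0 between 50 and 100 km/s/Mpc."""
    return low <= d_h_mpc <= high


if __name__ == "__main__":
    h0 = 70.0  # assumed value in km/s/Mpc, for illustration
    for fn in (hubble_distance_mpc, hubble_distance_buggy):
        d_h = fn(h0)
        status = "ok" if check_plausible(d_h) else "FAILS sanity check"
        print(f"{fn.__name__}: {d_h:.1f} Mpc ({status})")
```

Neither function raises an exception, so only the explicit bound distinguishes them. The error here is deliberately large; subtler mistakes would need tighter, task-specific checks.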

Code Generation & Program Synthesis · Scientific Discovery & Drug Design · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 27

Year: 2026

Venue: N/A
