Mar 3, 2026 · arXiv:2603.02788

Agentified Assessment of Logical Reasoning Agents

AI Summary

This paper introduces an agentified assessment framework for evaluating logical reasoning agents, emphasizing reproducibility, auditability, and robustness to execution failures. An assessor agent issues tasks, enforces execution budgets, parses outputs, and logs structured failure types, interacting with the agent under test through a standardized agent-to-agent interface. As a case study, the paper benchmarks an auto-formalization agent that translates natural language premises and conclusions into executable Z3Py programs and uses SMT solving to check entailment. On the cleaned FOLIO validation set, this agent reaches 86.70% accuracy, surpassing a chain-of-thought baseline (73.89%).

Key Contribution

Standardized agent-to-agent interfaces can enable reproducible and auditable benchmarks for logical reasoning, revealing that auto-formalization agents can outperform chain-of-thought reasoning on FOLIO.
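To make the assessor's role concrete, here is a minimal sketch of an assessment loop in plain Python. It is not the paper's implementation: the JSON request and response shape, the failure-type names (`timeout`, `agent_error`, `parse_error`), the label vocabulary, and the functions `run_task` and `assess` are all assumptions chosen for illustration. The agent under test is modeled as any callable that takes a JSON string and returns a JSON string.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Assumed label vocabulary; the paper's exact label strings are not given here.
LABELS = {"True", "False", "Uncertain"}


def run_task(agent, task, budget_s):
    """Issue one task and return (label, failure_type, elapsed_seconds)."""
    # The gold label is withheld from the request sent to the agent under test.
    request = json.dumps({k: task[k] for k in ("id", "premises", "conclusion")})
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(agent, request)
    try:
        raw = future.result(timeout=budget_s)  # enforce the execution budget
    except FutureTimeout:
        return None, "timeout", time.monotonic() - start
    except Exception:
        return None, "agent_error", time.monotonic() - start
    finally:
        pool.shutdown(wait=False)  # do not block on a run that overran
    elapsed = time.monotonic() - start
    # Parse the agent's output; anything malformed is a structured failure.
    try:
        label = json.loads(raw)["label"]
    except (ValueError, KeyError, TypeError):
        return None, "parse_error", elapsed
    if not isinstance(label, str) or label not in LABELS:
        return None, "parse_error", elapsed
    return label, None, elapsed


def assess(tasks, agent, budget_s=1.0):
    """Run every task, record outcomes, and return (records, accuracy)."""
    records = []
    for task in tasks:
        label, failure, elapsed = run_task(agent, task, budget_s)
        records.append({"id": task["id"], "gold": task["label"],
                        "predicted": label, "failure": failure,
                        "seconds": elapsed})
    # A failed run has predicted=None, so it counts as incorrect.
    correct = sum(r["predicted"] == r["gold"] for r in records)
    return records, correct / len(records)
```

In this sketch a failed run is counted as a wrong answer instead of being dropped, so the accuracy stays comparable across agents that fail in different ways. A thread cannot be killed once started, so a production assessor would more likely run the agent in a separate process or container.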

Abstract

We present a framework for evaluating and benchmarking logical reasoning agents when assessment itself must be reproducible, auditable, and robust to execution failures. Building on agentified assessment, we use an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types, while the agent under test only needs to expose a standardized agent-to-agent interface. As a case study, we benchmark an auto-formalization agent for first-order logic (FOL) reasoning on a solver-verified and repaired split of FOLIO. The agent translates natural language premises and conclusions into executable Z3Py programs and employs satisfiability modulo theories (SMT) solving to determine logical entailment. On the cleaned FOLIO validation set, the auto-formalization agent achieves 86.70% accuracy under the assessor protocol, outperforming a chain-of-thought baseline (73.89%).
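The entailment step can be illustrated with the standard refutation argument: a conclusion follows from the premises exactly when the premises together with the negated conclusion are unsatisfiable. The sketch below is a dependency-free stand-in, not the paper's agent. It replaces Z3's SMT solving over first-order formulas with brute-force enumeration over propositional atoms. The function names, the three-way label scheme, and the rain example are assumptions made for illustration.

```python
from itertools import product


def satisfiable(formulas, atoms):
    """Return True if some truth assignment makes every formula true."""
    for values in product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        if all(f(env) for f in formulas):
            return True
    return False


def classify(premises, conclusion, atoms):
    """Label a conclusion as "True", "False", or "Uncertain" given premises."""
    def negated(env):
        return not conclusion(env)

    # Premises plus the negated conclusion are contradictory: entailed.
    if not satisfiable(premises + [negated], atoms):
        return "True"
    # Premises plus the conclusion are contradictory: the negation is entailed.
    if not satisfiable(premises + [conclusion], atoms):
        return "False"
    # Both the conclusion and its negation are consistent with the premises.
    return "Uncertain"


# Hypothetical example: "If it rains, the ground is wet. It rains."
atoms = ["rains", "wet"]
rule = lambda env: (not env["rains"]) or env["wet"]  # rains -> wet
fact = lambda env: env["rains"]
wet = lambda env: env["wet"]
dry = lambda env: not env["wet"]

print(classify([rule, fact], wet, atoms))  # True
print(classify([rule, fact], dry, atoms))  # False
print(classify([rule], wet, atoms))        # Uncertain
```

With Z3Py the same two checks would be posed to a `Solver` and compared against `unsat`. One caveat of this decision order is that inconsistent premises make every conclusion come out as "True".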

Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
