EPFL, Idiap, Manchester | Jun 8, 2026 | arXiv:2606.09449

Reasoning without Gold Standards: A Proxy-Judge Theory of Autoformalization

AI Summary

This paper introduces a reference-free proxy-judge framework for autoformalization (AF), a task whose outputs often lack a single correct reference. The framework replaces gold-standard matching with a vector of per-axis property checks organized into three structural scopes: global, per-module, and cross-domain. That verdict vector drives a reflective refinement loop in which each violated check routes the controller to a matching repair target. Across seven formalization backbones on miniF2F, ProofNet, e-SNLI, and ProntoQA, refinement consistently lifts Pass Rate over the single-shot in-context learning (ICL) baseline.

Key Contribution

Structured per-axis proxy judgments can drive refinement of reasoning outputs without exact gold standards, consistently raising Pass Rate over a single-shot ICL baseline.

Abstract

Complex reasoning tasks increasingly require systems to produce outputs whose correctness cannot be judged by exact match against a single reference. Autoformalization (AF) is a representative example; it asks a model to translate informal mathematical or logical reasoning into a formally checkable object, yet expert-validated formalizations do not scale beyond toy cases and a single informal argument can admit many valid formal renderings. Progress therefore depends on whether partial, structured proxies can substitute for exact references. We introduce a reference-free proxy-judge framework for AF that replaces gold-standard matching with a vector of per-axis property checks. The framework organizes the proxy along three structural scopes that cover global properties of the elicited object, per-module properties internal to its sub-components, and cross-domain properties that re-align it to the informal source, and aggregates each axis into a verdict vector. The vector drives a reflective refinement loop in which a violated coordinate routes the controller to a matching repair target, so each iteration changes only what is judged wrong. Under bounded judge noise, the expected intrinsic gap contracts geometrically to a noise-dependent plateau. Across seven formalization backbones on miniF2F, ProofNet, e-SNLI, and ProntoQA, refinement consistently lifts Pass Rate over the single-shot ICL baseline, and the per-axis proxy outperforms a matched scalar proxy on benchmarks where the baseline has room to improve. Structured proxy judgments therefore provide both a practical refinement signal and a theoretical handle on convergence when exact references are unavailable.
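The refinement loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the scope names, the `judge` and `repairs` callables, the toy candidate record, and the rule of repairing the first violated coordinate are all hypothetical. The `gap_bound` helper assumes a generic contraction-with-noise recursion, g_{t+1} <= rho * g_t + eps; the abstract does not state the paper's actual bound or constants.

```python
from typing import Callable, Dict, Tuple

# Hypothetical names for the three structural scopes mentioned in the abstract.
SCOPES = ("global", "per_module", "cross_domain")

Candidate = Dict[str, bool]   # toy stand-in for a formalization
Verdict = Dict[str, bool]     # scope -> True if that axis passes


def refine(
    candidate: Candidate,
    judge: Callable[[Candidate], Verdict],
    repairs: Dict[str, Callable[[Candidate], Candidate]],
    max_iters: int = 5,
) -> Tuple[Candidate, int]:
    """Repair only what the judge flags, until all axes pass or the budget ends."""
    for step in range(max_iters):
        verdict = judge(candidate)
        violated = [s for s in SCOPES if not verdict[s]]
        if not violated:
            return candidate, step
        # Route to the repair target matching the first violated coordinate
        # (an assumed routing rule, chosen for simplicity).
        candidate = repairs[violated[0]](candidate)
    return candidate, max_iters


def gap_bound(g0: float, rho: float, eps: float, t: int) -> float:
    """Unrolled bound for the ASSUMED recursion g_{t+1} <= rho * g_t + eps."""
    return rho**t * g0 + eps * (1 - rho**t) / (1 - rho)


# Toy judge: the candidate records directly which axes currently hold.
def toy_judge(c: Candidate) -> Verdict:
    return dict(c)


# Toy repairs: each one fixes exactly its own axis and leaves the rest alone.
toy_repairs = {s: (lambda c, s=s: {**c, s: True}) for s in SCOPES}

start = {"global": False, "per_module": True, "cross_domain": False}
final, steps = refine(start, toy_judge, toy_repairs)
plateau = 0.1 / (1 - 0.5)  # eps / (1 - rho), the noise-dependent floor
```

In this toy run two axes are violated, so the loop applies two targeted repairs and stops on the third judge call. Under the assumed recursion, the bound decays geometrically toward `eps / (1 - rho)` rather than to zero, which is one way to read the "noise-dependent plateau" in the abstract.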

Year: 2026

Venue: N/A
