IIT Delhi | Mar 2, 2026 | arXiv:2603.01580

Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models

AI Summary

The paper introduces MarODE, an offline evaluation framework that assigns quality scores to reasoning traces generated by language models, with the goal that these scores track human judgment. MarODE models reasoning progression as a Markov process and characterizes trace dynamics using ordinary differential equations, enabling efficient evaluation. In a large-scale evaluation against human judgments of reasoning quality, MarODE outperforms existing baselines by over 250% under Somers' D correlation.

Key Contribution

Theory-driven evaluation of reasoning traces can outperform existing baselines by over 250% under Somers' D correlation with human judgments, offering a more reliable way to assess reasoning quality in language models.

Abstract

Reasoning traces produced by generative language models are increasingly used for tasks ranging from mathematical problem solving to automated fact checking. However, existing evaluation methods remain largely mechanical and fail to capture human-centric notions of reasoning quality in a way that generalizes across varied and progressively degraded reasoning. We introduce MarODE, an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments, which jointly evaluate the fundamental dimensions of an evaluation metric: goodness and soundness. The approach is grounded in a Markovian formulation of reasoning progression and an ordinary differential equation based characterization of trace dynamics, enabling efficient evaluation of reasoning quality. In a large-scale evaluation, MarODE outperforms existing baselines by over 250% under Somers' D correlation. Our results emphasize the value of theory-driven evaluation frameworks as reasoning traces become central to language model-based systems.
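
For readers unfamiliar with the statistic named in the abstract, the following is a minimal sketch of Somers' D, an ordinal association measure. It is not the paper's evaluation code. The function name `somers_d`, the toy data, and the choice to treat human ratings as the independent variable are all assumptions made here for illustration; the listing does not say how the authors set up the computation.

```python
from itertools import combinations


def somers_d(x, y):
    """Somers' D(Y|X): (concordant - discordant) / number of pairs not tied on x.

    Hypothetical helper for illustration; x is treated as the independent
    variable and y as the dependent one.
    """
    concordant = discordant = untied_x = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # pairs tied on x are excluded from the denominator
        untied_x += 1
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # pairs tied only on y count in the denominator but not the numerator
    if untied_x == 0:
        raise ValueError("all values of x are tied; Somers' D is undefined")
    return (concordant - discordant) / untied_x


# Toy data (invented for this sketch): ordinal human ratings of five traces
# and the scores a hypothetical metric assigned to the same traces.
human_ratings = [1, 2, 2, 3, 4]
metric_scores = [0.2, 0.5, 0.4, 0.3, 0.9]

print(somers_d(human_ratings, metric_scores))  # 7 concordant, 2 discordant, 9 untied pairs
```

On this toy data the value is (7 - 2) / 9, or about 0.56. A value of 1 would mean the metric orders every pair of traces the same way the human ratings do, and -1 would mean it reverses every pair.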

Eval Frameworks & Benchmarks | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
