The paper introduces reusability and verifiability as novel metrics for evaluating the quality of Chain-of-Thought (CoT) reasoning, beyond task accuracy alone, in multi-agent information retrieval pipelines. The authors propose a Thinker-Executor framework that decouples CoT generation from execution, making it possible to measure how easily an Executor can reuse the Thinker's CoT (reusability) and how often an Executor matches the Thinker's answer when using that CoT (verifiability). Experiments across five benchmarks with four Thinker models and ten Executor models show that reusability and verifiability do not correlate with standard accuracy, and that specialized reasoning models do not consistently produce more reusable or verifiable CoTs than general-purpose LLMs.
Leaderboards focused on accuracy alone miss crucial aspects of Chain-of-Thought reasoning: reusability and verifiability.
In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning with each other in the form of Chain-of-Thought (CoT). Current CoT evaluation focuses narrowly on target task accuracy. However, this metric fails to assess the quality or utility of the reasoning process itself. To address this limitation, we introduce two novel measures: reusability and verifiability. We decouple CoT generation from execution using a Thinker-Executor framework. Reusability measures how easily an Executor can reuse the Thinker's CoT. Verifiability measures how frequently an Executor can match the Thinker's answer using the CoT. We evaluate four Thinker models against a committee of ten Executor models across five benchmarks. Our results reveal that reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current accuracy-based leaderboards for reasoning capability. Surprisingly, we find that CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.
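To make the distinction between verifiability and accuracy concrete, the following is a minimal sketch, not the authors' implementation. It assumes verifiability can be approximated as the fraction of Executor answers that match the Thinker's answer, pooled over items and Executors, with simple case-insensitive exact matching. The `Record` schema, the `normalize` helper, and the toy data are all hypothetical. Reusability is left out because the abstract does not specify how it is operationalized.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Record:
    """One benchmark item (hypothetical schema, not from the paper)."""
    thinker_answer: str          # answer the Thinker reached with its own CoT
    gold_answer: str             # benchmark reference answer
    executor_answers: list[str]  # one answer per Executor, each given the Thinker's CoT


def normalize(answer: str) -> str:
    # Assumption: case-insensitive exact match after trimming whitespace.
    return answer.strip().lower()


def verifiability(records: list[Record]) -> float:
    """Share of (item, Executor) pairs where the Executor matches the Thinker."""
    matches = 0
    total = 0
    for record in records:
        target = normalize(record.thinker_answer)
        for answer in record.executor_answers:
            matches += int(normalize(answer) == target)
            total += 1
    return matches / total if total else 0.0


def accuracy(records: list[Record]) -> float:
    """Share of items where the Thinker's own answer equals the gold answer."""
    if not records:
        return 0.0
    correct = sum(
        normalize(r.thinker_answer) == normalize(r.gold_answer) for r in records
    )
    return correct / len(records)


# Toy data: the first CoT leads every Executor to the Thinker's (wrong) answer,
# while the second CoT yields a correct answer that most Executors fail to reproduce.
records = [
    Record(thinker_answer="B", gold_answer="A", executor_answers=["B", "B", "b"]),
    Record(thinker_answer="C", gold_answer="C", executor_answers=["C", "D", "A"]),
]

print(f"verifiability = {verifiability(records):.3f}")
print(f"accuracy      = {accuracy(records):.3f}")
```

In this toy case the wrong answer is the one Executors reproduce most reliably, which illustrates how a CoT can score well on verifiability while the Thinker scores poorly on accuracy, and vice versa.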