The paper introduces ReQueR, a framework that uses reinforcement learning to train a "Refiner" policy that rewrites ambiguous queries into explicit logical decompositions, eliciting better reasoning from frozen LLMs. To stabilize training, the authors propose the Adaptive Solver Hierarchy, a curriculum mechanism that dynamically aligns environmental difficulty with the Refiner's competence. ReQueR achieves consistent performance gains across diverse architectures and benchmarks, showing that a single Refiner can unlock reasoning in unseen models.
Forget fine-tuning every LLM: ReQueR trains a single, RL-powered query refiner that coaxes hidden reasoning abilities out of diverse, frozen models at inference time.
Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human queries and the structured logical form required to activate those capabilities. Existing alignment methods either incur prohibitive $O(N)$ costs by fine-tuning each model individually or rely on static prompts that fail to resolve query-level structural complexity. In this paper, we propose ReQueR (\textbf{Re}inforcement \textbf{Que}ry \textbf{R}efinement), a modular framework that treats reasoning elicitation as an inference-time alignment task. We train a specialized Refiner policy via Reinforcement Learning to rewrite raw queries into explicit logical decompositions, treating frozen LLMs as the environment. Rooted in the classical Zone of Proximal Development from educational psychology, we introduce the Adaptive Solver Hierarchy, a curriculum mechanism that stabilizes training by dynamically aligning environmental difficulty with the Refiner's evolving competence. ReQueR yields consistent absolute gains of 1.7\%--7.2\% across diverse architectures and benchmarks, outperforming strong baselines by 2.1\% on average. Crucially, it provides a promising paradigm for one-to-many inference-time reasoning elicitation, enabling a single Refiner trained on a small set of models to effectively unlock reasoning in diverse unseen models. Code is available at https://github.com/newera-xiao/ReQueR.
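The two ideas in the abstract can be sketched in a few lines of toy Python. This is an illustrative sketch only, not the paper's implementation: every name here (`refine`, `frozen_solver`, `pick_tier`) is hypothetical, the "Refiner" is a trivial string splitter standing in for the RL-trained policy, and the curriculum step is reduced to picking the solver tier whose observed success rate is closest to a target band.

```python
# Toy stand-ins for ReQueR's components (hypothetical names, not the paper's API).

def refine(query: str) -> list[str]:
    """Toy Refiner policy: rewrite an ambiguous query into an explicit
    logical decomposition. The real Refiner is trained with RL."""
    return [step.strip() for step in query.split(" and ")]

def frozen_solver(step: str) -> str:
    """Stand-in for a frozen solver LLM treated as the RL environment;
    its weights are never updated."""
    return f"answer({step})"

def requer_pipeline(query: str) -> list[str]:
    """Inference-time alignment: only the query is transformed, the
    solver stays frozen, so one Refiner can serve many models."""
    return [frozen_solver(s) for s in refine(query)]

def pick_tier(success_rates: dict[str, float], target: float = 0.5) -> str:
    """ZPD-style curriculum step (sketch of the Adaptive Solver Hierarchy):
    route rollouts to the solver tier whose observed success rate sits
    closest to a target, so difficulty tracks the Refiner's competence."""
    return min(success_rates, key=lambda tier: abs(success_rates[tier] - target))

print(requer_pipeline("find the area and state the perimeter"))
# -> ['answer(find the area)', 'answer(state the perimeter)']
print(pick_tier({"easy": 0.92, "medium": 0.55, "hard": 0.12}))
# -> medium
```

The point of the sketch is the division of labor: the environment (solver) is fixed, all learning pressure lands on the query-rewriting policy, and the curriculum keeps that policy training against solvers that are neither trivial nor hopeless.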