The paper introduces HyTuning, a hybrid post-training framework that adaptively combines Reasoning Distillation (RD) and Reinforcement Learning from Internal Feedback (RLIF) to improve both accuracy and confidence faithfulness in LLMs for high-stakes tasks. HyTuning uses Progressive Reasoning Gain (PRG) to measure the progressive support for the final answer within reasoning traces, allowing for adaptive reweighting of RD and RLIF. Experiments on domain-specific and general benchmarks show that HyTuning improves accuracy and confidence faithfulness, even with limited supervised reasoning traces.
LLMs can be made more accurate *and* more trustworthy with a clever post-training method that selectively amplifies only the reasoning steps that progressively build confidence in the correct answer.
Large language models are increasingly deployed in high-stakes tasks, where confident yet incorrect inferences may cause severe real-world harm, bringing the previously overlooked issue of confidence faithfulness back to the forefront. A promising solution is to jointly optimize unsupervised Reinforcement Learning from Internal Feedback (RLIF) with reasoning-trace-guided Reasoning Distillation (RD), although this combination may face three persistent challenges: scarcity of high-quality training corpora, factually unwarranted overconfidence, and indiscriminate fusion that amplifies erroneous updates. Inspired by how human confidence accumulates from uncertainty to certainty, we propose Progressive Reasoning Gain (PRG) to measure whether reasoning steps progressively strengthen support for the final answer. Furthermore, we introduce HyTuning, a hybrid post-training framework that adaptively reweights RD and RLIF via a PRG-style metric, using scarce supervised reasoning traces as a stable anchor while exploiting abundant unlabeled queries for scalability. Experiments on several domain-specific and general benchmarks demonstrate that HyTuning improves accuracy while achieving confidence faithfulness under limited supervision, supporting a practical "Less Approximates More" effect.
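The abstract does not give a formula for PRG or for the reweighting rule, so the following is only a toy sketch of the general idea, not the paper's definition. It assumes a hypothetical per-step "support" value (the probability the model assigns to its final answer after each reasoning step), scores a trace by how consistently that support rises, and uses the score to mix two scalar losses. The function names `progressive_gain` and `hybrid_loss`, the normalization, and the direction of the weighting are all assumptions made for illustration.

```python
from typing import Sequence


def progressive_gain(support: Sequence[float]) -> float:
    """Illustrative PRG-style score in [-1, 1] (NOT the paper's definition).

    `support[i]` is an assumed per-step probability that the model assigns
    to its final answer after reasoning step i.
    """
    # Step-to-step change in support for the final answer.
    deltas = [b - a for a, b in zip(support, support[1:])]
    total_movement = sum(abs(d) for d in deltas)
    if total_movement == 0:
        return 0.0  # flat or single-step trace: no evidence either way
    # Net change divided by total movement: 1.0 for a steadily rising trace,
    # -1.0 for a steadily falling one, near 0 for a trace that oscillates.
    return sum(deltas) / total_movement


def hybrid_loss(rd_loss: float, rlif_loss: float, gain: float) -> float:
    """Convex mix of the two losses, weighted by the gain (assumed rule)."""
    # Assumption: internal feedback is trusted only when the gain is positive;
    # otherwise the supervised distillation term acts as the anchor.
    w_rlif = max(0.0, gain)
    return w_rlif * rlif_loss + (1.0 - w_rlif) * rd_loss


steady = progressive_gain([0.2, 0.4, 0.7, 0.9])   # support rises at every step
erratic = progressive_gain([0.5, 0.9, 0.3, 0.6])  # support swings up and down
```

Under these assumptions, the steadily rising trace receives the maximum score and shifts the mix fully toward the internal-feedback term, while the oscillating trace scores close to zero and leaves the supervised term dominant.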