This study investigates whether instruction-tuned large language models (LLMs) retain their alignment when they are converted into reasoning models through supervised fine-tuning, RL-based post-training, or distillation. The authors conduct a trustworthiness audit across six dimensions and find that, while reasoning performance often improves, alignment behaviors such as safe refusal and bias avoidance often regress, with increased toxicity, amplified stereotyping, and contextual privacy leakage. The findings underscore the need to report trustworthiness metrics alongside reasoning capability, so that reasoning gains do not come at the cost of alignment.
Reasoning models may boost performance but often sacrifice critical alignment behaviors, revealing a hidden trade-off in AI safety.
Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the alignment behavior of the instruction-tuned model, such as safe refusal, bias avoidance, and privacy protection. We ask: does this conversion preserve alignment? We study this question through a trustworthiness audit and find that it is not behavior-preserving by default. For a systematic analysis, we compare reasoning models produced via supervised fine-tuning, RL-based post-training, and distillation against matched instruction-tuned baselines across six trustworthiness dimensions: safety, toxicity, stereotyping and bias, machine ethics, privacy, and out-of-distribution robustness. We observe that reasoning models often improve on reasoning benchmarks but exhibit alignment regressions, including increased toxicity, amplified stereotyping, miscalibrated refusal, and contextual privacy leakage. These regressions are consistent with behavioral drift from the instruction-tuned baseline, measured by KL divergence. Overall, our results point to the broader conclusion that trustworthiness metrics are essential for evaluating reasoning models and should be reported alongside gains in reasoning capability.
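The abstract reports behavioral drift as a KL divergence from the instruction-tuned baseline but does not spell out the procedure here. A minimal sketch of one plausible reading follows: average the per-position KL divergence between the two models' next-token distributions on the same prompt. The function names, the direction KL(baseline || reasoning), and the toy logits are all assumptions for illustration, not the authors' implementation.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary.

    eps guards against log(0); terms with p_i == 0 contribute nothing.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def mean_token_kl(baseline_logits, reasoning_logits):
    """Average KL(baseline || reasoning) across token positions.

    Each argument is a list of per-position logit vectors, assumed to come
    from scoring the same token sequence with each model (hypothetical setup).
    """
    if len(baseline_logits) != len(reasoning_logits):
        raise ValueError("both models must be scored on the same positions")
    kls = [
        kl_divergence(softmax(b), softmax(r))
        for b, r in zip(baseline_logits, reasoning_logits)
    ]
    return sum(kls) / len(kls)

# Toy logits over a 3-token vocabulary at 2 positions (made-up numbers).
baseline_logits = [[2.0, 0.5, -1.0], [0.1, 0.1, 3.0]]
reasoning_logits = [[2.0, 0.5, -1.0], [3.0, 0.1, 0.1]]

drift = mean_token_kl(baseline_logits, reasoning_logits)
print(f"mean per-token KL drift: {drift:.4f} nats")
```

In this toy case the first position matches exactly and contributes zero, so all of the drift comes from the second position, where the two models prefer different tokens. In practice the logits would come from running both models on a shared evaluation set, and the direction of the divergence would need to match whatever the paper specifies.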