Feb 15, 2026 | arXiv:2602.14012

From SFT to RL: Demystifying the Post-Training Pipeline for LLM-based Vulnerability Detection

AI Summary

This paper investigates the post-training pipeline for Large Language Model (LLM)-based vulnerability detection (VD), comparing cold-start Supervised Fine-Tuning (SFT), off-policy preference optimization, and on-policy Reinforcement Learning (RL). The study finds that SFT with rejection sampling outperforms rationalization-based supervision, that excessive SFT hinders self-exploration during RL, and that fine-grained reward signals are crucial for reliable credit assignment in RL. Models trained with on-policy RL using Group Relative Policy Optimization (GRPO) significantly outperform SFT and preference optimization methods, as well as zero-shot LLMs, highlighting the potential of on-policy RL for VD.
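To make the contrast between rejection sampling and rationalization concrete, here is a minimal sketch of rejection-sampling data curation. It assumes the model's responses have already been sampled without access to the label, and that each one ends in a parsed verdict. The `Candidate` class, the `rejection_sample` helper, and the `"vulnerable"`/`"safe"` label strings are hypothetical names chosen for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One sampled model response (hypothetical structure for illustration)."""
    reasoning: str  # free-form analysis produced by the model
    verdict: str    # parsed final answer, assumed to be "vulnerable" or "safe"


def rejection_sample(candidates, ground_truth, max_keep=1):
    """Keep only responses whose verdict matches the ground-truth label.

    The label is used purely as a filter after generation. This differs from
    rationalization, where the label is placed in the prompt and the model is
    asked to justify it, which is where ground-truth leakage can arise.
    """
    kept = [c for c in candidates if c.verdict == ground_truth]
    return kept[:max_keep]
```

A sample for which no candidate matches the label simply yields no SFT example under this scheme, rather than a justification written with the answer in view.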

Key Contribution

On-policy RL (GRPO) makes LLMs significantly better at vulnerability detection than SFT or preference optimization, outperforming even strong zero-shot baselines.

Abstract

The integration of LLMs into vulnerability detection (VD) has shifted the field toward interpretable and context-aware analysis. While post-training methods have shown promise in general coding tasks, their systematic application to VD remains underexplored. In this paper, we present the first comprehensive investigation into the post-training pipeline for LLM-based VD, spanning from cold-start SFT to off-policy preference optimization and on-policy RL, uncovering how data curation, stage interactions, reward mechanisms, and evaluation protocols collectively dictate the efficacy of model training and assessment. Our study identifies practical guidelines and insights: (1) SFT based on rejection sampling greatly outperforms rationalization-based supervision, which can introduce hallucinations due to ground-truth leakage. (2) While increased SFT epochs consistently benefit preference optimization, excessive SFT inhibits self-exploration during RL, ultimately limiting performance gains. (3) Coarse-grained reward signals often mislead RL, whereas fine-grained root-cause judgments ensure reliable credit assignment. Specification-based rewards offer further benefits but incur significant effort in specification generation. (4) Although filtering extremely hard-to-detect vulnerability samples improves RL training efficiency, the cost of performance loss should be considered in practical applications. (5) Models trained under GRPO significantly outperform those using SFT and preference optimization (i.e., DPO and ORPO), as well as a series of zero-shot SOTA LLMs, underscoring the significant potential of on-policy RL for LLM-based VD. (6) In contrast to binary matching that tends to overestimate performance, LLM-as-a-Judge based on root-cause analysis provides a more robust evaluation protocol, although its accuracy varies across judge models with different levels of security expertise.
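Points (3) and (5) can be illustrated with a small sketch of how a reward feeds into GRPO-style training. The group-relative advantage below follows the commonly described GRPO formulation, in which each sampled response's reward is normalized by the mean and standard deviation of its group. The two reward functions are assumptions made for illustration only: the paper's actual reward design, label strings, and root-cause judge are not specified in this abstract, and the `root_cause_correct` flag stands in for whatever judgment a fine-grained reward would rely on.

```python
import statistics


def coarse_reward(pred_label, true_label):
    """Binary label matching: rewards a correct verdict regardless of the reasoning."""
    return 1.0 if pred_label == true_label else 0.0


def fine_reward(pred_label, true_label, root_cause_correct):
    """Hypothetical fine-grained reward: a 'vulnerable' verdict earns credit
    only if the identified root cause was also judged correct."""
    if pred_label != true_label:
        return 0.0
    if true_label == "vulnerable" and not root_cause_correct:
        return 0.0
    return 1.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Uses the population standard deviation; eps avoids division by zero
    when every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Under the coarse reward, a response that reaches the right verdict for the wrong reason gets the same positive advantage as a correct analysis. Under the fine-grained variant it is scored like an incorrect response, which is one way to read the abstract's claim about credit assignment.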

Code Generation & Program Synthesis | Data Curation & Synthetic Data | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
