This paper introduces Soft-RLVR, a reinforcement learning framework that uses decomposed, learned verification signals to provide denser partial-credit rewards for tasks with partially verifiable requirements. The authors formalize the tradeoff between the noise reduction gained by averaging item-level judgments and the risk that partial credit rewards incomplete responses, and they identify conditions under which checklist-based verification yields a more reliable training signal than holistic verification. They also present Soft-SVeRL, a self-verifying variant in which the policy acts as its own verifier, show that it needs explicit stabilization to prevent reward inflation, and demonstrate improved instruction following in a controlled setting.
Decomposing complex tasks into verifiable checklists unlocks more effective reinforcement learning, but only if you can avoid the pitfalls of reward hacking and verifier bias.
Reinforcement Learning from Verifiable Rewards (RLVR) has improved language models in domains such as mathematics and code, where correctness can be checked automatically. However, many important tasks are only partially verifiable: prompts contain multiple requirements, responses may satisfy some but not all of them, or no single reference answer might exist. We introduce Soft-RLVR, a framework for reinforcement learning from decomposed, learned verification signals. Soft-RLVR converts each prompt into a checklist of atomic requirements, scores candidate responses item by item with an LLM verifier, and trains on the resulting soft reward. Checklist-based rewards turn sparse pass/fail supervision into a denser partial-credit signal, but they also introduce a tradeoff: averaging item-level judgments can reduce verifier noise, while partial credit can reward incomplete responses. We formalize this tradeoff and identify conditions under which checklist-based verification gives a more reliable RL training signal than holistic verification. We further introduce Soft-SVeRL, a self-verifying variant of Soft-RLVR in which the policy also acts as the verifier. We show that self-verification is prone to reward inflation from overly permissive self-judgments, and that explicit stabilization is needed to prevent this collapse. In a controlled instruction-following setting with rule-based ground-truth evaluation, checklist-based Soft-RLVR improves IFEval by up to 11.1 points using only learned verifier rewards. Our experiments further show that verifier quality and checklist quality both affect downstream RL outcomes, and that explicit stabilization is essential for effective self-verification.
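The checklist-based reward described above can be illustrated with a minimal sketch. It assumes the soft reward is the plain mean of item-level verifier scores; the abstract mentions averaging but does not specify the exact aggregation, so treat that as an assumption. The names `soft_reward`, `strict_reward`, and `toy_verifier` are hypothetical, and the keyword-matching verifier is only a stand-in for the LLM verifier used in the paper.

```python
from typing import Callable, Sequence

# Hypothetical verifier signature: (response, requirement) -> score in [0, 1].
Verifier = Callable[[str, str], float]


def soft_reward(response: str, checklist: Sequence[str], verify: Verifier) -> float:
    """Partial-credit reward: mean of item-level scores (assumed aggregation)."""
    if not checklist:
        raise ValueError("checklist must contain at least one requirement")
    # Clamp each score to [0, 1] in case the verifier returns an out-of-range value.
    scores = [min(1.0, max(0.0, verify(response, item))) for item in checklist]
    return sum(scores) / len(scores)


def strict_reward(response: str, checklist: Sequence[str], verify: Verifier,
                  threshold: float = 0.5) -> float:
    """Sparse pass/fail reward: 1.0 only if every checklist item passes."""
    return float(all(verify(response, item) >= threshold for item in checklist))


def toy_verifier(response: str, requirement: str) -> float:
    # Stand-in for an LLM verifier: a case-insensitive keyword check.
    return 1.0 if requirement.lower() in response.lower() else 0.0


checklist = ["hello", "world", "goodbye"]
response = "Hello, world!"

partial = soft_reward(response, checklist, toy_verifier)   # two of three items met
strict = strict_reward(response, checklist, toy_verifier)  # one item fails
print(partial, strict)
```

Here the response satisfies two of the three requirements, so the soft reward gives partial credit while the pass/fail reward gives none. This is the tradeoff the abstract points to: the averaged signal is denser, but it also assigns a positive reward to an incomplete response.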