Feb 24, 2026 · arXiv:2602.21189

Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi

AI Summary

This paper investigates the degradation of Pass@1 when optimizing for Pass@k in LLMs, a phenomenon observed in inference-aware fine-tuning. It provides a theoretical framework demonstrating that Pass@k optimization can reduce Pass@1 due to gradient conflict arising from prompt interference, specifically when Pass@k optimization implicitly reweights prompts toward low-success prompts that negatively interfere with Pass@1. Empirical validation is provided through experiments on mathematical reasoning tasks.

Key Contribution

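Optimizing an LLM for success within k attempts (pass@k) can reduce its first-attempt success rate (pass@1): pass@k optimization implicitly upweights low-success prompts, and when those prompts interfere negatively, the update direction rotates away from the pass@1 direction.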

Abstract

Pass@$k$ is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning. It counts a prompt as solved if any of $k$ independently sampled solutions passes a verifier. This multi-sample inference metric has motivated inference-aware fine-tuning methods that directly optimize pass@$k$. However, prior work reports a recurring trade-off: pass@$k$ improves while pass@1 degrades under such methods. This trade-off is practically important because pass@1 often remains a hard operational constraint due to latency and cost budgets, imperfect verifier coverage, and the need for a reliable single-shot fallback. We study the origin of this trade-off and provide a theoretical characterization of when pass@$k$ policy optimization can reduce pass@1 through gradient conflict induced by prompt interference. We show that pass@$k$ policy gradients can conflict with pass@1 gradients because pass@$k$ optimization implicitly reweights prompts toward low-success prompts; when these prompts are what we term negatively interfering, their upweighting can rotate the pass@$k$ update direction away from the pass@1 direction. We illustrate our theoretical findings with large language model experiments on verifiable mathematical reasoning tasks.
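The reweighting described in the abstract can be illustrated with a small numerical sketch. It is not the paper's method or experimental setup. It assumes each of the $k$ samples for a prompt succeeds independently with probability $p$ (that prompt's pass@1), so pass@$k$ equals $1 - (1 - p)^k$ and, by the chain rule, each prompt's pass@1 gradient is scaled by $k(1 - p)^{k-1}$. The two prompts, their success rates, and their gradient vectors are hypothetical toy values. "Negatively interfering" is read here, as an assumption, as a prompt whose gradient has a negative inner product with the average pass@1 gradient.

```python
import numpy as np

def pass_at_k(p, k):
    """Probability that at least one of k independent samples succeeds."""
    p = np.asarray(p, dtype=float)
    return 1.0 - (1.0 - p) ** k

def pass_at_k_weight(p, k):
    """Derivative of pass@k with respect to p: the factor that scales
    each prompt's pass@1 gradient inside the pass@k gradient."""
    p = np.asarray(p, dtype=float)
    return k * (1.0 - p) ** (k - 1)

# Hypothetical toy setup: one easy prompt and one hard prompt.
p = np.array([0.9, 0.1])        # per-prompt pass@1 (assumed values)
grads = np.array([
    [1.0, 0.0],                 # toy gradient of p[0] w.r.t. two parameters
    [-0.8, 0.3],                # toy gradient of p[1]; points against the average
])
k = 8

w = pass_at_k_weight(p, k)                   # the hard prompt gets the larger weight
g_pass1 = grads.mean(axis=0)                 # gradient of average pass@1
g_passk = (w[:, None] * grads).mean(axis=0)  # gradient of average pass@k

# To first order, a small step along g_passk changes average pass@1 in
# proportion to this inner product; it is negative for these toy numbers.
alignment = float(g_pass1 @ g_passk)

print("pass@k per prompt:", pass_at_k(p, k))
print("implicit weights: ", w)
print("alignment:        ", alignment)
```

With $k = 1$ every weight equals 1 and the two gradients coincide. As $k$ grows, the weight on the high-success prompt shrinks toward zero, so the pass@$k$ direction is dominated by the low-success prompt's gradient.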

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 31

Year: 2026

Venue: N/A
