Search papers, labs, and topics across Lattice.
This paper investigates whether small language models (7-9B parameters) can be induced to "sandbag" or deliberately underperform on MMLU-Pro benchmarks using symptom validity testing (SVT) logic. Contrary to expectations, models failed to exhibit significant below-chance performance when instructed to underperform. Instead, Llama-3-8B showed a positional bias, disproportionately selecting middle-alphabet options regardless of correctness, suggesting that positional-distribution shift may be a more effective signature than below-chance accuracy for detecting prompted underperformance at this model scale.
Forget sophisticated deception – small LLMs "sandbagging" on tests just pick option 'E' or 'F' regardless of the question, revealing a surprising positional bias instead of true answer-aware avoidance.
Detecting sandbagging--the deliberate underperformance on capability evaluations--is an open problem in AI safety. We tested whether symptom validity testing (SVT) logic from clinical malingering detection could identify sandbagging through below-chance performance (BCB) on forced-choice items. In a pre-registered pilot at the 7-9 billion parameter instruction-tuned scale (3 models, 4 MMLU-Pro domains, 4 conditions, 500 items per cell, 24,000 total trials), the plausibility gate failed. Zero of 12 model-domain cells showed significant below-chance performance under sandbagging instruction. Exploratory analyses revealed three qualitatively distinct failure modes. Qwen-2.5-7B and Phi-3.5-mini largely ignored the sandbagging instruction, with 62-88% response identity with the honest baseline. Llama-3-8B complied substantially but implemented underperformance as a positional heuristic, collapsing its response distribution onto middle-alphabet options (E at 31.8%, F at 26.1%) regardless of where the correct answer fell. This produced accuracy boosts of up to 33 percentage points when the correct answer coincidentally occupied the model's preferred position. An explicit anti-task instruction ("pick the least likely answer") drove two of three models below chance, with accuracy as low as 0.024. The capability for answer-aware avoidance therefore exists but is not activated by"deliberately underperform."BCB did not fail as a logical marker of answer-aware avoidance. It was not observed in this regime because the model showing the largest behavioural shift exhibited behaviour consistent with a position-dominant response policy rather than content-aware answer avoidance. We propose that positional-distribution shift may be a more effective behavioural signature than below-chance accuracy for detecting prompted underperformance at this model scale.