Center for Machine Learning | LMU | TU Munich | Jun 14, 2026 | arXiv:2606.16011

Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

Nafiseh Nikeghbal, Amir Hossein Kargaran, Shaghayegh Kolli, Jana Diesner

AI Summary

This study introduces a protocol for evaluating whether large language models (LLMs) keep a correct answer when faced with a counterargument, a property that standard accuracy benchmarks do not measure. After challenging models with coherent arguments for incorrect options, the authors observed flip rates ranging from 17.5% to 97.3% across seven models and 57 MMLU subjects. Self-attribution increased flip rates (mean +7.1pp, up to +18.7pp), and a curated challenge set, MaxFlip, amplified flips by up to +23.6pp over standard self-generated challenges.

Key Contribution

After answering a question correctly, LLMs flip to an incorrect option in 17.5% to 97.3% of cases, depending on the model, when shown a coherent counterargument. This instability is not captured by accuracy metrics alone.

Abstract

Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a plausible counter-argument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge the model's answer with a coherent argument for an incorrect option and measure whether the model flips. The setup a) isolates argumentative content from overt social pressure and b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that are not captured by accuracy metrics alone. We find that self-attribution consistently increases flip rates (mean +7.1pp, up to +18.7pp). Also, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that amplifies flips by up to +23.6pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at https://github.com/nafisenik/WhoFlips and https://hf.co/datasets/nafisehNik/WhoFlips.
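The quantities named in the abstract (flip rate, the self-attribution difference in percentage points, and the effect of pooling arguments across source models) can be illustrated with a minimal sketch. The record layout, field names, and function names below are hypothetical and are not taken from the released materials. The pooled measure in particular is an assumption: it counts a question as flipped if any pooled argument flips it, which is one possible reading of "selecting the most effective one per question".

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ChallengeRecord:
    """One challenge shown to a target model (hypothetical layout)."""
    question_id: str
    source_model: str      # model that wrote the wrong-answer argument
    self_attributed: bool  # argument presented as the target model's own
    flipped: bool          # target abandoned its initially correct answer


def flip_rate(records: Iterable[ChallengeRecord]) -> float:
    """Percentage of challenges after which the model flipped."""
    records = list(records)
    if not records:
        raise ValueError("flip rate is undefined for an empty record set")
    return 100.0 * sum(r.flipped for r in records) / len(records)


def self_attribution_delta_pp(records: Iterable[ChallengeRecord]) -> float:
    """Flip rate with self-attribution minus flip rate without it,
    expressed in percentage points (pp)."""
    records = list(records)
    attributed = [r for r in records if r.self_attributed]
    plain = [r for r in records if not r.self_attributed]
    return flip_rate(attributed) - flip_rate(plain)


def pooled_flip_rate(records: Iterable[ChallengeRecord]) -> float:
    """Percentage of questions flipped by at least one pooled argument.

    Assumption: "most effective argument per question" is read as an
    oracle choice over all source models, so a question counts as
    flipped if any of its challenges produced a flip.
    """
    flipped_by_question: dict[str, bool] = {}
    for r in records:
        previous = flipped_by_question.get(r.question_id, False)
        flipped_by_question[r.question_id] = previous or r.flipped
    if not flipped_by_question:
        raise ValueError("pooled flip rate is undefined for an empty record set")
    return 100.0 * sum(flipped_by_question.values()) / len(flipped_by_question)
```

Under this reading, the pooled rate can never be lower than the flip rate of any single source model evaluated on the same questions, which matches the direction of the pooling result reported in the abstract.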

Eval Frameworks & Benchmarks | Natural Language Processing | Red-Teaming & Adversarial Robustness

Year: 2026

Venue: N/A
