This paper introduces a massive-option evaluation protocol for multiple-choice benchmarks, scaling the candidate set to 100 options to reduce the impact of chance performance and expose limitations in LLMs. The authors apply this framework to a Korean orthography error-detection task, showing that strong performance in low-option settings often overstates model competence. The study identifies semantic confusion and position bias as key failure modes and suggests that candidate ranking, rather than context length, is the primary bottleneck.
LLMs that ace standard multiple-choice tests can crumble when the option count grows to 100, revealing hidden weaknesses in semantic understanding and a bias toward early answer options when the model is uncertain.
Multiple-choice evaluation is widely used for benchmarking large language models, yet near-ceiling accuracy in low-option settings can be sustained by shortcut strategies that obscure true competence. We therefore propose a massive-option evaluation protocol that scales the candidate set to one hundred options and sharply reduces the impact of chance performance. We apply this framework to a Korean orthography error-detection task in which models must pick the single incorrect sentence from a large candidate set. With fixed targets and repeated resampling and shuffling, we obtain stable estimates while separating content-driven failures from positional artifacts. Across experiments, results indicate that strong performance in low-option settings can overstate model competence. This apparent advantage often weakens under dense interference at high $N$, revealing gaps that conventional benchmarks tend to obscure. We identify two failure modes: semantic confusion, and position bias toward early options under uncertainty. To isolate the effect of context length, we run padding-controlled and length-matched tests, which suggest that the main bottleneck is candidate ranking rather than context length. Together, these findings support massive-option evaluation as a general framework for stress-testing model reliability under extreme distractor density, beyond what low-option benchmarks can reveal.
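The fixed-target, resample-and-shuffle procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`build_trial`, `evaluate`), the `choose` callback standing in for a model, and the default trial count are all assumptions made for illustration. The sketch keeps one target (the single incorrect sentence) fixed, redraws the $N - 1$ distractors on every trial, shuffles the options, and records which position was picked so that content-driven errors can be separated from positional ones.

```python
import random
from collections import Counter


def build_trial(target, distractor_pool, n_options, rng):
    """Draw n_options - 1 distinct distractors and shuffle the target in.

    Assumes the pool does not contain the target and holds at least
    n_options - 1 distinct items.
    """
    distractors = rng.sample(distractor_pool, n_options - 1)
    options = distractors + [target]
    rng.shuffle(options)
    return options, options.index(target)


def evaluate(choose, target, distractor_pool, n_options=100, n_trials=200, seed=0):
    """Repeat resampling and shuffling for one fixed target.

    `choose` is a hypothetical stand-in for a model call: it receives the
    list of options and returns the index it believes is the target.
    """
    rng = random.Random(seed)
    correct = 0
    picked_positions = Counter()  # how often each option index was chosen
    for _ in range(n_trials):
        options, target_idx = build_trial(target, distractor_pool, n_options, rng)
        picked = choose(options)
        picked_positions[picked] += 1
        correct += int(picked == target_idx)
    return {
        "accuracy": correct / n_trials,
        "chance": 1 / n_options,  # random-guess baseline for N options
        "picked_positions": picked_positions,
    }


if __name__ == "__main__":
    # Placeholder data, not taken from the paper's benchmark.
    pool = [f"correct sentence {i}" for i in range(300)]
    target = "incorrect sentence"

    # A degenerate chooser that always picks the first option, mimicking
    # an extreme position bias toward early options.
    result = evaluate(lambda options: 0, target, pool)
    print(result["accuracy"], result["chance"])
```

Because the target is shuffled uniformly, a chooser that always returns index 0 should score near the chance level of $1/N$, while its `picked_positions` histogram collapses onto a single index. Comparing that histogram against a uniform one is one simple way to flag positional artifacts independently of accuracy.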