This paper investigates the extent of contamination in LLM benchmarks by auditing six frontier models (GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, and Qwen3-235B) using lexical detection, paraphrase diagnostics, and TS-Guessing behavioral probes on the MMLU dataset. The study finds significant contamination, with up to 66.7% of Philosophy questions flagged as contaminated and accuracy dropping by an average of 7.0 percentage points under indirect reference. The experiments reveal a consistent contamination ranking across domains, highlighting the risk of overestimating LLM capabilities due to benchmark leakage.
LLMs' apparent superhuman performance on benchmarks may be partly a mirage: contamination inflates scores by nearly 20 percentage points in some domains, exposing a critical flaw in current evaluation practices.
Public leaderboards increasingly suggest that large language models (LLMs) surpass human experts on benchmarks spanning academic knowledge, law, and programming. Yet most benchmarks are fully public, with questions widely mirrored across the internet, creating a systematic risk that models were trained on the very data used to evaluate them. This paper presents three complementary experiments forming a rigorous multi-method contamination audit of six frontier LLMs: GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, and Qwen3-235B. Experiment 1 applies a lexical contamination detection pipeline to 513 MMLU questions across all 57 subjects, finding an overall contamination rate of 13.8% (18.1% in STEM, up to 66.7% in Philosophy) and estimated performance gains of +0.030 to +0.054 accuracy points by category. Experiment 2 applies a paraphrase and indirect-reference diagnostic to 100 MMLU questions, finding that accuracy drops by an average of 7.0 percentage points under indirect reference, rising to 19.8 pp in both Law and Ethics. Experiment 3 applies TS-Guessing behavioral probes to all 513 questions and all six models, finding that 72.5% of questions trigger memorization signals far above chance, with DeepSeek-R1 displaying a distributed memorization signature (76.6% partial reconstruction, 0% verbatim recall) that explains its anomalous Experiment 2 profile. All three experiments converge on the same contamination ranking: STEM > Professional > Social Sciences > Humanities.
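The lexical detection approach of Experiment 1 can be illustrated with a minimal sketch: flag a benchmark question as potentially contaminated when a long n-gram from its text appears verbatim in a reference corpus. The n-gram length, tokenization, and toy corpus below are illustrative assumptions, not the paper's actual pipeline.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def lexical_overlap(question: str, corpus: str, n: int = 8) -> bool:
    """True if any n-token span of `question` occurs verbatim in `corpus`.

    Whitespace tokenization and a fixed n are simplifying assumptions.
    """
    q = question.lower().split()
    c = corpus.lower().split()
    if len(q) < n or len(c) < n:
        return False
    return bool(ngrams(q, n) & ngrams(c, n))

# Toy example: a question whose text was mirrored into training data.
corpus = "mirror dump ... which philosopher wrote the critique of pure reason and when ..."
leaked = "Which philosopher wrote the Critique of Pure Reason?"
fresh = "Name the author of a famous 1781 treatise on epistemology."
print(lexical_overlap(leaked, corpus, n=5))  # True
print(lexical_overlap(fresh, corpus, n=5))   # False
```

A production pipeline would normalize punctuation and search a web-scale index rather than a single string, but the overlap test is the core signal.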
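The TS-Guessing probe of Experiment 3 can likewise be sketched: mask one answer option, ask the model to reconstruct it, and treat exact or near-exact reconstruction as a memorization signal (the "verbatim" vs. "partial" distinction mirrors the DeepSeek-R1 signature reported above). The prompt template, the 0.5 word-overlap threshold, and the absence of a real model call are assumptions for illustration.

```python
def build_probe(question: str, options: list[str], hidden_idx: int) -> str:
    """Render a multiple-choice question with one option replaced by [MASKED]."""
    shown = [("[MASKED]" if i == hidden_idx else o) for i, o in enumerate(options)]
    lines = [f"{chr(65 + i)}. {o}" for i, o in enumerate(shown)]
    return f"{question}\n" + "\n".join(lines) + "\nFill in the [MASKED] option verbatim."

def score_guess(guess: str, hidden: str) -> str:
    """Classify a model's guess against the hidden option.

    'verbatim' = exact match, 'partial' = >=50% word overlap (assumed
    threshold), 'none' otherwise.
    """
    g, h = guess.strip().lower(), hidden.strip().lower()
    if g == h:
        return "verbatim"
    overlap = len(set(g.split()) & set(h.split())) / max(len(h.split()), 1)
    return "partial" if overlap >= 0.5 else "none"

# The probe string would be sent to each model; here we only score
# hypothetical responses against the hidden option "pure reason".
print(score_guess("Pure Reason", "pure reason"))  # verbatim
print(score_guess("pure intuition", "pure reason"))  # partial
print(score_guess("banana", "pure reason"))  # none
```

Aggregating these labels over all masked options yields per-model rates like the 76.6% partial / 0% verbatim profile the paper reports for DeepSeek-R1.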