Mar 17, 2026

arXiv:2603.16798

High-Dimensional Gaussian Mean Estimation under Realizable Contamination

Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas

AI Summary

This paper studies the problem of estimating the mean of a Gaussian distribution with identity covariance in high dimensions under a realizable $\epsilon$-contamination model, where each sample $x$ goes missing with probability $r(x)$ for a function $r$ taking values between 0 and $\epsilon$ that is chosen by an adversary. The authors prove an information-computation gap in the Statistical Query (SQ) model, showing that algorithms must either use substantially more samples than information-theoretically necessary or incur exponential runtime. They also provide an algorithm whose sample-time tradeoff nearly matches the lower bound, which together qualitatively characterizes the complexity of the problem.

Key Contribution

Even under a realizable missing data model, estimating the mean of a high-dimensional Gaussian requires, for Statistical Query algorithms, either substantially more samples than information-theoretically necessary or exponential runtime, revealing a fundamental information-computation tradeoff.

Abstract

We study mean estimation for a Gaussian distribution with identity covariance in $\mathbb{R}^d$ under a missing data scheme termed the realizable $\epsilon$-contamination model. In this model, an adversary can choose a function $r(x)$ taking values between 0 and $\epsilon$, and each sample $x$ goes missing with probability $r(x)$. Recent work (Ma et al., 2024) proposed this model as an intermediate-strength setting between Missing Completely At Random (MCAR) -- where missingness is independent of the data -- and Missing Not At Random (MNAR) -- where missingness may depend arbitrarily on the sample values and can lead to non-identifiability issues. That work established information-theoretic upper and lower bounds for mean estimation in the realizable contamination model. Their proposed estimators incur runtime exponential in the dimension, leaving open the possibility of computationally efficient algorithms in high dimensions. In this work, we establish an information-computation gap in the Statistical Query model (and, as a corollary, for Low-Degree Polynomials and PTF tests), showing that algorithms must either use substantially more samples than information-theoretically necessary or incur exponential runtime. We complement our SQ lower bound with an algorithm whose sample-time tradeoff nearly matches our lower bound. Together, these results qualitatively characterize the complexity of Gaussian mean estimation under realizable $\epsilon$-contamination.
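The missingness mechanism described in the abstract can be illustrated with a small simulation. The following is a minimal sketch using only the Python standard library, not the paper's algorithm: the adversary `example_adversary` (which hides samples with a positive first coordinate at the maximum allowed rate) and all function names are hypothetical choices made here for illustration. The naive average of the surviving samples is included only to show that such a missingness function can bias a simple estimator.

```python
import random
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def sample_gaussian(mean: Sequence[float], rng: random.Random) -> Vector:
    """Draw one sample from N(mean, I), coordinate by coordinate."""
    return [rng.gauss(m, 1.0) for m in mean]


def realizable_contamination(
    samples: Sequence[Vector],
    r: Callable[[Vector], float],
    eps: float,
    rng: random.Random,
) -> List[Optional[Vector]]:
    """Hide each sample x independently with probability r(x), where 0 <= r(x) <= eps.

    Missing samples are represented as None.
    """
    observed: List[Optional[Vector]] = []
    for x in samples:
        p = r(x)
        if not 0.0 <= p <= eps:
            raise ValueError("r(x) must lie in [0, eps]")
        observed.append(None if rng.random() < p else x)
    return observed


def example_adversary(eps: float) -> Callable[[Vector], float]:
    """Hypothetical adversary (an assumption, not from the paper):
    hide points with positive first coordinate at the maximum rate eps."""
    def r(x: Vector) -> float:
        return eps if x[0] > 0 else 0.0
    return r


def naive_mean(observed: Sequence[Optional[Vector]]) -> Vector:
    """Average of the non-missing samples; a baseline, not the paper's estimator."""
    kept = [x for x in observed if x is not None]
    dim = len(kept[0])
    return [sum(x[i] for x in kept) / len(kept) for i in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    true_mean = [0.0, 0.0, 0.0]
    eps = 0.3
    samples = [sample_gaussian(true_mean, rng) for _ in range(20000)]
    observed = realizable_contamination(samples, example_adversary(eps), eps, rng)
    print("fraction missing:", sum(x is None for x in observed) / len(observed))
    print("naive mean of observed samples:", naive_mean(observed))
```

With this particular adversary, the first coordinate of the naive mean is pulled toward negative values while the other coordinates stay near zero, which shows why data-dependent missingness is harder to handle than the MCAR setting even when the missingness rate is capped at $\epsilon$.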

Data Curation & Synthetic Data, Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
