Stanford HAIDepartment of Computer ScienceDepartment of Data Sciences and OperationsDepartment of Statistics and DataUSCMar 17, 2026arXiv:2603.16712

High-dimensional estimation with missing data: Statistical and computational limits

Kabir Aladin Verchand, Ankit Pensia, Saminul Haque, Rohith Kuditipudi

AI Summary

This paper investigates the statistical and computational limits of high-dimensional estimation problems (mean, covariance, and linear regression) when data is subject to missing data under a realizable contamination model. For Gaussian mean and covariance estimation, the authors demonstrate statistical-computational gaps, showing that while inefficient algorithms can achieve a certain error rate with a specific sample complexity, computationally efficient methods require significantly more samples. However, for linear regression with missing data, they prove that a simple, strongly convex empirical risk minimization achieves near-optimal performance in polynomial time, indicating no such gap.

Key Contribution

Expect to pay an exponential sample complexity price for computationally efficient mean and covariance estimation with missing data, but not for linear regression.

Abstract

We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data in which an $ε$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $ρ$, for any constant contamination $ε\in (0, 1)$, (roughly) $n \gtrsim d e^{1/ρ^2}$ samples are necessary and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/ρ^2}$ and that there exists a polynomial time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.

Data Curation & Synthetic Data Training Efficiency & Optimization

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

High-dimensional estimation with missing data: Statistical and computational limits

Related Papers