The paper investigates the vulnerability of AI age estimation systems to low-cost cosmetic attacks designed to make minors appear older. The authors simulate cosmetic changes (beards, grey hair, makeup, and wrinkles) on 329 facial images of individuals aged 10 to 21 using a VLM image editor, then evaluate the impact on eight age estimation models. These simple modifications substantially increase the likelihood that minors are classified as adults: combining all four attacks reaches up to 83% Attack Conversion Rate (ACR), depending on the model.
A synthetic beard alone flips up to 69% of an AI age estimator's baseline "minor" predictions to "adult", exposing a critical flaw in current age-verification systems.
Age estimation systems are increasingly deployed as gatekeepers for age-restricted online content, yet their robustness to cosmetic modifications has not been systematically evaluated. We investigate whether simple, household-accessible cosmetic changes, including beards, grey hair, makeup, and simulated wrinkles, can cause AI age estimators to classify minors as adults. To study this threat at scale without ethical concerns, we simulate these physical attacks on 329 facial images of individuals aged 10 to 21 using a VLM image editor (Gemini 2.5 Flash Image). We then evaluate eight models from our prior benchmark: five specialized architectures (MiVOLO, Custom-Best, Herosan, MiViaLab, DEX) and three vision-language models (Gemini 3 Flash, Gemini 2.5 Flash, GPT-5-Nano). We introduce the Attack Conversion Rate (ACR), defined as the fraction of images predicted as minor at baseline that flip to adult after attack, a population-agnostic metric that does not depend on the ratio of minors to adults in the test set. Our results reveal that a synthetic beard alone achieves 28 to 69 percent ACR across all eight models; combining all four attacks shifts predicted age by +7.7 years on average across all 329 subjects and reaches up to 83 percent ACR; and vision-language models exhibit lower ACR (59 to 71 percent) than specialized models (63 to 83 percent) under the full attack, although the ACR ranges overlap and the difference is not statistically tested. These findings highlight a critical vulnerability in deployed age-verification pipelines and call for adversarial robustness evaluation as a mandatory criterion for model selection.
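To make the ACR definition concrete, here is a minimal sketch of how it could be computed from paired age predictions. It is not the authors' code: the function name `attack_conversion_rate`, the toy prediction lists, and the adult cutoff of 18 years are all assumptions made for illustration, since the abstract does not state the threshold used.

```python
from typing import Sequence

# Assumed cutoff: the abstract does not say which age threshold defines "adult".
ADULT_AGE = 18.0


def attack_conversion_rate(
    baseline_ages: Sequence[float],
    attacked_ages: Sequence[float],
    adult_age: float = ADULT_AGE,
) -> float:
    """Fraction of images predicted minor at baseline that are predicted adult after attack."""
    if len(baseline_ages) != len(attacked_ages):
        raise ValueError("baseline and attacked predictions must be paired")

    # Only images the model already called a minor before the attack are eligible.
    eligible = [
        (before, after)
        for before, after in zip(baseline_ages, attacked_ages)
        if before < adult_age
    ]
    if not eligible:
        raise ValueError("no image was predicted as minor at baseline")

    # Count eligible images whose prediction crosses the cutoff after the attack.
    flipped = sum(1 for _, after in eligible if after >= adult_age)
    return flipped / len(eligible)


# Hypothetical predicted ages for five images (illustrative values, not paper data).
baseline = [12.0, 15.5, 17.0, 19.0, 16.0]
attacked = [17.0, 21.0, 24.5, 26.0, 18.0]

acr = attack_conversion_rate(baseline, attacked)
print(f"ACR = {acr:.2f}")  # prints "ACR = 0.75"
```

In this toy example, four images are predicted as minors at baseline and three of them cross the cutoff after the attack, so the ACR is 3/4. The fourth image (baseline prediction 19.0) is excluded from the denominator, which is why the metric does not depend on how many adults the test set contains.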