TU Darmstadt | May 27, 2026 | arXiv:2605.28137

No Safe Dose: How Training Data Drives Unsafe Image Generation

Felix Friedrich, Lukas Helff, Niharika Hegde, Patrick Schramowski

AI Summary

This paper investigates the direct impact of unsafe training data on the safety of generated images from text-to-image models. By training models on datasets with varying proportions of unsafe images (0-9.6%), the authors demonstrate a monotonic increase in unsafe outputs, even with relatively small amounts of contamination. They further show that the proportion of unsafe data, not the absolute amount, is the key driver, and that even with a completely safe training set, a baseline level of unsafety remains due to other components like the text encoder.

Key Contribution

Even a small dose of unsafe images in the training data (as little as 5%) measurably increases unsafe generations in text-to-image models, raising the unsafe-output rate from 16.6% to 25.5%, regardless of dataset size.
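To put the size of this effect in perspective, the two rates reported in the abstract (16.6% unsafe outputs at 0% contamination, 25.5% at 5%) can be converted into an absolute and a relative change. The following is a minimal sketch of that arithmetic; the function name `effect_size` is illustrative and not taken from the paper.

```python
def effect_size(baseline_pct, treated_pct):
    """Return (absolute change in percentage points, relative change in percent)."""
    abs_pp = treated_pct - baseline_pct
    rel_pct = 100.0 * abs_pp / baseline_pct
    return round(abs_pp, 1), round(rel_pct, 1)


# Rates quoted in the abstract: 16.6% at 0% contamination, 25.5% at 5%.
abs_pp, rel_pct = effect_size(16.6, 25.5)
print(f"absolute change: {abs_pp} percentage points")
print(f"relative change: {rel_pct}% over the baseline")
```

On these figures, 5% contamination adds about 8.9 percentage points, which is roughly a 54% relative increase over the zero-contamination baseline.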

Abstract

Text-to-image models trained on large-scale data almost inevitably ingest unsafe content. While some have observed input-output amplification, it remains unclear whether and how training data composition directly drives model output safety, or whether other factors are responsible. We shed light on this question by isolating this variable: we train the same text-to-image model on datasets that differ only in their fraction of unsafe images (0% to 9.6%), across several dataset scales (100K to 8M). We then generate images with the resulting models and evaluate them with four independent safety classifiers. Output unsafety rises monotonically from 16.6% at 0% contamination to 25.5% at 5%. A factorial design reveals that the proportion, not the absolute count, of unsafe training images is the operative variable. The 16.6% baseline that persists at zero contamination implicates the other components, e.g., the frozen text encoder, as a residual safety risk. A text encoder ablation confirms this: SafeCLIP reduces the floor to 9.6%, while the dose-response effect persists across all three encoders tested. Critically, safety filtering causes no quality degradation in terms of FID, CLIPScore, and ImageReward. These results establish that data curation and text encoder safety are complementary and independently effective interventions. At the same time, the remaining level of unsafety raises questions for future research about emerging capabilities and compositionality.
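The experimental design described above can be illustrated with a short sketch. It builds training mixes that differ only in their unsafe fraction and lays out a factorial grid in which the same absolute number of unsafe images occurs at different proportions. This is an illustration under stated assumptions, not the authors' pipeline: the helper names `build_mix` and `factorial_grid`, the ID pools, and the specific sizes and fractions are all hypothetical, chosen only so that the grid separates proportion from count.

```python
import random
from itertools import product


def build_mix(safe_ids, unsafe_ids, total, unsafe_fraction, seed=0):
    """Sample `total` training IDs with the requested share of unsafe items."""
    n_unsafe = round(total * unsafe_fraction)
    n_safe = total - n_unsafe
    if n_unsafe > len(unsafe_ids) or n_safe > len(safe_ids):
        raise ValueError("pool too small for the requested mix")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mix = rng.sample(safe_ids, n_safe) + rng.sample(unsafe_ids, n_unsafe)
    rng.shuffle(mix)
    return mix


def factorial_grid(sizes, fractions):
    """Cross dataset sizes with unsafe fractions and record absolute counts."""
    return [
        {"size": s, "fraction": f, "unsafe_count": round(s * f)}
        for s, f in product(sizes, fractions)
    ]


# Hypothetical pools of image IDs (assumption: IDs are already labeled).
safe_pool = list(range(1_000))
unsafe_pool = list(range(1_000, 1_100))
mix = build_mix(safe_pool, unsafe_pool, total=200, unsafe_fraction=0.05)

# Illustrative grid values only; the paper's exact levels are not given here.
grid = factorial_grid(sizes=[100_000, 200_000], fractions=[0.0, 0.05, 0.10])
for cell in grid:
    print(cell)
```

In this illustrative grid, the 100K set at 10% and the 200K set at 5% contain the same number of unsafe images but different proportions. Comparing output unsafety across such cells is what allows a factorial design to attribute the effect to proportion or to absolute count.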

Data Curation & Synthetic Data | Multimodal Models | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
