Apr 28, 2026 | arXiv:2604.25119

Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM

Vinith M. Suriyakumar, Ayush Sekhari, Lena Stempfle, Robertson Wang, Michael Simpson, Rebecca S. Portnoff, Marzyeh Ghassemi, Ashia C. Wilson

AI Summary

The paper introduces "Evaluation without Generation," a framework for assessing harmful model specialization by analyzing a model's internal representations rather than its generated outputs. This addresses the limits of output-based auditing for open-weight generative models, particularly in domains like CSAM where generation is legally constrained. The authors propose Gaussian probing, which characterizes how LoRA adaptors perturb a model's internal representations by measuring responses to Gaussian latent ensembles. Experiments indicate that Gaussian probing distinguishes benign from harmful specialization in high-risk domains and remains robust to adversarial manipulations such as weight rescaling, offering a scalable alternative to output-based evaluation.

Key Contribution

Harmful specialization in fine-tuned generative models, including specialization for CSAM, can be detected from the model's internal state without generating a single output.

Abstract

Auditing the fine-tunes of open-weight generative models for harmful specialization has become a new governance challenge for model hosting platforms. The standard toolkit, generative evaluation via curated prompts or red-teaming, does not scale to platform-level auditing and breaks down entirely for domains like CSAM where generation is legally constrained. This motivates the Evaluation without Generation problem: assessing model capabilities without producing outputs. We argue that in such settings, capability must be inferred from the model's state, either its parameters or internal representations, rather than its outputs. We introduce Gaussian probing, a method that characterizes how LoRA adaptors perturb a model's internal representations by measuring responses to Gaussian latent ensembles. Unlike raw-weight baselines, Gaussian probing reliably distinguishes benign from harmful specialization without sampling outputs. We demonstrate effectiveness in high-risk domains, including detecting models specialized for child sexual abuse material (CSAM), where output-based evaluation is legally and ethically constrained. Our results show that Gaussian probing provides a scalable non-generative alternative for evaluating high-risk generative systems and remains robust to weight rescaling, a representative adversarial manipulation.
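The abstract describes Gaussian probing only at a high level, so the following is a minimal illustrative sketch of the general idea rather than the authors' method. It assumes a single linear layer with base weight `W` and a LoRA update `B @ A`, feeds the layer an ensemble of standard Gaussian inputs, and summarizes how large the adaptor's response is relative to the base layer's response. The function name `gaussian_probe`, the choice of statistic, and the specific form of rescaling (multiplying `B` by a constant and dividing `A` by the same constant) are all assumptions made for illustration.

```python
import numpy as np

def gaussian_probe(W, A, B, n_samples=1024, seed=0):
    """Hypothetical probe: relative response of a LoRA update to Gaussian inputs.

    W: base weight, shape (d_out, d_in)
    A: LoRA down-projection, shape (r, d_in)
    B: LoRA up-projection, shape (d_out, r)
    No model outputs are sampled; only layer responses are measured.
    """
    rng = np.random.default_rng(seed)
    # Ensemble of standard Gaussian probe vectors, one per row.
    Z = rng.standard_normal((n_samples, W.shape[1]))
    base = Z @ W.T               # response of the base layer
    delta = (Z @ A.T) @ B.T      # response contributed by the adaptor
    # Per-sample ratio of adaptor response norm to base response norm.
    ratio = np.linalg.norm(delta, axis=1) / np.linalg.norm(base, axis=1)
    return {"mean_ratio": float(ratio.mean()), "std_ratio": float(ratio.std())}

# Synthetic weights, used purely for illustration (not real model weights).
rng = np.random.default_rng(1)
W = rng.standard_normal((64, 32))
A = rng.standard_normal((4, 32)) * 0.1
B = rng.standard_normal((64, 4)) * 0.1

stats = gaussian_probe(W, A, B)

# Assumed form of weight rescaling: scale B up and A down by the same factor.
# The product B @ A is unchanged, so the probe statistics should match.
c = 10.0
rescaled = gaussian_probe(W, A / c, B * c)

print(stats)
print(rescaled)
```

Because this statistic depends on the adaptor only through the product `B @ A`, it is unchanged by the rescaling shown here, whereas a feature computed from `A` or `B` separately would shift. Whether this matches the rescaling attack or the statistics studied in the paper is not stated in the abstract.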

Constitutional AI & AI Ethics | Eval Frameworks & Benchmarks | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
