AI Safety Argentina (AISAR), Goodfire | Mar 4, 2026 | arXiv:2603.04045

Inference-Time Toxicity Mitigation in Protein Language Models

Manuel Fernández Burda, Santiago Aranguri, Iván Arcuschin Moreno, Enzo Ferrante

AI Summary

The authors demonstrate that domain adaptation of Protein Language Models (PLMs) to specific taxonomic groups can inadvertently elicit the generation of toxic protein sequences. To mitigate this, they adapt Logit Diff Amplification (LDA), an inference-time technique that modifies token probabilities based on the logit difference between a baseline PLM and a toxicity-finetuned PLM, without requiring retraining. Experiments across four taxonomic groups show that LDA effectively reduces predicted toxicity while maintaining biological plausibility and structural viability, as measured by Fréchet ESM Distance and pLDDT.

Key Contribution

You can dial down toxicity in protein-generating AI without sacrificing the quality or foldability of the proteins it designs.

Abstract

Protein language models (PLMs) are becoming practical tools for de novo protein design, yet their dual-use potential raises safety concerns. We show that domain adaptation to specific taxonomic groups can elicit toxic protein generation, even when toxicity is not the training objective. To address this, we adapt Logit Diff Amplification (LDA) as an inference-time control mechanism for PLMs. LDA modifies token probabilities by amplifying the logit difference between a baseline model and a toxicity-finetuned model, requiring no retraining. Across four taxonomic groups, LDA consistently reduces predicted toxicity rate (measured via ToxDL2) below the taxon-finetuned baseline while preserving biological plausibility. We evaluate quality using Fréchet ESM Distance and predicted foldability (pLDDT), finding that LDA maintains distributional similarity to natural proteins and structural viability (unlike activation-based steering methods that tend to degrade sequence properties). Our results demonstrate that LDA provides a practical safety knob for protein generators that mitigates elicited toxicity while retaining generative quality.
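The abstract describes LDA only in words, so the following is a minimal sketch of one plausible reading rather than the authors' implementation. It assumes the steered logits take the form `base + alpha * (base - toxic)`, which pushes probability away from tokens the toxicity-finetuned model prefers. The function names, the strength parameter `alpha`, and the three-token toy vocabulary are all illustrative assumptions, not taken from the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lda_logits(base_logits, toxic_logits, alpha):
    """Assumed LDA form: amplify the (base - toxic) logit difference.

    alpha = 0 recovers the baseline model; larger alpha steers harder
    away from what the toxicity-finetuned model favors.
    """
    if len(base_logits) != len(toxic_logits):
        raise ValueError("both models must score the same vocabulary")
    return [b + alpha * (b - t) for b, t in zip(base_logits, toxic_logits)]


# Toy next-token logits over a hypothetical 3-token vocabulary.
# The toxicity-finetuned model strongly prefers token 1.
base = [1.0, 0.5, 0.0]
toxic = [1.0, 2.0, 0.0]

p_base = softmax(base)
p_steered = softmax(lda_logits(base, toxic, alpha=1.0))
```

In this toy case the steered distribution assigns less probability to token 1 than the baseline does, and neither model is retrained: only the per-step logits are combined at sampling time. In practice both sets of logits would come from forward passes of the two PLMs on the same prefix.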

Constitutional AI & AI Ethics | Red-Teaming & Adversarial Robustness | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
