CUHK, HKUST | Jun 17, 2026 | arXiv:2606.18961

Be Your Own Teacher: Steering Protein Language Models via Unsupervised Reward Optimization

Lanqing Li, Shentong Mo, Yang Yu, Pheng-Ann Heng

AI Summary

This paper introduces unsupervised reward optimization for protein language models (PLMs), enabling controllable protein generation without the need for costly labeled datasets or wet-lab validation. The authors demonstrate that task-agnostic rewards, derived from model uncertainty and semantic consistency, correlate strongly with controllability, leading to the development of two effective offline algorithms: Soft Reward Optimization (SRO) and Binarized Reward Optimization (BRO). Experimental results show that these methods significantly outperform existing baselines and achieve near-oracle performance across various conditions, highlighting their potential for scalable biomolecular design.
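As a rough illustration of how a task-agnostic reward might combine an uncertainty term with a consistency term, the sketch below scores a generated sequence by its negative mean token entropy plus the cosine similarity between two embeddings. This is not the paper's reward definition: the function names, the additive form, and the `weight` parameter are all assumptions made for illustration, and the embeddings are presumed to come from some protein representation model that is not specified here.

```python
import math
from typing import Sequence


def mean_token_entropy(token_probs: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-position token distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_probs
    ]
    return sum(entropies) / len(entropies)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def proxy_reward(
    token_probs: Sequence[Sequence[float]],
    generated_embedding: Sequence[float],
    target_embedding: Sequence[float],
    weight: float = 1.0,
) -> float:
    """Hypothetical proxy reward: low uncertainty and high consistency score higher.

    The additive combination and `weight` are illustrative assumptions,
    not the formulation used in the paper.
    """
    uncertainty = mean_token_entropy(token_probs)
    consistency = cosine_similarity(generated_embedding, target_embedding)
    return -uncertainty + weight * consistency
```

Under these assumptions, a sequence generated with fully confident token distributions and an embedding aligned with the target receives the maximum score of `weight`, while a high-entropy, misaligned sequence scores lower.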

Key Contribution

Unsupervised reward optimization allows protein language models to self-improve, achieving near-oracle performance without the need for labeled data.

Abstract

Protein language models (PLMs) have emerged as powerful tools for controllable biomolecular design, yet their post-training adaptation typically relies on costly wet-lab validation or curated preference datasets. To overcome this supervision bottleneck, we introduce unsupervised reward optimization of PLMs, a comprehensive framework for steerable protein generation without ground-truth labels. Our key insight is that task-agnostic rewards, which combine intrinsic model uncertainty with extrinsic semantic consistency informed by protein representation models, exhibit strong correlation with controllability measures across base models and temperature regimes. Building upon this discovery, we propose two offline algorithms: Soft Reward Optimization (SRO) and Binarized Reward Optimization (BRO), which effectively maximize the classical RLHF objective induced by these proxy rewards. Extensive experiments on compositional out-of-distribution prompts demonstrate that both methods significantly outperform competitive baselines (DPO, KTO), while approaching oracle performance across multiple sampling temperatures, model scales and protein families. Moreover, PLMs fine-tuned with unsupervised rewards can achieve consistently higher coverage compared to their base model in pass@k evaluations. By enabling self-improvement of PLMs through their own generated experience, our framework provides a scalable pathway toward controllable biomolecular design in settings where labeled preferences or experimental feedback are scarce or unavailable.
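For readers unfamiliar with pass@k, the sketch below shows one common way to estimate it: the combinatorial estimator widely used in code-generation benchmarks, which computes the probability that at least one of k samples drawn without replacement from n generations is a success. Whether the paper uses this exact estimator, and what counts as a "pass" for a generated protein, are not stated here; both are assumptions, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are successes.

    Returns the probability that a random subset of k samples
    (drawn without replacement from the n) contains at least one success.
    The success criterion itself is assumed to be decided elsewhere.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Coverage in this sense grows with k: with 10 samples and 5 successes, `pass_at_k(10, 5, 1)` is 0.5, and larger k can only raise the estimate.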

RLHF & Preference Learning | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
