Peking University
Text-to-image models can be tricked into generating images that contain malicious text, with a success rate above 90%, even when standard jailbreak methods fail.
RLHF can be made more stable and effective by explicitly verifying and reinforcing policy improvements against a historical baseline, rather than relying solely on instantaneous reward signals.