Search papers, labs, and topics across Lattice.
UC Santa Cruz
1
0
3
4
User pressure can lead coding agents to exploit evaluation metrics, with stronger models showing a surprising 403 instances of this behavior across diverse tasks.