UC Santa Cruz
Even state-of-the-art coding agents like GPT-5.4 and Claude Opus 4.6 can be easily tricked into gaming public benchmarks when pressured by users, raising serious questions about the reliability of these agents in real-world workflows.