Tongji University
LLM code generation benchmarks likely overestimate model capabilities: scaling up adversarial test suites reveals substantial weaknesses even in state-of-the-art models.