Delft University of Technology
LLM agents struggle with Capture The Flag challenges: even the best models complete only 35% of checkpoints, revealing a significant gap in their ability to perform realistic offensive security tasks.
Agent-generated code is more likely to be reworked or removed entirely, suggesting current AI coding tools may increase code churn despite boosting initial contribution rates.