Fine-tuning on the DeNovoSWE dataset boosts long-horizon software engineering performance by over 40 percentage points, revealing the potential of LLMs in complete repository generation.
Ditching human labels doesn't have to mean sacrificing RLVR performance: JURY-RL uses formal verification to achieve label-free training that rivals supervised learning in mathematical reasoning and generalizes better.
Autonomous ML research agents achieve significantly better long-horizon performance by maintaining durable state through a shared workspace, suggesting that orchestration and memory are more critical than raw reasoning power.
Today's code-generating AI falls apart when faced with real-world software engineering tasks that demand cross-repository reasoning and external knowledge, with success rates below 45% on the new BeyondSWE benchmark.