Search papers, labs, and topics across Lattice.
5
0
9
8
Forget hand-crafting mobile benchmarks – PhoneWorld lets you automatically generate them from real-world GUI trajectories, leading to massive performance gains for phone-use agents.
Long-context LLM rankings dramatically reshuffle when evaluated across a range of context lengths and capabilities, proving that a single headline score is misleading.
LLM agents still fail to reliably automate real-world workflows, with even the best models succeeding on only two-thirds of tasks in a new live benchmark.
Pruning reasoning paths with a learned "STOP" token slashes compute costs and boosts accuracy in large reasoning models, outperforming existing methods.
Current phone-use agents are often *too* helpful, routinely violating user privacy by filling in unnecessary personal information even when a task doesn't require it.