Forget hand-crafting mobile benchmarks – PhoneWorld lets you automatically generate them from real-world GUI trajectories, leading to massive performance gains for phone-use agents.
RL fine-tuning can make your role-playing agent *worse* at embodying its character, unless you carefully balance task rewards with stylistic constraints.
LLM agents still fail to reliably automate real-world workflows, with even the best models succeeding on only two-thirds of tasks in a new live benchmark.
Pruning reasoning paths with a learned "STOP" token slashes compute costs and boosts accuracy in large reasoning models, outperforming existing methods.
Current phone-use agents are often *too* helpful, routinely violating user privacy by filling in unnecessary personal information even when a task doesn't require it.