Search papers, labs, and topics across Lattice.
3
0
7
8
Sentence-level watermarks can now survive aggressive paraphrasing attacks like sentence splitting and merging, thanks to a new alignment-based approach.
Weak-to-strong reward models can ace the test but still fail in the real world, revealing a hidden brittleness in current preference learning approaches.
A dedicated guard agent, trained via reasoning-intensive methods, can effectively neutralize prompt injection attacks in web-navigating agents without sacrificing performance.