Search papers, labs, and topics across Lattice.
2
0
5
Unsupervised RL for text generation doesn't have to collapse into gibberish: rewarding relative information gain between specialist and generalist policies unlocks meaningful content creation.
On-policy RL (GRPO) makes LLMs significantly better at vulnerability detection than SFT or preference optimization, outperforming even strong zero-shot baselines.