Don't let valuable steps in failed trajectories go unnoticed: GraphGPO leverages state-transition graphs for fine-grained credit assignment in agentic RL, boosting performance and efficiency.
Chain-of-thought prompting makes large language models smarter but also less safe; this paper tackles the problem by forcing models to think about safety *before* reasoning.
Context inconsistency in stepwise group-based RL can severely bias advantage estimation, but a hierarchical grouping strategy can fix it without extra compute.
Overcome simplicity bias in RL agents with PA-MoE, a mixture-of-experts architecture that learns task phases directly from the RL objective, leading to better expert specialization.