The finding that chain-of-thought (CoT) reasoning transitions from uncertainty to high reliability opens the door to more efficient inference strategies that save computational resources without sacrificing accuracy.
Ditch hard clipping: ratio-variance regularization offers a principled, "soft brake" approach to trust region policy optimization, unlocking substantial gains in sample efficiency and performance, especially for smaller LLMs.
Plasticity loss in RL isn't just about forgetting; it's about vanishing gradients, and a simple sample re-weighting can bring back the learning.