LLM-based judges, widely used for automated evaluation, exhibit a range of biases, which bias-aware training with reinforcement learning and contrastive learning can significantly reduce.
Token-level policy gradients fall short on complex reasoning tasks, but treating sequences of tokens as unified actions can significantly boost performance on mathematical and coding benchmarks.