By progressively refining the reward signal based on the distribution of model confidence, DistriTTRL achieves significant RL performance gains, better aligning the model's internal confidence information between training and test time and mitigating reward hacking.
Instead of directly aligning to a flawed pseudo-source domain in test-time adaptation, a semantic bridge approach significantly boosts performance by first rectifying the pseudo-source using universal semantics.
By modeling the distribution of confidence scores, DistriVoting significantly boosts the accuracy of large reasoning models, outperforming existing confidence-based selection methods across diverse benchmarks.
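To make the idea behind distribution-based selection concrete, here is a minimal illustrative sketch (not the paper's actual method — the function name, weighting rule, and data are all hypothetical): each sampled answer's confidence is standardized against the batch's confidence distribution, and votes are weighted accordingly, so an answer backed by confidently-scored samples beats a plain majority of low-confidence ones.

```python
from collections import defaultdict
from statistics import mean, pstdev

def distribution_weighted_vote(samples):
    """Pick an answer from (answer, confidence) pairs, weighting each
    vote by how far its confidence sits above the batch mean.
    Hypothetical sketch of distribution-aware selection."""
    confs = [c for _, c in samples]
    mu, sigma = mean(confs), pstdev(confs)
    scores = defaultdict(float)
    for answer, conf in samples:
        # Weight = 1 + standardized confidence, floored at a small
        # positive value so every sample still casts some vote.
        z = (conf - mu) / sigma if sigma > 0 else 0.0
        scores[answer] += max(1.0 + z, 0.1)
    return max(scores, key=scores.get)

votes = [("42", 0.91), ("42", 0.88), ("17", 0.35), ("42", 0.90), ("17", 0.40)]
print(distribution_weighted_vote(votes))  # prints "42"
```

Unlike picking the single highest-confidence sample, this aggregates evidence across the whole batch, so one spuriously confident outlier cannot override a cluster of consistently confident answers.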