Kyoto University
Models trained with reinforcement learning from verifiable rewards (RLVR) exhibit a surprising deductive-over-abductive reasoning asymmetry, even in controlled environments, suggesting a fundamental challenge in current RLVR approaches.
Forget comparing models with benchmarks – mapping them by prompt-response likelihoods reveals hidden relationships between architecture, training data, and even how prompts compose.
Forget expensive distillation – aligning language models can be as simple as carefully choosing the right mix of pretraining data based on log-likelihood differences.