RLearner-LLM addresses the logical alignment gap in DPO-trained LLMs by introducing Hybrid-DPO, which combines an NLI signal from DeBERTa-v3 with a verifier LLM score to mitigate verbosity bias. This automated preference pipeline improves logical correctness without human annotation, achieving up to 6x NLI improvement over SFT models across five academic domains and three base architectures. Compact models such as Gemma 4 E4B-it also see NLI gains (in four of five domains) and faster inference under Hybrid-DPO. The results support logic-aware metrics over LLM judges for evaluating knowledge-intensive generation.
LLMs can reach up to 6x higher NLI entailment than their SFT baselines without human feedback by fusing NLI scores into the DPO preference pipeline.
Direct Preference Optimization (DPO), the efficient alternative to PPO-based RLHF, falls short on knowledge-intensive generation: standard preference signals from human annotators or LLM judges exhibit a systematic verbosity bias that rewards fluency over logical correctness. This blind spot leaves a logical alignment gap: SFT models reach NLI entailment of only 0.05-0.22 despite producing fluent text. We propose RLearner-LLM with Hybrid-DPO, an automated preference pipeline that fuses a DeBERTa-v3 NLI signal with a verifier LLM score, removing human annotation while overcoming the "alignment tax" of single-signal optimization. Evaluated across five academic domains (including Biology, Medicine, and Law) with three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it), RLearner-LLM yields up to 6x NLI improvement over SFT, with NLI gains in 11 of 15 domain-architecture cells and consistent answer-coverage gains. On Gemma 4 E4B-it (4.5B effective params), Hybrid-DPO lifts NLI in four of five domains (+11.9% to +2.4x) with faster inference across all five, showing that the approach scales down to compact base models without losing the alignment-tax mitigation. Our Qwen3-8B RLearner-LLM wins 95% of pairwise comparisons against its own SFT baseline, while GPT-4o-mini in turn wins 95% against our concise output. Together with the 69% win rate the same judge gives a verbose SFT model over our DPO model, this replicates verbosity bias on a frontier comparator and motivates logic-aware metrics (NLI, ACR) over LLM-as-a-judge for knowledge-intensive generation.
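To make the idea of fusing two preference signals concrete, the following is a minimal sketch of how a hybrid score could select the chosen and rejected responses for a DPO pair. The abstract does not state how the NLI and verifier signals are combined, so the linear weighting, the `nli_weight` and `min_margin` parameters, and all names here (`Candidate`, `hybrid_score`, `build_preference_pair`) are illustrative assumptions rather than the authors' method. The two scores are assumed to be precomputed and scaled to [0, 1].

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One sampled response with two precomputed scores (assumed in [0, 1])."""
    text: str
    nli_entailment: float   # e.g. entailment probability from an NLI model
    verifier_score: float   # e.g. normalized rating from a verifier LLM


def hybrid_score(c: Candidate, nli_weight: float = 0.5) -> float:
    # ASSUMPTION: a simple convex combination; the source does not specify
    # the actual fusion rule used by Hybrid-DPO.
    return nli_weight * c.nli_entailment + (1.0 - nli_weight) * c.verifier_score


def build_preference_pair(
    prompt: str,
    candidates: Sequence[Candidate],
    nli_weight: float = 0.5,
    min_margin: float = 0.1,
) -> Optional[dict]:
    """Pick the best and worst candidate by hybrid score.

    Returns None when the score gap is below `min_margin` (a hypothetical
    filter for ambiguous pairs) or when fewer than two candidates are given.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: hybrid_score(c, nli_weight), reverse=True)
    best, worst = ranked[0], ranked[-1]
    margin = hybrid_score(best, nli_weight) - hybrid_score(worst, nli_weight)
    if margin < min_margin:
        return None
    # prompt/chosen/rejected is the layout commonly used for DPO datasets.
    return {"prompt": prompt, "chosen": best.text, "rejected": worst.text}


# A concise, well-entailed answer versus a fluent but poorly entailed one.
concise = Candidate("Concise, logically supported answer.", 0.9, 0.6)
verbose = Candidate("Long, fluent, weakly supported answer.", 0.1, 0.8)

hybrid_pair = build_preference_pair("Q?", [concise, verbose], nli_weight=0.5)
verifier_only_pair = build_preference_pair("Q?", [concise, verbose], nli_weight=0.0)
```

With these made-up scores, the hybrid signal prefers the concise answer, while a verifier-only signal (`nli_weight=0.0`) prefers the verbose one. This is the kind of single-signal verbosity bias the abstract describes, shown here on toy numbers only.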