This paper introduces Direct Consistency Optimization (DCO), a DPO-inspired reinforcement learning method, to improve crosslingual knowledge consistency in large language models. DCO optimizes the LLM directly without an explicit reward model by encouraging consistent responses to semantically equivalent prompts in different languages. Experiments demonstrate that DCO significantly improves crosslingual consistency across diverse LLMs, outperforms existing methods in multilingual training scenarios, and complements DPO when gold labels are available, while also exhibiting strong out-of-domain generalization.
Multilingual LLMs can be made significantly more reliable by directly optimizing for crosslingual consistency using a DPO-inspired method that requires no explicit reward model.
Large language models often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function, which leads to an optimal policy with consistent crosslingual responses. We introduce Direct Consistency Optimization (DCO), a method inspired by Direct Preference Optimization (DPO) that requires no explicit reward model and is derived directly from the LLM itself. Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs and outperforms existing methods when training with samples in multiple languages, while complementing DPO when gold labels are available. Additional experiments demonstrate the effectiveness of DCO in bilingual settings, significant out-of-domain generalizability, and controllable alignment via direction hyperparameters. Taken together, these results establish DCO as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs. All code, training scripts, and evaluation benchmarks are released at https://github.com/Betswish/ConsistencyRL.
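To make the notion of crosslingual consistency concrete, the following is a minimal sketch of one way it could be measured: the fraction of (question, language pair) combinations on which a model gives the same answer. This is an illustrative assumption, not the metric or the code used in the paper; the function name `crosslingual_consistency`, the input layout, and the exact-match comparison are all hypothetical.

```python
from itertools import combinations


def crosslingual_consistency(answers_by_lang):
    """Return the share of (question, language-pair) combinations that agree.

    answers_by_lang maps a language code to a list of answers, aligned so that
    index i in every list answers the same (translated) question.
    Hypothetical helper: exact string match stands in for answer equivalence.
    """
    langs = sorted(answers_by_lang)
    if len(langs) < 2:
        raise ValueError("need answers in at least two languages")

    n_questions = len(answers_by_lang[langs[0]])
    if n_questions == 0:
        raise ValueError("need at least one question")
    if any(len(answers_by_lang[lang]) != n_questions for lang in langs):
        raise ValueError("answer lists must be aligned and of equal length")

    agree = 0
    total = 0
    # Compare every unordered pair of languages on every question.
    for lang_a, lang_b in combinations(langs, 2):
        for ans_a, ans_b in zip(answers_by_lang[lang_a], answers_by_lang[lang_b]):
            agree += int(ans_a == ans_b)
            total += 1
    return agree / total


# Toy multiple-choice answers to four parallel questions (made-up data).
toy_answers = {
    "en": ["A", "B", "C", "D"],
    "de": ["A", "B", "D", "D"],
    "zh": ["A", "C", "C", "D"],
}
toy_score = crosslingual_consistency(toy_answers)
print(f"consistency: {toy_score:.3f}")
```

A score of 1.0 means the model answers every parallel question identically in all languages. Note that such a measure says nothing about correctness: a model can be perfectly consistent and still wrong, which is one reason the abstract discusses combining DCO with DPO when gold labels are available.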