The paper investigates why non-English responses sometimes outperform English ones on LLM reasoning tasks, positing that language acts as a latent variable that modulates the model's internal inference pathways. The authors introduce the Polyglot Thinking Experiment to show that leaving the response language unconstrained often yields higher accuracy, suggesting a broader latent reasoning space. Building on this insight, they propose polyGRPO, an RL framework that leverages language variation as an exploration signal, achieving significant accuracy improvements on reasoning tasks after training on multilingual math problems.
LLMs can reason better when they're not forced to answer in English, and a new RL method leverages this quirk to boost performance across reasoning tasks.
As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model's internal inference pathways, rather than merely serving as an output medium. To test this, we conducted a Polyglot Thinking Experiment, in which models were prompted to solve identical problems under language-constrained and language-unconstrained conditions. Results show that non-English responses often achieve higher accuracy, and the best performance frequently occurs when language is unconstrained, suggesting that multilinguality broadens the model's latent reasoning space. Based on this insight, we propose polyGRPO (Polyglot Group Relative Policy Optimization), an RL framework that treats language variation as an implicit exploration signal. It generates polyglot preference data online under language-constrained and unconstrained conditions, optimizing the policy with respect to both answer accuracy and reasoning structure. Trained on only 18.1K multilingual math problems without chain-of-thought annotations, polyGRPO improves the base model (Qwen2.5-7B-Instruct) by 6.72% absolute accuracy on four English reasoning test sets and by 6.89% on a multilingual benchmark. Remarkably, it is the only method that surpasses the base LLM on an English commonsense reasoning task (+4.9%), despite being trained solely on math data, highlighting its strong cross-task generalization. Further analysis reveals that treating language as a latent variable expands the model's latent reasoning space, yielding consistent and generalizable improvements in reasoning performance.
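The abstract does not spell out the update rule, but polyGRPO builds on GRPO-style optimization, whose core idea is to score each sampled response relative to the other responses in its group. A minimal sketch of that group-relative advantage computation, assuming a simple scalar reward per response (function names and the group structure here are illustrative, not the paper's implementation):

```python
def group_relative_advantages(rewards):
    """Normalize each response's reward against its sampling group,
    as in GRPO-style methods: advantage = (r - mean) / std.

    In a polyGRPO-like setup, one group could mix responses generated
    under language-constrained and language-unconstrained prompts, so
    a language that happens to reason better earns a positive advantage.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:
        # All responses scored identically: no learning signal for this group.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


# Example: two correct responses (reward 1.0) and two incorrect ones (0.0).
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses above the group mean get positive advantages and are reinforced; a degenerate group (all rewards equal) contributes no gradient, which is why reward diversity across languages can act as an exploration signal.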