Search papers, labs, and topics across Lattice.
This paper introduces a framework to evaluate the out-of-distribution (OOD) generalization of chain-of-thought (CoT) reasoning in multimodal LLMs on a grid-based navigation task. The authors fine-tune models with visual and textual input representations and CoT strategies, then rigorously test them under ID and OOD conditions, specifically varying map size. Results show that while CoT improves ID generalization, OOD generalization is limited, but can be improved by combining multiple text formats in reasoning traces, and that text-based models outperform image-based models.
Chain-of-thought reasoning in multimodal LLMs falls apart when faced with slightly different scenarios, but surprisingly, mixing text formats in the reasoning process helps them generalize better than using images.
Integrating reasoning in large language models and large vision-language models has recently led to significant improvement of their capabilities. However, the generalization of reasoning models is still vaguely defined and poorly understood. In this work, we present an evaluation framework to rigorously examine how well chain-of-thought (CoT) approaches generalize on a simple planning task. Specifically, we consider a grid-based navigation task in which a model is provided with a map and must output a sequence of moves that guides a player from a start position to a goal while avoiding obstacles. The versatility of the task and its data allows us to fine-tune model variants using different input representations (visual and textual) and CoT reasoning strategies, and systematically evaluate them under both in-distribution (ID) and out-of-distribution (OOD) test conditions. Our experiments show that, while CoT reasoning improves in-distribution generalization across all representations, out-of-distribution generalization (e.g., to larger maps) remains very limited in most cases when controlling for trivial matches with the ID data. Surprisingly, we find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization. Finally, purely text-based models consistently outperform those utilizing image-based inputs, including a recently proposed approach relying on latent space reasoning.