The paper introduces GSU, a text-based grid dataset designed to evaluate LLMs' spatial reasoning across navigation, object localization, and structure composition tasks. By removing visual inputs, the authors isolate spatial reasoning from perception and demonstrate that while LLMs understand basic grid concepts, they struggle with embodied frames of reference and with identifying 3D shapes from coordinates. Fine-tuning smaller models on GSU shows promise in matching the performance of larger frontier models, suggesting a path toward specialized embodied agents.
LLMs struggle with spatial reasoning in embodied settings and 3D structure identification even when exposed to visual modalities, but fine-tuning smaller models offers a surprisingly effective alternative to brute-force scaling.
We introduce GSU, a text-only grid dataset for evaluating the spatial reasoning capabilities of LLMs across three core tasks: navigation, object localization, and structure composition. By forgoing visual inputs, we isolate spatial reasoning from perception and show that while most models grasp basic grid concepts, they struggle with frames of reference relative to an embodied agent and with identifying 3D shapes from coordinate lists. We also find that exposure to a visual modality does not give VLMs a generalizable understanding of 3D space that they can apply to these tasks. Finally, we show that while the very latest frontier models can solve the provided tasks (though harder variants may still stump them), both fully fine-tuning a small LM and LoRA fine-tuning a small LLM show potential to match frontier-model performance, suggesting an avenue for specialized embodied agents.