Even the best large vision-language models struggle with multi-image reasoning, scoring only 50% on a new benchmark designed to challenge their capabilities.
Social intelligence may be a qualitatively different beast from analytical reasoning: a 7B model trained with SAVOIR beats GPT-4o and Claude-3.5-Sonnet on social interaction, while large reasoning models lag behind.
STRATAGEM reveals that selectively reinforcing reasoning trajectories can dramatically enhance a model's ability to transfer reasoning skills across diverse tasks, especially in complex mathematical scenarios.
LLMs struggle to automatically apply learned procedures or avoid failed actions without explicit reminders, achieving only up to 66% on a new implicit memory benchmark.