Reasoning models are surprisingly bad at controlling their own thoughts: Claude Sonnet 4.5 can control its chain-of-thought only 2.7% of the time, raising questions about the reliability of CoT monitoring.
Forget fine-tuning: inject targeted time-series insights into general LLMs and watch their time-series reasoning performance climb by up to 26%.
Turns out, Claude 3.5 Sonnet and o4-mini are surprisingly good at geospatial tasks, outperforming even GPT-4.1 and Gemini 2.5 Pro Preview on a new benchmark for tool-calling LLMs.