LMs encode grammaticality as a distinct feature in their hidden representations, separable from raw string probability and generalizable across languages.
Even the best LLMs still stumble on Olympiad-level math, and retrieval quality is the bottleneck for retrieval-augmented problem solving, according to the new MathNet benchmark.
LLMs play favorites: GPT-5-nano's likelihood of agreeing with incorrect statements varies significantly with the perceived race, age, gender, and confidence of the user.
Stop wasting time wrestling with incompatible transportation datasets: Ozone slashes experiment setup by 85% and boosts cross-city transfer of safety models by 91%.
Gaze-tracking unlocks a new level of personalized AI assistance, enabling LLMs to infer user cognitive states and boost recall performance.
Soft-gating with an "advisor" model can steer LLMs to be safer and more useful, reducing over-refusal without sacrificing detection accuracy.
LLM agent skills, despite their promise, often fail in realistic settings, with performance plummeting to no-skill baselines when agents must retrieve skills from a large, uncurated collection.
Particle filter models of sentence processing inherently predict "digging-in" effects, where disambiguation difficulty increases with the length of the ambiguous region, a phenomenon not captured by surprisal-based models.
Fine-tuning unlocks LLMs' surprising ability to predict how memorable a sentence is and how long it takes to read, exceeding traditional methods.
Building a complete web application from scratch remains a surprisingly hard task for even the best AI models, with top performance at only 58% accuracy on a new end-to-end benchmark.
Forget simulated manipulation: ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
VLMs are nowhere near human-level general intelligence: they score less than 10% of human performance across a diverse set of human-designed games, especially struggling with world-model learning, memory, and planning.
HybridRAG-Bench reveals that existing benchmarks overestimate the reasoning abilities of retrieval-augmented LLMs due to contamination, offering a more realistic evaluation using up-to-date scientific knowledge.