Using a top- or bottom-performing LLM as the anchor in "LLM-as-a-judge" benchmarks can dramatically skew results; choosing a mid-performing anchor turns out to be key to reliable evaluation.
Particle filter models of sentence processing inherently predict "digging-in" effects—where disambiguation difficulty increases with the length of the ambiguous region—a phenomenon not captured by surprisal-based models.
Fine-tuning unlocks LLMs' surprising ability to predict how memorable a sentence is and how long it takes to read, outperforming traditional methods.
Building a complete web application from scratch remains a surprisingly hard task for even the best AI models, with top performance at only 58% accuracy on a new end-to-end benchmark.
LLMs that ace static code-fixing benchmarks may still struggle to maintain code quality over the long, iterative haul of real-world software development.
Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
Predict how well your LLM will transfer to a new domain *before* fine-tuning, by using sparse autoencoders to spot tell-tale signs of domain shift in the model's representations.
LLMs struggle to reliably predict numerical materials properties, even after fine-tuning, and their performance fluctuates wildly over time, casting doubt on their use in high-stakes scientific applications.
Agentic AI can automate complex optical systems control with near-perfect success rates, leaving code-generation approaches in the dust.
Randomly initialized encoders can match state-of-the-art pre-trained models on many ECG representation learning tasks, suggesting current benchmarks are misleading.
VLMs are nowhere near human-level general intelligence: they score less than 10% of human performance across a diverse set of human-designed games, especially struggling with world-model learning, memory, and planning.
LLM benchmark accuracy jumps 10% when evaluated on a cleaned-up version of Humanity's Last Exam, highlighting the significant impact of dataset noise on performance metrics.
HybridRAG-Bench reveals that existing benchmarks overestimate the reasoning abilities of retrieval-augmented LLMs due to contamination, offering a more realistic evaluation using up-to-date scientific knowledge.
GPT-5's real-time router learns to route queries to specialized models, making it faster and more useful than its predecessors.
Despite progress in AI safety, how well current safeguards actually prevent AI harms remains largely unmeasured, and what evidence exists shows their effectiveness varies wildly.