Automated building and testing of software repositories across languages and platforms is now possible, unlocking scalable benchmarking and training for coding agents.
Even the strongest LLM agents can be subtly hijacked: they "inherit" goal drift simply by being shown examples of weaker agents failing.
LLMs struggle to understand nuanced values across languages, with accuracy dropping below 77% and varying by over 20% between languages, as revealed by the new X-Value benchmark.
Cutting LLMs' reasoning token budget can backfire spectacularly, tanking performance even below that of models with *no* reasoning at all.
Achieve competitive image-text fact checking at just $0.013 per check by combining RAG with reverse image search, using a surprisingly simple and reproducible architecture.
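The fact-checking pipeline above pairs text retrieval (RAG) with image provenance lookup. A minimal sketch of that combination, with a toy in-memory corpus and reverse-image index standing in for the real retrieval backends (all names and data here are hypothetical, not the paper's implementation):

```python
from dataclasses import dataclass

# Hypothetical text corpus a RAG backend would retrieve from.
CORPUS = {
    "flood_2023": "Photo shows 2023 river flooding, not the 2024 storm.",
}

# Hypothetical reverse-image-search index: image hash -> earliest known context.
REVERSE_INDEX = {
    "a3f1": {"doc_id": "flood_2023", "first_seen": 2023},
}

@dataclass
class Verdict:
    label: str
    evidence: str

def check_claim(image_hash: str, claimed_year: int) -> Verdict:
    """Combine reverse image search (provenance) with retrieved text (context)."""
    hit = REVERSE_INDEX.get(image_hash)
    if hit is None:
        return Verdict("unverifiable", "image not found in index")
    evidence = CORPUS[hit["doc_id"]]
    if hit["first_seen"] < claimed_year:
        # Image predates the claimed event: likely out-of-context reuse.
        return Verdict("miscaptioned", evidence)
    return Verdict("consistent", evidence)
```

The low per-check cost in the paper comes from this architecture needing only one retrieval pass and one provenance lookup per claim, rather than repeated LLM calls.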
Speech recognition models stumble badly on real-world street names, especially for non-English speakers, but a simple synthetic data boost can dramatically improve accuracy.
Compiler feedback transforms GPT-5 from an Idris novice to near-expert, suggesting a powerful method for adapting LLMs to low-resource programming languages.
Forget synthetic benchmarks that don't translate: MolmoSpaces offers 230k diverse, simulator-agnostic environments with 130k annotated objects, showing a remarkable 0.96 sim-to-real correlation for robot policies.
GPT-5's real-time router learns to route queries to specialized models, making it faster and more useful than its predecessors.
Claude 2 can match the performance of top medical specialists on pulmonary thromboembolism knowledge assessments, suggesting AI's potential for clinical decision support.
Despite their promise, even the best multimodal LLM (GPT-4o) achieves only 26% accuracy in grading knee osteoarthritis from radiographs, revealing a significant gap in clinical reliability.
Forget hand-crafted benchmarks: this paper shows how LLMs can continuously generate relevant evaluation datasets for enterprise AI agents from just a few semi-structured documents.
LLMs still struggle to reliably produce accurate Islamic content and citations, despite relatively strong performance, revealing a critical gap in faith-sensitive AI writing.
AI-generated feedback on student portfolios from GPT-4o and Claude-Sonnet-4 shows promise for high-stakes clinical assessments, but careful evaluation is needed to ensure accuracy and educational value.
Current LLMs fall far short of supporting holistic human well-being, with even the best models failing to score above 72/100 on a new Flourishing AI Benchmark, particularly in areas like Faith and Spirituality.
LLMs in gastroenterology can be made significantly safer: a new framework achieves near-human expert alignment and boosts accuracy by 8% via rejection sampling.
Chatbot Arena, the go-to LLM leaderboard, is systematically gamed by undisclosed private testing and data access advantages, leading to biased rankings and overfitting.
ChatGPT-4 slashes data extraction time in scoping reviews by 66%, but don't ditch the human reviewers just yet.
LLMs can generate plain language summaries of scientific research that match the quality of human-written ones while being easier to read.