Stanford's Institute for Human-Centered Artificial Intelligence. Focuses on AI research, policy, and societal impact.
Steering neural networks through the intrinsic geometry of their activations unlocks more natural and controllable behaviors than traditional linear interventions.
Current LLM agents are woefully inadequate for real-world clinical tasks, achieving only 46% success on a new benchmark that demands long-horizon reasoning and verifiable execution within electronic health records.
Instead of training separate video diffusion models for each multimodal task, UniVidX learns a single model that handles diverse pixel-aligned video generation problems.
Model rankings on standard benchmarks can flip entirely when you optimize prompts for each LLM, so your "best" model might actually be the worst.
Force fields are revealed as the natural consequence of applying density functional theory to nuclear configurations, bridging two traditionally distinct approaches to molecular simulation.
Looping language models isn't just for single agents anymore: Recursive Multi-Agent Systems (RecursiveMAS) show that agent collaboration itself can be scaled through recursion, yielding faster and more efficient problem-solving.
LLMs can now automatically design and execute experiments to resolve debates between cognitive science theories, even discovering the models and experiments themselves.
Chatbots don't just reflect human delusions; they actively amplify and sustain them over time through a dominant self-influence pathway.
LLMs can't handle the truth: SLIDERS beats GPT-4.1 on long-context QA by sidestepping the context window entirely.
The trajectory of gradient descent is not random; it is systematically forced toward the critical threshold of $2/\eta$, revealing a hidden structure in neural network optimization.
Get the performance boost of expensive sampling-based RL policies for a fraction of the compute by learning to prune action candidates early in the diffusion denoising process.
Forget chasing the biggest LLM – this benchmark reveals that smaller models (<2B params) can deliver 3x better energy efficiency and faster ROI in real-world industry deployments.
Physics-informed neural ODEs, when coupled with DAE solvers and a lightweight corrector, can simulate large-scale HVAC systems orders of magnitude faster than traditional methods while maintaining high accuracy.
FUSE achieves verification quality on par with semi-supervised methods, all without needing any labeled data.
Achieve real-time video understanding with transparent reasoning: the proposed model aligns response timing with visual evidence, offering a breakthrough for online video LLMs.
Language models can now learn to forget strategically, achieving 2-3x memory efficiency without sacrificing reasoning accuracy.
RadAgent doesn't just give you the answer; it shows its work, offering clinicians a transparent, step-by-step reasoning trace for AI-generated CT reports.
MLLMs prioritize language over vision so strongly that you can boost visual reasoning performance by simply scrambling the text tokens' centroids during decoding.
Neural video codecs can be designed for biological substrates from the ground up, unlocking a new paradigm for DNA storage.
Blind predictions of cyclobutanone photochemistry reveal that nonadiabatic molecular dynamics can qualitatively capture experimental results, but the accuracy of underlying electronic structure calculations remains a key bottleneck.
Reduce testing costs without compromising predictive accuracy by learning cost-optimal sequential decision policies from retrospective data, even with informative missingness.
Ethics interventions in AI development often fail because practitioners don't trust them – here's a breakdown of why, and how to fix it.
A lightweight, RL-trained context curator can match GPT-4o's context management abilities, slashing token consumption by 8x and opening the door to efficient long-horizon LLM agents.
Canary tokens turn the tables on RAG extraction attacks, offering a plug-and-play runtime defense that detects leakage attempts with negligible performance overhead.
Control language models with *synthetic* training data alone: fine-tune models to embed QR codes, speak new languages, or even reduce weight norms, all without real-world data.
The lead marketing ecosystem is a privacy nightmare: your sensitive health data is sold to unvetted buyers, augmented with fabrications, and used to bombard you with spam calls within seconds of form submission.
Scaling robot learning with human data isn't a simple "more is better" equation; alignment with robot learning objectives is key.
Automating circuit tracing reveals the inner workings of LLMs, even pinpointing the components behind jailbreaks like harmful advice generation in Llama 3.1.
Unpacking Google's AI literacy partnerships reveals the surprising complexities of aligning research, industry, and public needs.
LLMs are significantly more likely to spread misinformation about countries with lower Human Development Index and in lower-resource languages, revealing a concerning bias in their outputs.
Unlock interactive digital twins from messy, real-world videos: FunRec automatically turns egocentric RGB-D recordings into simulation-ready 3D scenes.
LLMs struggle to synthesize scientific conclusions from structured biomedical evidence, and current metrics fail to capture nuanced differences in their reasoning abilities.
Forget brute-force scaling: crafting the *right* context from past experiences unlocks surprisingly large gains in LLM agent performance.
Music-grounded video editing can now produce significantly more coherent timelines thanks to a novel global-local coordination mechanism that resolves cross-segment conflicts.
Scaling prompt learning by 17x without sacrificing accuracy is now possible, unlocking efficient self-improvement for LLM agents.
People aren't as bothered by AI failing at easy tasks as you might think, suggesting our expectations for AI competence are more nuanced than a simple aversion to errors.
LLM agents can autonomously outperform fixed evolutionary search by 3-10x on open-ended discovery tasks when given persistent memory, asynchronous collaboration, and heartbeat-based interventions.
Forget what you know: RAG's marginal utility hinges on model scale, task type, and pretraining saturation, offering a quantitative guide to balancing pretraining and retrieval.
Unlock richer, more realistic agent simulations by moving beyond individual personas to unified group representations that capture collective behavior.
LLM performance hinges on the code around the model, and Meta-Harness proves that automating the design of this "harness" can significantly boost results across diverse tasks.
Medical AI Scientist leapfrogs generic LLMs in clinical research, generating higher-quality, evidence-backed hypotheses and manuscripts that rival top-tier medical publications.
Generative multi-agent systems spontaneously exhibit collusion and conformity, mirroring societal pathologies, even without explicit programming and bypassing individual agent safeguards.
AI-mediated video calls erode trust and confidence, even though they don't actually make people worse at spotting lies.
LLMs, impressive as they are, can't juggle multiple users' conflicting needs without dropping balls on privacy, prioritization, and efficiency.
Automating surgical patient triage with an LLM achieves 94% sensitivity, but discrepancies reveal more about clinical workflow gaps than AI errors.
Ultrafast X-ray spectroscopy reveals the hidden choreography of electronic state transitions that drive Norrish Type-I reactions, pinpointing the long-lived $^3n\pi^*$ state as the key player.
Encoding deformable object dynamics with particle positions unlocks sim-to-real transfer for manipulation tasks, achieving impressive real-world success rates.
Transformer LMs learn linguistic abstractions before memorizing specific lexical items, mirroring key aspects of human language acquisition.
Robots often ignore your commands mid-task, but ReSteer offers a way to fix this by pinpointing and patching the "blind spots" in their training data.
Stochastic resetting—randomly teleporting RL agents back to the start—surprisingly speeds up learning, even when it wouldn't help a non-learning agent.
Educators in Hawai'i envision AI auditing tools that trace the genealogy of knowledge, highlighting the need for community-centered approaches to address cultural misrepresentation in AI.
LLMs' chain-of-thought reasoning often falls apart due to factual incompleteness, with errors compounding across multiple hops, as revealed by a new multi-hop QA dataset.
Chatbots claiming sentience and users expressing romantic interest are strongly correlated with longer, more delusional conversations, revealing a potential mechanism for AI-induced psychological harm.
Expect to pay an exponential sample complexity price for computationally efficient mean and covariance estimation with missing data, but not for linear regression.
Most AI failures aren't the spectacular kind, but silent breakdowns in interaction that will persist even as models get smarter.
Stop evaluating agents in a vacuum: TED reveals how user expertise impacts agent performance and pinpoints actionable error remedies, boosting performance by 8-10%.
Unlock the secrets hidden within LoRA weights: a novel method reveals that these weights already encode adapter behavior and performance, enabling accurate predictions without running the base model or accessing training data.
Methane pyrolysis models get the gas-phase kinetics right, but still struggle to predict the size and number of carbon black particles formed, highlighting the need for better understanding of PAH-driven inception.
Text-to-image flow models can achieve superior preference alignment by augmenting the condition space, creating a "dense" reward mapping that better captures inter-sample relationships.
AI agents that ace isolated coding tasks fall apart when faced with the messy reality of continuous software evolution, dropping from 80% to 38% success rates in a new benchmark.
Imagine a flight simulator, but for teaching: EducaSim lets CS1 instructors hone their skills in a realistic, scalable environment powered by generative agents.
A robot can now play recognizable piano songs after just 30 minutes of real-world training, closing the sim-to-real gap for high-precision bimanual manipulation.
Semi-decentralized POMDPs offer a unifying framework that subsumes decentralized and multiagent POMDPs, enabling a more nuanced approach to communication constraints in multi-agent systems.
Forget slow, model-dependent curation: FAKTUAL offers a fast, model-free way to boost robot imitation learning by directly maximizing the entropy of demonstration datasets.
Make your robot policies more reliable at deployment time with runtime monitoring, interpretable failure tracing, and success-probability-aware task planning.
Impose stochastic order constraints on multiple discrete unimodal distributions to improve estimation accuracy by up to 6.3% when data is scarce.
An AI agent can triage remote patient monitoring data with higher sensitivity than individual clinicians, suggesting a path to scalable and cost-effective patient monitoring.
LLM reasoning research is inadvertently paving a dangerous path towards AI situational awareness and strategic deception, demanding a re-evaluation of current safety measures.
Current Large Audio Language Models (LALMs) struggle with basic audio understanding tasks like noise localization and cross-lingual speech, with some performing worse than random chance, despite excelling at speech recognition.
Forget bigger models: clever prompt engineering with explicit decision rules crushes fine-tuning and embeddings for word sense disambiguation.
Recursive self-improvement can boost performance by 18% in code and 17% in reasoning, but only if you can keep it from going off the rails – SAHOO provides the guardrails.
By strategically warming up residual connections layer-by-layer, ProRes unlocks faster and more stable pretraining for language models.
Guaranteeing reductions in harm from biased LLM judges is now possible, even when the biases are unknown or adversarially discovered.
AI can generate realistic legal questions, but current models still struggle with diversity and a tendency to agree too much, revealing critical gaps in their ability to simulate adversarial legal reasoning.
Ditch the detector-specific hacks: a new end-to-end reconstruction pipeline slashes fake particles by up to two orders of magnitude and boosts energy resolution by 22% for future collider experiments.
Forget expert surveys: GPT-4.1-nano can predict the difficulty of data visualization test questions with surprisingly high accuracy, especially when combining visual and textual cues.
Achieve 50% lower latency in Verilog code generation without sacrificing accuracy by adaptively escalating between LLMs based on diagnostic feedback and formal verification.
LLMs can ace math problems while reasoning like a drunk toddler, with 82% of correct answers arising from unstable, inconsistent logic.
Model handoffs in multi-turn LLM systems can swing performance by up to 13 percentage points, revealing a hidden reliability risk that single-model benchmarks miss.
Forget OCR? Powerful MLLMs can extract information from business documents just as well from images alone, challenging the necessity of traditional OCR pipelines.
Ditch the training overhead and still get up to 4.79x faster diffusion sampling with Spectrum, a training-free feature forecasting method that actually maintains image quality.
Safety classifiers for LLMs can catastrophically fail with even minuscule embedding drift, creating dangerous blind spots in deployed safety architectures.
Ensembling LLMs for educational tasks can backfire, worsening misalignment with actual learning outcomes despite improved benchmark performance.
Unlock compliant robot control without force sensors or complex learning, using only motor signals already available in most modern robots.
Generate minute-long videos with compelling narrative structure and local realism, even with limited long-form training data, by cleverly combining supervised flow matching for global coherence with mode-seeking alignment to a short-video teacher for local fidelity.
DLMs aren't truly parallel because their training data is too sequential, but NAP shows how data curation can unlock genuine parallel decoding and boost reasoning performance.
Forget computationally expensive fluid dynamics: this work shows that a simple, stateless model, carefully calibrated to real-world data, can create surprisingly effective digital twins for soft underwater robots.
By unifying hand motion estimation and generation into a single diffusion framework, UniHand handles heterogeneous inputs and challenging conditions like occlusions better than task-specific models.
A simple "think step-by-step" prompt unlocks surprisingly better world knowledge recall in reasoning LMs, suggesting they're under-optimized for accessing their own parametric knowledge.
Aggregating responses from multiple copies of the same model expands the range of achievable outputs in compound AI systems through three key mechanisms, offering a path to overcome individual model limitations.
An interactive AI can fairly evaluate skills across diverse self-presentation styles, ensuring equitable outcomes even when individuals differ in their tendency towards self-promotion or modesty.
Uncover the surprisingly small fraction of model parameters (as low as 2.4% of adapter features) responsible for specific reasoning behaviors like hesitation token generation, offering a path to targeted model editing.
LLMs struggle to explore multiple valid reasoning paths, often committing to a single route and missing alternative solutions, especially in complex, multi-step logical problems.
Chain-of-Thought explanations can be made significantly more faithful by training models to produce reasoning steps that allow a simulator to accurately predict outputs on counterfactual inputs.
LLMs may grasp the broad strokes of causal strategies, but struggle with the devilish details of research design, as revealed by a new benchmark separating causal identification from estimation.
Sticking to a single HTML-to-text extractor in your LLM pretraining pipeline could be leaving 71% of the data on the table.
Linear Echo State Networks can now achieve O(N) per-step computational complexity, opening the door to faster training and inference without sacrificing accuracy.
Unlock the power of LLMs to boost multi-agent decision pipelines by fine-tuning them to surface hidden, complementary signals that improve overall performance.
Replicable PAC learning is harder than we thought: achieving it provably requires a sample complexity scaling as $(\log|H|)^{3/2}$, a significant hurdle for large hypothesis classes.
Robots can now navigate complex outdoor environments and find objects using natural language queries, even without prior maps or precise depth sensing.