
Stanford's Institute for Human-Centered Artificial Intelligence. Focuses on AI research, policy, and societal impact.
Medical AI Scientist leapfrogs generic LLMs in clinical research, generating higher-quality, evidence-backed hypotheses and manuscripts that rival top-tier medical publications.
Stop hand-coding your LLM harnesses: Meta-Harness can automatically discover harnesses that outperform state-of-the-art systems while using fewer context tokens and generalizing across models.
Unlock richer, more realistic agent simulations by moving beyond individual personas to unified group representations that capture collective behavior.
Generative multi-agent systems spontaneously exhibit collusion and conformity, mirroring societal pathologies, even without explicit programming and bypassing individual agent safeguards.
AI-mediated video calls erode trust and confidence, even though they don't actually make people worse at spotting lies.
Ultrafast X-ray spectroscopy reveals the hidden choreography of electronic state transitions that drive Norrish Type-I reactions, pinpointing the long-lived $^3n\pi^*$ state as the key player.
Automating surgical patient triage with an LLM achieves 94% sensitivity, but discrepancies reveal more about clinical workflow gaps than AI errors.
Transformer LMs learn linguistic abstractions before memorizing specific lexical items, mirroring key aspects of human language acquisition.
Encoding deformable object dynamics with particle positions unlocks sim-to-real transfer for manipulation tasks, achieving impressive real-world success rates.
Robots often ignore your commands mid-task, but ReSteer offers a way to fix this by pinpointing and patching the "blind spots" in their training data.
Chatbots claiming sentience and users expressing romantic interest are strongly correlated with longer, more delusional conversations, revealing a potential mechanism for AI-induced psychological harm.
Educators in Hawai'i envision AI auditing tools that trace the genealogy of knowledge, highlighting the need for community-centered approaches to address cultural misrepresentation in AI.
LLMs' chain-of-thought reasoning often falls apart due to factual incompleteness, with errors compounding across multiple hops, as revealed by a new multi-hop QA dataset.
Stochastic resetting—randomly teleporting RL agents back to the start—surprisingly speeds up learning, even when it wouldn't help a non-learning agent.
Expect to pay an exponential sample complexity price for computationally efficient mean and covariance estimation with missing data, but not for linear regression.
Unlock the secrets hidden within LoRA weights: a novel method reveals that these weights already encode adapter behavior and performance, enabling accurate predictions without running the base model or accessing training data.
Most AI failures aren't the spectacular kind, but silent breakdowns in interaction that will persist even as models get smarter.
Stop evaluating agents in a vacuum: TED reveals how user expertise impacts agent performance and pinpoints actionable error remedies, boosting performance by 8-10%.
Methane pyrolysis models get the gas-phase kinetics right, but still struggle to predict the size and number of carbon black particles formed, highlighting the need for better understanding of PAH-driven inception.
Text-to-image flow models can achieve superior preference alignment by augmenting the condition space, creating a "dense" reward mapping that better captures inter-sample relationships.
AI agents that ace isolated coding tasks fall apart when faced with the messy reality of continuous software evolution, dropping from 80% to 38% success rates in a new benchmark.
Imagine a flight simulator, but for teaching: EducaSim lets CS1 instructors hone their skills in a realistic, scalable environment powered by generative agents.
Forget slow, model-dependent curation: FAKTUAL offers a fast, model-free way to boost robot imitation learning by directly maximizing the entropy of demonstration datasets.
Semi-decentralized POMDPs offer a unifying framework that subsumes decentralized and multiagent POMDPs, enabling a more nuanced approach to communication constraints in multi-agent systems.
A robot can now play recognizable piano songs after just 30 minutes of real-world training, closing the sim-to-real gap for high-precision bimanual manipulation.
Impose stochastic order constraints on multiple discrete unimodal distributions to improve estimation accuracy by up to 6.3% when data is scarce.
Make your robot policies more reliable at deployment time with runtime monitoring, interpretable failure tracing, and success-probability-aware task planning.
Current Large Audio Language Models (LALMs) struggle with basic audio understanding tasks like noise localization and cross-lingual speech, with some performing worse than random chance, despite excelling at speech recognition.
LLM reasoning research is inadvertently paving a dangerous path towards AI situational awareness and strategic deception, demanding a re-evaluation of current safety measures.
An AI agent can triage remote patient monitoring data with higher sensitivity than individual clinicians, suggesting a path to scalable and cost-effective patient monitoring.
Forget bigger models: clever prompt engineering with explicit decision rules crushes fine-tuning and embeddings for word sense disambiguation.
Recursive self-improvement can boost performance by 18% in code and 17% in reasoning, but only if you can keep it from going off the rails: SAHOO provides the guardrails.
By strategically warming up residual connections layer-by-layer, ProRes unlocks faster and more stable pretraining for language models.
AI can generate realistic legal questions, but current models still struggle with diversity and a tendency to agree too much, revealing critical gaps in their ability to simulate adversarial legal reasoning.
Guaranteeing reductions in harm from biased LLM judges is now possible, even when the biases are unknown or adversarially discovered.
Forget expert surveys: GPT-4.1-nano can predict the difficulty of data visualization test questions with surprisingly high accuracy, especially when combining visual and textual cues.
Ditch the detector-specific hacks: a new end-to-end reconstruction pipeline slashes fake particles by up to two orders of magnitude and boosts energy resolution by 22% for future collider experiments.
Achieve 50% lower latency in Verilog code generation without sacrificing accuracy by adaptively escalating between LLMs based on diagnostic feedback and formal verification.
Model handoffs in multi-turn LLM systems can swing performance by up to 13 percentage points, revealing a hidden reliability risk that single-model benchmarks miss.
LLMs can ace math problems while reasoning like a drunk toddler, with 82% of correct answers arising from unstable, inconsistent logic.
Forget OCR: powerful MLLMs can extract information from business documents just as well from images alone, challenging the necessity of traditional OCR pipelines.
Ditch the training overhead and still get up to 4.79x faster diffusion sampling with Spectrum, a training-free feature forecasting method that actually maintains image quality.
Ensembling LLMs for educational tasks can backfire, worsening misalignment with actual learning outcomes despite improved benchmark performance.
Safety classifiers for LLMs can catastrophically fail with even minuscule embedding drift, creating dangerous blind spots in deployed safety architectures.
Unlock compliant robot control without force sensors or complex learning, using only motor signals already available in most modern robots.
Generate minute-long videos with compelling narrative structure and local realism, even with limited long-form training data, by cleverly combining supervised flow matching for global coherence with mode-seeking alignment to a short-video teacher for local fidelity.
Forget computationally expensive fluid dynamics: this work shows that a simple, stateless model, carefully calibrated to real-world data, can create surprisingly effective digital twins for soft underwater robots.
DLMs aren't truly parallel because their training data is too sequential, but NAP shows how data curation can unlock genuine parallel decoding and boost reasoning performance.
By unifying hand motion estimation and generation into a single diffusion framework, UniHand handles heterogeneous inputs and challenging conditions like occlusions better than task-specific models.
A simple "think step-by-step" prompt unlocks surprisingly better world knowledge recall in reasoning LMs, suggesting they're under-optimized for accessing their own parametric knowledge.
Aggregating responses from multiple copies of the same model expands the range of achievable outputs in compound AI systems through three key mechanisms, offering a path to overcome individual model limitations.
Chain-of-Thought explanations can be made significantly more faithful by training models to produce reasoning steps that allow a simulator to accurately predict outputs on counterfactual inputs.
LLMs may grasp the broad strokes of causal strategies, but struggle with the devilish details of research design, as revealed by a new benchmark separating causal identification from estimation.
An interactive AI can fairly evaluate skills across diverse self-presentation styles, ensuring equitable outcomes even when individuals differ in their tendency towards self-promotion or modesty.
LLMs struggle to explore multiple valid reasoning paths, often committing to a single route and missing alternative solutions, especially in complex, multi-step logical problems.
Uncover the surprisingly small fraction of model parameters (as low as 2.4% of adapter features) responsible for specific reasoning behaviors like hesitation token generation, offering a path to targeted model editing.
Unlock the power of LLMs to boost multi-agent decision pipelines by fine-tuning them to surface hidden, complementary signals that improve overall performance.
Linear Echo State Networks can now achieve O(N) per-step computational complexity, opening the door to faster training and inference without sacrificing accuracy.
Sticking to a single HTML-to-text extractor in your LLM pretraining pipeline could be leaving 71% of the data on the table.
Replicable PAC learning is harder than we thought: achieving it provably requires a sample complexity scaling as $(\log|H|)^{3/2}$, a significant hurdle for large hypothesis classes.
Robots can now navigate complex outdoor environments and find objects using natural language queries, even without prior maps or precise depth sensing.
XR gets real: control virtual worlds with your head and hands, not just text prompts.
Forget single-objective optimization—this work cracks omniprediction in multiclass settings, opening the door to algorithms that are robust across diverse loss functions and comparator classes.
Autonomous inspection robots can now anticipate failures and anomalies in real-time with over 90% accuracy, even before a human observer can react.
Factored world models can disentangle the dynamics of multiple interacting entities, leading to more controllable video generation and improved policy learning.
A single RL policy trained on procedurally generated tools in simulation can achieve zero-shot dexterous manipulation of diverse real-world tools, rivaling task-specific policies.
Achieve spatially faithful image-to-image translation without cross-domain supervision by bridging diffusion models with self-supervised semantic representations.
ENTRUST, a virtual-patient simulation platform, correlates with established surgical education metrics, suggesting it can be used to augment objective measurement of clinical decision-making in surgical residents.
Forget synthetic data—scaling up human egocentric video by 20x unlocks surprisingly effective dexterous robot manipulation, even transferring to robots with different hand configurations.
Generative AI demands a reimagining of K-12 computational thinking curricula to encompass AI literacy and address algorithmic bias, building on a decade of computing education experience.
3D-printed PCL-gelatin composites, particularly with 30% gelatin content, show promise for enhanced early bone regeneration in critical-size femoral defects compared to PCL alone in a rat model.
Language model capabilities are surprisingly stable over time for most tasks, except for math reasoning, which continues to advance, offering a way to reliably translate compute budgets into performance expectations.
A clinical reasoning system using curated evidence beats GPT-5 on endocrinology board exams, suggesting that domain-specific knowledge trumps raw LLM scale in specialized fields.
Uncover hidden incentives in your reward model: Obj-Disco automatically decomposes alignment rewards into human-interpretable objectives, revealing potential misalignments you might have missed.
LLM-generated data can provide statistically valid causal effect estimates in social science, but only if you calibrate the simulations with real human data.
Scaling compute reliably drives performance in high energy physics, but the right features can push the asymptotic limit even higher.
Closing the reality gap: iteratively refining a world model with real-world robot data yields a significant boost in vision-language-action policy performance.
Verification at test time can be a surprisingly effective alternative to scaling policy learning for vision-language-action alignment, yielding substantial gains in both simulated and real-world robotic tasks.
Diffusion models can now solve notoriously ill-posed inverse problems in carbon capture and storage, outperforming standard methods by an order of magnitude and even rivaling asymptotically exact methods like Rejection Sampling, but with better physical realism.
You can now detect harmful memes with 17% better accuracy and understand *why* they're toxic, thanks to a new framework that injects cultural context and explains its reasoning.
LLMs still have a long way to go in AI-aided chip design, with even the best models achieving surprisingly low scores on the new ChipBench benchmark for Verilog generation and reference model creation.
A unified Vision-Language Model and Diffusion architecture unlocks surprisingly effective optical flow forecasting from noisy web data, enabling language-conditioned robot control and video generation.
Despite progress in AI safety, it's still largely unknown how effective current safeguards are at preventing AI harms, and their effectiveness varies wildly.
A novel long-reach robot arm overcomes structural instability to thread cables with centimeter precision, unlocking new possibilities for autonomous lunar construction.
Current multi-modal LLMs struggle with the messy, real-world visual data captured by wearable devices, achieving only 24-52% accuracy on the new WearVQA benchmark.
Achieve near-optimal throughput for concurrent DNN training and inference on edge devices by intelligently time-slicing workloads and dynamically adjusting power modes, even with limited profiling.
LLMs evaluating job candidates exhibit significant bias against hedging language, docking candidates by 25.6% on average, even when the content is equivalent.
Achieve up to 39.6% FLOP reduction in LLM inference without retraining or architectural changes using QuickSilver's dynamic token-level optimizations.
Q-functions and implicit policy extraction are game-changers for batch online RL in robotics, unlocking significant performance gains over imitation-based approaches.
The HHH principle needs a serious makeover: this paper proposes a framework for dynamically prioritizing helpfulness, honesty, and harmlessness based on context, offering a more nuanced approach to AI alignment.
A fine-tuned open-source Mistral-7B model rivals GPT-4 Turbo in extracting clinical history elements from imaging orders, offering a cost-effective and accurate alternative for assessing clinical history completeness.