Unlock richer, more realistic agent simulations by moving beyond individual personas to unified group representations that capture collective behavior.
Medical AI Scientist leapfrogs generic LLMs in clinical research, generating higher-quality, evidence-backed hypotheses and manuscripts that rival top-tier medical publications.
AI-mediated video calls erode trust and confidence, even though they don't actually make people worse at spotting lies.
Automating surgical patient triage with an LLM achieves 94% sensitivity, but discrepancies reveal more about clinical workflow gaps than AI errors.
Transformer LMs learn linguistic abstractions before memorizing specific lexical items, mirroring key aspects of human language acquisition.
Educators in Hawai'i envision AI auditing tools that trace the genealogy of knowledge, highlighting the need for community-centered approaches to address cultural misrepresentation in AI.
LLMs' chain-of-thought reasoning often falls apart due to factual incompleteness, with errors compounding across multiple hops, as revealed by a new multi-hop QA dataset.
Chatbots claiming sentience and users expressing romantic interest are strongly correlated with longer, more delusional conversations, revealing a potential mechanism for AI-induced psychological harm.
Most AI failures aren't the spectacular kind, but silent breakdowns in interaction that will persist even as models get smarter.
Impose stochastic order constraints on multiple discrete unimodal distributions to improve estimation accuracy by up to 6.3% when data is scarce.
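The ordering constraint is easy to prototype. Below is a minimal sketch, assuming a simple pointwise max/min projection of the two empirical CDFs (valid because the pointwise max/min of CDFs is still a CDF); the paper's estimator also enforces unimodality, which this sketch omits.

```python
import numpy as np

def enforce_stochastic_order(counts_lo, counts_hi):
    """Project two empirical PMFs so the first is stochastically smaller
    (its CDF dominates pointwise). Pointwise max/min of valid CDFs is
    still a valid CDF, so the fix is well defined. The paper's estimator
    also enforces unimodality, which this sketch omits."""
    F_lo = np.cumsum(counts_lo / counts_lo.sum())
    F_hi = np.cumsum(counts_hi / counts_hi.sum())
    F_lo_adj = np.maximum(F_lo, F_hi)  # stochastically smaller: larger CDF
    F_hi_adj = np.minimum(F_lo, F_hi)
    return np.diff(F_lo_adj, prepend=0.0), np.diff(F_hi_adj, prepend=0.0)

# toy example: scarce counts over 5 ordinal categories
pmf_lo, pmf_hi = enforce_stochastic_order(np.array([2., 3., 2., 2., 1.]),
                                          np.array([3., 1., 2., 2., 2.]))
print(pmf_lo.round(2), pmf_hi.round(2))
```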
AI can generate realistic legal questions, but current models still struggle with limited diversity and excessive agreeableness, revealing critical gaps in their ability to simulate adversarial legal reasoning.
Replaying generic pre-training data during fine-tuning boosts target task performance by up to 2x, challenging the common practice of minimizing its use.
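The replay idea is simple to prototype. Here is a minimal sketch, assuming random interleaving at a fixed replay ratio; the paper's actual schedule and ratios may differ.

```python
import random

def replay_stream(task_data, pretrain_data, replay_ratio=0.5, steps=8, seed=0):
    """Interleave target-task examples with replayed generic pre-training
    examples. `replay_ratio` is the fraction of steps drawn from the
    pre-training pool; the paper's actual schedule is an assumption here."""
    rng = random.Random(seed)
    for _ in range(steps):
        pool = pretrain_data if rng.random() < replay_ratio else task_data
        yield rng.choice(pool)

# toy usage
for ex in replay_stream(["task_a", "task_b"], ["generic_1", "generic_2"]):
    print(ex)
```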
Grounding reward learning in natural language rationales makes policies 2x more robust to spurious correlations and distribution shifts.
By strategically warming up residual connections layer-by-layer, ProRes unlocks faster and more stable pretraining for language models.
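As a rough illustration of layer-wise residual warmup (the exact ProRes schedule and parameterization are assumptions here), each block's residual branch can be gated by a ramp that opens earlier for shallower layers:

```python
import torch
import torch.nn as nn

class WarmedResidualBlock(nn.Module):
    """Residual block whose branch is scaled by a gate that ramps 0 -> 1,
    staggered by depth so earlier layers open first. The exact ProRes
    schedule and parameterization are assumptions here."""
    def __init__(self, d_model, layer_idx, n_layers, warmup_steps=1000):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.layer_idx, self.n_layers = layer_idx, n_layers
        self.warmup_steps = warmup_steps

    def gate(self, step):
        start = self.layer_idx / self.n_layers * self.warmup_steps
        return min(max((step - start) / self.warmup_steps, 0.0), 1.0)

    def forward(self, x, step):
        # identity path is always on; the residual branch warms up
        return x + self.gate(step) * self.ff(x)

block = WarmedResidualBlock(d_model=64, layer_idx=3, n_layers=12)
y = block(torch.randn(2, 16, 64), step=500)
```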
Forget expert surveys: GPT-4.1-nano can predict the difficulty of data visualization test questions with surprisingly high accuracy, especially when combining visual and textual cues.
Forget OCR? Powerful MLLMs can extract information from business documents directly from page images, matching traditional OCR pipelines and challenging their necessity.
Model handoffs in multi-turn LLM systems can swing performance by up to 13 percentage points, revealing a hidden reliability risk that single-model benchmarks miss.
A simple "think step-by-step" prompt unlocks surprisingly better world knowledge recall in reasoning LMs, suggesting they're under-optimized for accessing their own parametric knowledge.
An interactive AI evaluator can assess skills fairly across diverse self-presentation styles, producing equitable outcomes whether individuals lean toward self-promotion or modesty.
Sticking to a single HTML-to-text extractor in your LLM pretraining pipeline could be leaving 71% of the data on the table.
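Running an ensemble of extractors is cheap to try. A minimal sketch using two common extractors, with a crude longest-output selection rule (the paper's actual pipeline and selection rule are assumptions; further extractors such as readability or resiliparse follow the same pattern):

```python
import trafilatura
from bs4 import BeautifulSoup

def extract_candidates(html: str) -> dict:
    """Run more than one HTML-to-text extractor and keep every non-empty
    output. The paper's selection rule is an assumption here."""
    outputs = {}
    text = trafilatura.extract(html)  # main-content extraction
    if text:
        outputs["trafilatura"] = text
    raw = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    if raw:
        outputs["bs4"] = raw          # naive full-page text
    return outputs

def best_extraction(html: str):
    # crude heuristic: prefer the longest non-empty candidate
    cands = extract_candidates(html)
    return max(cands.values(), key=len) if cands else None
```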
LLMs can turn sparse rewards into dense training signals for RL agents, achieving comparable performance with significantly higher sample efficiency.
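The shaping step itself is a one-liner once you have an LLM scorer. A minimal sketch, where `llm_progress_score` and the `BETA` weight are hypothetical stand-ins rather than the paper's exact design:

```python
BETA = 0.1  # shaping weight (assumed hyperparameter)

def llm_progress_score(state_description: str) -> float:
    """Hypothetical stand-in for an LLM call that rates task progress in
    [0, 1] from a textual description of the current state."""
    return 0.5  # placeholder; a real system would query an LLM here

def shaped_reward(sparse_reward: float, state_description: str) -> float:
    # dense signal: environment's sparse reward plus LLM-estimated progress
    return sparse_reward + BETA * llm_progress_score(state_description)

print(shaped_reward(0.0, "agent holding key, door still locked"))
```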
Forget single-objective optimization—this work cracks omniprediction in multiclass settings, opening the door to algorithms that are robust across diverse loss functions and comparator classes.
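For readers new to the concept, the standard omniprediction guarantee (stated here from the binary-case literature; the multiclass version generalizes the predictor to a distribution over labels) is roughly:

```latex
% (L, C, eps)-omnipredictor: one predictor f, post-processed per loss,
% competes with the best comparator in C for every loss in L.
\forall \ell \in \mathcal{L}:\quad
\mathbb{E}\big[\ell\big(y,\, k_\ell(f(x))\big)\big]
\;\le\; \min_{c \in \mathcal{C}} \mathbb{E}\big[\ell\big(y,\, c(x)\big)\big] + \varepsilon
```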
Generative AI demands a reimagining of K-12 computational thinking curricula to encompass AI literacy and address algorithmic bias, building on a decade of computing education experience.
LLM-generated data can provide statistically valid causal effect estimates in social science, but only if you calibrate the simulations with real human data.
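As one concrete (and deliberately simple) form such calibration could take, a shared affine map can align pooled simulated outcomes with a small human pilot sample before estimating effects; this is a generic calibration step, not necessarily the paper's procedure:

```python
import numpy as np

def fit_affine(sim_pooled, human_sample):
    """Fit a shared affine map so pooled simulated outcomes match the
    mean and spread of a small human pilot sample. A generic calibration
    step, not necessarily the paper's procedure."""
    a = human_sample.std() / (sim_pooled.std() + 1e-12)
    b = human_sample.mean() - a * sim_pooled.mean()
    return a, b

rng = np.random.default_rng(0)
sim_treat = rng.normal(0.9, 1.5, 500)   # LLM-simulated treated outcomes
sim_ctrl  = rng.normal(0.2, 1.5, 500)   # LLM-simulated control outcomes
human     = rng.normal(0.0, 1.0, 50)    # small real pilot sample

a, b = fit_affine(np.concatenate([sim_treat, sim_ctrl]), human)
ate = (a * sim_treat + b).mean() - (a * sim_ctrl + b).mean()
print(round(ate, 3))  # calibrated effect estimate
```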
By reusing existing data mixture ratios and only recomputing for affected domains, Olmix slashes compute costs by 74% without sacrificing downstream task performance during iterative LM development.
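The reuse-and-renormalize step can be sketched in a few lines. Here `fit_ratios_for` is a hypothetical optimizer over just the changed domains, and the split of probability mass is an assumption; Olmix's actual procedure is more involved.

```python
def update_mixture(old_ratios, changed_domains, fit_ratios_for):
    """Keep mixture ratios for untouched domains; re-optimize only the
    changed ones and give them the remaining probability mass.
    `fit_ratios_for` is a hypothetical optimizer returning ratios that
    sum to 1 over the changed domains."""
    kept = {d: r for d, r in old_ratios.items() if d not in changed_domains}
    remaining = 1.0 - sum(kept.values())
    new = fit_ratios_for(changed_domains)
    mixture = dict(kept)
    mixture.update({d: remaining * r for d, r in new.items()})
    return mixture

old = {"web": 0.6, "code": 0.3, "math": 0.1}
print(update_mixture(old, {"code"}, lambda ds: {d: 1.0 / len(ds) for d in ds}))
```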
You can now detect harmful memes with 17% better accuracy and understand *why* they're toxic, thanks to a new framework that injects cultural context and explains its reasoning.
A fine-tuned open-source Mistral-7B model rivals GPT-4 Turbo in extracting clinical history elements from imaging orders, offering a cost-effective and accurate alternative for assessing clinical history completeness.