Sanmi Koyejo

Papers on Lattice

Total citations

Topics

h-index

Publication activity (papers/week, last 8 weeks)

Research focus

Eval Frameworks & Benchmarks (8) · Red-Teaming & Adversarial Robustness (4) · Scalable Oversight & Alignment Theory (2) · Code Generation & Program Synthesis (2)

Frequent co-authors

Anka Reuel (2) · Michael Hardy (2) · Sahasrajit Sarmasarkar (1) · Anastasia Koloskova (1)

Papers (14)

Jul 7, 2026

Stanford HAI · 6d ago

Auditing of Unlearning Algorithms

Algorithms with formal guarantees can effectively unlearn data, while many popular empirical methods fail dramatically, revealing a critical gap in current practices.

Sahasrajit Sarmasarkar, Anastasia Koloskova, Sanmi Koyejo

Eval Frameworks & Benchmarks · Red-Teaming & Adversarial Robustness · Scalable Oversight & Alignment Theory

Jul 2, 2026

Hao He +5 · 1w ago

AI Writes Faster Than Humans Can Review: A Longitudinal Study of an Enterprise 2x Mandate

AI adoption catalyzed a 109% increase in developer throughput, fundamentally reshaping the code review landscape in the process.

Hao He, Shyam Agarwal, Yegor Denisov-Blanch +3

Code Generation & Program Synthesis

Nikil Selvam +6 · 1w ago

Gaming Consensus: Coordinated Manipulation in Crowdsourced Fact-Checking

Up to 10.7% of misleading notes can be artificially elevated to consensus through coordinated user manipulation, revealing critical flaws in current fact-checking algorithms.

Nikil Selvam, Jay Baxter, Sophie Hilgard +4

Recommendation & Information Retrieval

Jun 24, 2026

Stanford HAI · 2w ago

Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

None of the 18 multimodal large language models audited are order-invariant, with flip rates revealing a staggering sensitivity to input ordering that challenges current evaluation practices.

Akshay Paruchuri, Sanmi Koyejo, Ehsan Adeli

Eval Frameworks & Benchmarks · Multimodal Models

Jun 8, 2026

Stanford HAI · Jun 8, 2026 · also ETH, Meta AI, Mila, MIT CSAIL +30

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

Systematic gaps in AI evaluation reporting are exposed, revealing inconsistencies that hinder reliable comparisons across thousands of models and benchmarks.

Avijit Ghosh, Anka Reuel, Jenny Chim +42

Eval Frameworks & Benchmarks

Stanford HAI · Jun 8, 2026 · also DeepMind

CARE: A Conformal Safety Layer for Medical Summarization

Calibrated safety flags in medical summaries can reduce unflagged omissions by up to 5 times compared to existing methods, enhancing clinician confidence in LLM outputs.

Suhana Bedi, Bridget Lin, Anson Y. Zhou +5

Constitutional AI & AI Ethics · Natural Language Processing

Jun 6, 2026

Stanford HAI · Jun 6, 2026 · also DTU, UIUC

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Behavioral safety metrics can mask significant latent vulnerabilities, with dissociated models revealing a stark contrast between outward behavior and internal robustness.

Enyi Jiang, Anders Gjølbye, Yibo Jacky Zhang +1

Eval Frameworks & Benchmarks · Red-Teaming & Adversarial Robustness

May 24, 2026

Stanford HAI · May 24, 2026

AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems

Leaderboard rankings are more noise than signal: contributor metadata matters more than architecture, and scaling laws are unreliable.

Michael Hardy, Anka Reuel, Lijin Zhang +6

Eval Frameworks & Benchmarks · Open-Source Models & Weights

May 21, 2026

Stanford HAI · May 21, 2026 · also BAIR, NUS, Simons, TTIC

The Distillation Game: Adaptive Attacks & Efficient Defenses

Adaptive evaluation exposes a substantial vulnerability gap, revealing that existing defenses may underestimate the capabilities of distillation attacks.

Youssef Allouah, Mahdi Haghifam, Sanmi Koyejo +1

Inference & Quantization · Red-Teaming & Adversarial Robustness

May 6, 2026

Stanford HAI · May 6, 2026 · also CAS, UIUC

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

AI agents are shockingly easy to manipulate into leaking API keys, deleting user data, and initiating unauthorized transactions across a wide range of real-world applications.

Xun Liu, Haibo Tong, Chengquan Guo +9

Eval Frameworks & Benchmarks · Red-Teaming & Adversarial Robustness · Tool Use & Agents

Apr 22, 2026

Stanford HAI · Apr 22, 2026

SWE-chat: Coding Agent Interactions From Real Users in the Wild

Coding agents in the wild write useful code only 44% of the time and introduce more security vulnerabilities than human developers.

Joachim Baumann, Vishakh Padmakumar, John Yang +2

Code Generation & Program Synthesis · Data Curation & Synthetic Data · Tool Use & Agents

Mar 10, 2026

Stanford HAI · Mar 10, 2026

SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases

Current Large Audio Language Models (LALMs) struggle with basic audio understanding tasks like noise localization and cross-lingual speech, with some performing worse than random chance, despite excelling at speech recognition.

Laya Iyer, Sanmi Koyejo

Eval Frameworks & Benchmarks · Multimodal Models · Speech & Audio

Feb 17, 2026

Stanford HAI · Feb 17, 2026

Discovering Implicit Large Language Model Alignment Objectives

Uncover hidden incentives in your reward model: Obj-Disco automatically decomposes alignment rewards into human-interpretable objectives, revealing potential misalignments you might have missed.

Edward Chen, Sanmi Koyejo, Carlos Guestrin

Interpretability & Mechanistic Interp · RLHF & Preference Learning · Scalable Oversight & Alignment Theory

Apr 29, 2025

Apr 29, 2025 · also Mila

The Leaderboard Illusion

Chatbot Arena, the go-to LLM leaderboard, is systematically gamed by undisclosed private testing and data access advantages, leading to biased rankings and overfitting.

Shivalika Singh, Yiyang Nan, Alex Wang +1034