University of Washington

Modular training with BAR allows independent updates of domain experts, achieving superior performance without the pitfalls of catastrophic forgetting.

Jacob Morrison, Sanjay Adhikesaven, Akshita Bhagia +3

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Training Efficiency & Optimization

Apr 19, 2026

Microsoft Research 6d ago·also Google Research, UW, AI for Good, Department of Biochemistry Institute for Protein +1

RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design

RosettaSearch recovers up to 68% more structural fidelity in protein designs, transforming how we optimize sequences beyond traditional single-pass methods.

Meghana Kshirsagar, Allen Nie, Ching-An Cheng +5

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Scientific Discovery & Drug Design

Apr 15, 2026

1w ago·also UW

Complex Interpolation of Matrices with an application to Multi-Manifold Learning

Geometric matrix interpolation reveals hidden common structures in multi-view data, offering a new lens for multi-manifold learning.

Adi Arbel, Ronen Talmon

Architecture Design (Transformers, SSMs, MoE) Scientific Discovery & Drug Design

Apr 14, 2026

UW 1w ago·also Alpinference, Ben-Gurion University of the Negev

Universal NER v2: Towards a Massively Multilingual Named Entity Recognition Benchmark

Massively multilingual NER just got easier: UNER v2 offers a standardized benchmark for evaluating LLMs across diverse languages.

Terra Blevins, Stephen Mayhew, Marek Šuppa +11

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

Apr 13, 2026

1w ago·also Microsoft Research, UW

Discourse Diversity in Multi-Turn Empathic Dialogue

LLMs are twice as likely as humans to repeat the same support tactic in a conversation, but a simple RL reward for tactic novelty can fix it.

Hongli Zhan, Emma S. Gueorguieva, Javier Hernandez +3

Eval Frameworks & Benchmarks Natural Language Processing

Apr 9, 2026

AI2 2w ago·also UW, Cornell, JHU, Paul G. Allen School of Computer Science

WildDet3D: Scaling Promptable 3D Detection in the Wild

Forget training on closed sets: WildDet3D leverages geometric cues and diverse prompts to achieve SOTA 3D object detection across 13.5K categories in the wild.

Weikai Huang, Jieyu Zhang, Sijun Li +12

Computer Vision Multimodal Models Robotics & Embodied AI

UW 2w ago·also CMU ML, Columbia, CUHK, Harvard +2

Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding

Achieving robust brain decoding across subjects without any retraining could revolutionize how we interpret neural signals in diverse populations.

Mu Nan, Muquan Yu, Weijian Mai +10

Computer Vision Multimodal Models Scientific Discovery & Drug Design +1

Apr 8, 2026

UW 2w ago·also CAS, Hellogroup, Jilin, ShanghaiTech +1

Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

MLLMs can be tricked into missing 90% of harmful content simply by encoding it in images that humans can easily read.

Zhiheng Li, Zongyang Ma, Yuntong Pan +6

Constitutional AI & AI Ethics Multimodal Models Red-Teaming & Adversarial Robustness

UW 2w ago

DiffuMask: Diffusion Language Model for Token-level Prompt Pruning

Get 80% of your prompt length back without sacrificing accuracy using a diffusion-based pruning method that can mask multiple tokens at once.

Jyotika Singh, Fang Tu, Weiyi Sun +4

Inference & Quantization Natural Language Processing Reasoning & Chain-of-Thought

Apr 6, 2026

2w ago·also NVIDIA, UW, Cisco Research

GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads

Serving both image and video diffusion models on the same hardware? GENSERVE's step-level preemption and dynamic resource allocation can boost your service level agreement (SLA) attainment by up to 44%.

Zhangke Li, Triston Cao, Myungjin Lee

Computer Vision Distributed Systems & Hardware Inference & Quantization

China University of Mining and Technology-Beijing 2w ago·also Meta AI, UW

Rethinking Model Efficiency: Multi-Agent Inference with Large Models

Forget scaling laws: a large VLM strategically paired with a smaller model's reasoning tokens can rival the performance of a much larger, monolithic model.

SiXun Dong, Juhua Hu, Steven Li +1

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Multimodal Models

Apr 1, 2026

UW 3w ago

ProTPS: Prototype-Guided Text Prompt Selection for Continual Learning

Forget catastrophic forgetting: ProTPS leverages vision prototypes to guide text prompt learning, achieving near-upper-bound performance in continual learning scenarios.

Keith Fuller

Natural Language Processing Training Efficiency & Optimization

UW 3w ago·also SUTD, University of California Los Angeles

Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning

Forget hand-designed agent communication topologies: Agent Q-Mix learns decentralized communication strategies that boost accuracy and token efficiency in LLM multi-agent systems.

Eric Hanchen Jiang, Levina Li, Xiao Liang +8

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Mar 30, 2026

Wigner Research Centre for Physics 3w ago·also NVIDIA, UW, Dynaflex LTD, Eötvös Loránd University +3

Hunting for quantum advantage in electronic structure calculations is a highly non-trivial task

Claims of quantum advantage in electronic structure calculations must now contend with DMRG benchmarks achieving CAS(89,102) on Fe$_5$S$_{12}$H$_4^{5-}$, pushing the boundaries of classical computation.

Örs Legeza, Andor Menczer, Miklós Antal Werner +6

Scientific Discovery & Drug Design

Mar 29, 2026

UW 3w ago·also AI2, Microsoft Research, Stanford HAI, Bake AI +5

Emergent Social Intelligence Risks in Generative Multi-Agent Systems

Generative multi-agent systems spontaneously exhibit collusion and conformity, mirroring societal pathologies, even without explicit programming and bypassing individual agent safeguards.

Wenjie Wang, Yuchen Ma, Zichen Chen +4

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Mar 27, 2026

Tsinghua AI Mar 27, 2026·also UW

PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

Today's best MLLMs are stumped by PerceptionComp, a new video reasoning benchmark where answering questions requires piecing together visual evidence across time and space.

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Mar 11, 2026

UW Mar 11, 2026

COMIC: Agentic Sketch Comedy Generation

AI can now (almost) write and direct Saturday Night Live.

Computer Vision Multimodal Models Tool Use & Agents

AI2 Mar 11, 2026·also UW

Meta-Reinforcement Learning with Self-Reflection for Agentic Search

Agentic search gets a meta-RL boost: MR-Search learns to self-reflect and adapt search strategies across episodes, significantly outperforming standard RL baselines.

Teng Xiao, Yige Yuan, Hamish Ivison +6

Recommendation & Information Retrieval Tool Use & Agents World Models & Planning

UW Mar 11, 2026

"I followed what felt right, not what I was told": Autonomy, Coaching, and Recognizing Bias Through AI-Mediated Dialogue

AI interventions designed to combat ableism can backfire, as biased nudges were often rejected and increased negativity, while inclusive nudges proved more effective as scaffolding for learning.

Constitutional AI & AI Ethics Natural Language Processing

Mar 10, 2026

UW Mar 10, 2026

Understanding the Use of a Large Language Model-Powered Guide to Make Virtual Reality Accessible for Blind and Low Vision People

LLM-powered VR guides for blind and low vision users are not just tools, but social actors, prompting users to give them nicknames and rationalize their mistakes when others are present.

Natural Language Processing Tool Use & Agents

Mar 5, 2026

Google Research Mar 5, 2026·also Apple ML, UW

Dark3R: Learning Structure from Motion in the Dark

See in the dark: Dark3R unlocks structure from motion at signal-to-noise ratios below -4dB, where existing methods completely break down.

Andrew Y Guo, SaiKiran Tedla, Kyros Kutulakos

Computer Vision Robotics & Embodied AI Training Efficiency & Optimization

Mar 3, 2026

UW Mar 3, 2026

Extending the Formalism and Theoretical Foundations of Cryptography to AI

Existing AI agent permissioning schemes are hard to compare, so this paper provides a formal foundation and reveals a fundamental conflict between training data confidentiality and agent completeness.

F. Villa, F. Durak, Tadayoshi Kohno +2

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Mar 2, 2026

Mar 2, 2026·also AI2, MIT CSAIL, NVIDIA, Stanford HAI +5

Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Learning robotic reward functions from a million trajectories reveals that comparing entire trajectories, not just individual frames, unlocks better generalization and learning from suboptimal data.

Anthony Liang, Jiahui Zhang, Minyoung Hwang +11

RLHF & Preference Learning Robotics & Embodied AI

AI2 Mar 2, 2026·also UW, Fred Hutchinson Cancer Center, Independent Researcher, Pancreatic Cancer Action Network

PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology

LLMs still struggle with factual accuracy in specialized medical domains like pancreatic cancer, with hallucination rates varying wildly and web search integration failing to guarantee better responses.

Scott Geng, Fatima Zelada-arenas, Alejandra Alvarez +6

Eval Frameworks & Benchmarks Natural Language Processing Scientific Discovery & Drug Design

Feb 26, 2026

Mila Feb 26, 2026·also UW, Clemson University

Towards Dynamic Dense Retrieval with Routing Strategy

Forget full fine-tuning: this dynamic routing strategy lets you adapt dense retrieval to new domains while using just 2% of the parameters.

Zhan Su, Fengran Mo, Jinghan Zhang +4

Natural Language Processing Recommendation & Information Retrieval Training Efficiency & Optimization

Feb 25, 2026

Feb 25, 2026·also UW

Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels

Hyperspectral video, previously limited by motion artifacts and poor photon utilization, now achieves real-time capture and improved fidelity thanks to active illumination and coded-exposure pixels.

Dhruv Verma, Andrew Qiu, Roberto Rangel +6

Computer Vision Robotics & Embodied AI

UW Feb 25, 2026

Revisiting the Bertrand Paradox via Equilibrium Analysis of No-regret Learners

No-regret learning in repeated Bertrand games can lead to surprisingly high prices, challenging classical game theory's low-price predictions.

Arnab Maiti, Junyan Liu, Kevin Jamieson

Natural Language Processing

Feb 22, 2026

NVIDIA Feb 22, 2026·also AI2, Meta AI, UW

TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

Unlock robot learning with hidden knowledge: TOPReward extracts surprisingly accurate task progress signals directly from VLM token probabilities, bypassing the need for explicit reward engineering.

Shirui Chen, Cole Harrison +7

Multimodal Models RLHF & Preference Learning Robotics & Embodied AI

UW Feb 22, 2026·also Cornell

Learning to Detect Language Model Training Data via Active Reconstruction

Forget passively analyzing model outputs – this new attack actively *trains* the model to regurgitate specific texts, revealing its training data with surprising accuracy.

Junjie Oscar Yin, John X. Morris +6

Data Curation & Synthetic Data Natural Language Processing Red-Teaming & Adversarial Robustness

UW Feb 22, 2026·also UMass

Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

Maryam Amirizaniani

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Feb 19, 2026

UW Feb 19, 2026

Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking

LMs can learn some human-like linguistic biases from synthetic data, but surprisingly fail to reproduce the strong object preference seen in differential argument marking across human languages.

Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld

Data Curation & Synthetic Data Natural Language Processing

Feb 17, 2026

UW Feb 17, 2026·also UChicago

Unforgeable Watermarks for Language Models via Robust Signatures

Stop worrying about false positives: this watermarking scheme guarantees unforgeability and recoverability, ensuring content is linked exclusively to its generating model even under substitution attacks.

Huijia Lin, Kameron Shahabi, Min Jae Song

Natural Language Processing Red-Teaming & Adversarial Robustness

Feb 16, 2026

Microsoft Research Feb 16, 2026·also AI2, UW, UMD

Cold-Start Personalization via Training-Free Priors from Structured World Models

Forget RL fine-tuning: this paper shows you can beat it at cold-start personalization with a tiny model and clever Bayesian inference over structured preference priors.

Avinandan Bose, Shuyue Stella Li, Pang Wei Koh +4

Recommendation & Information Retrieval World Models & Planning

Feb 11, 2026

AI2 Feb 11, 2026·also CMU ML, NVIDIA, UW

MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation

Forget synthetic benchmarks that don't translate: MolmoSpaces offers 230k diverse, simulator-agnostic environments with 130k annotated objects, showing a remarkable 0.96 sim-to-real correlation for robot policies.

Wilbert Pumacay, Omar Rayyan, Max Argus +22

Eval Frameworks & Benchmarks Robotics & Embodied AI World Models & Planning

Jan 28, 2026

AI2 Jan 28, 2026·also UW

SERA: Soft-Verified Efficient Repository Agents

Open-weight coding agents can now be cheaply and rapidly specialized to private codebases, thanks to a new supervised finetuning method that slashes training costs by over 25x.

Ethan Shen, Danny Tormoen, Saurabh Shah +2

Code Generation & Program Synthesis Open-Source Models & Weights Training Efficiency & Optimization

Jan 22, 2026

UW Jan 22, 2026·also Mila, MIT CSAIL, Cardiovascular Research Center, Eastern New Mexico Medical Center +3

Foundation models for electrocardiogram interpretation: clinical implications.

This study establishes SSL as a promising paradigm for ECG analysis, particularly in settings with limited annotated data, enhancing accessibility, generalizability, and fairness in AI-driven cardiac diagnostics across diverse clinical environments and questions.

A. Nolin-Lapalme, Achille Sowa, Jacques Delfrate +30

Dec 22, 2025

Microsoft Research Dec 22, 2025·also UW, Ant Group, Cornell, Fudan +6

Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA

Moxin 7B and its variants (VLM, VLA, Chinese) offer a new suite of fully transparent, open-source multimodal models, pushing beyond simple weight sharing to enable deeper customization and collaborative research.

Pu Zhao, Xuan Shen, Zhenglun Kong +16

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Open-Source Models & Weights

Oct 23, 2025

UW Oct 23, 2025

VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation

Robots can now navigate more reliably and across different bodies (wheeled vs. legged) thanks to a hierarchical model that separates high-level planning from low-level physical constraints.

Mateo Guaman Castro, Sidharth Rajagopal, Daniel Gorbatov +9

Multimodal Models Robotics & Embodied AI World Models & Planning

Aug 15, 2025

Microsoft Research Aug 15, 2025·also NVIDIA, UW, Cambridge, DTU +3

Accelerating Biomolecular Modeling with AtomWorks and RF3

Open-source biomolecular modeling just got a boost: RF3 closes the gap with AlphaFold3 in structure prediction, thanks to the new AtomWorks data framework.

Nathaniel Corley, Simon V. Mathis, Rohith Krishna +2715

Data Curation & Synthetic Data Scientific Discovery & Drug Design Training Efficiency & Optimization

Aug 11, 2025

AI2 Aug 11, 2025·also Microsoft Research, NVIDIA, UW

MolmoAct: Action Reasoning Models that can Reason in Space

Robot foundation models can achieve state-of-the-art performance by explicitly reasoning about spatial plans as editable trajectory traces, rather than directly mapping perception to control.

Jason Lee, Jiafei Duan, Haoquan Fang +1666

Reasoning & Chain-of-Thought Robotics & Embodied AI World Models & Planning

Aug 6, 2025

UW Aug 6, 2025

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

Train better aligned LLMs with 10% of the data by strategically focusing on the most difficult preference comparisons.

Xuan Qi, Rongwu Xu, Zhijing Jin

Data Curation & Synthetic Data RLHF & Preference Learning Training Efficiency & Optimization

May 27, 2025

UW May 27, 2025·also SambaNova

SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge

Despite claims of safety alignment, state-of-the-art LLMs still spill the beans on hazardous scientific knowledge at an alarming rate, failing nearly 80% of the time on a new regulation-grounded benchmark.

Fengqing Jiang, Fengbo Ma, Zhangchen Xu +76

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Mar 5, 2025

UW Mar 5, 2025·also Mila, MIT CSAIL, Cardiovascular Research Center, Eastern New Mexico Medical Center +3

Foundation models for generalizable electrocardiogram interpretation: comparison of supervised and self-supervised electrocardiogram foundation models

Self-supervised learning beats supervised learning for ECG interpretation when labeled data is scarce, unlocking more robust and generalizable AI-driven cardiac diagnostics.

A. Nolin-Lapalme, Achille Sowa, Jacques Delfrate +30

Open-Source Models & Weights Scientific Discovery & Drug Design Training Efficiency & Optimization