Search papers, labs, and topics across Lattice.
We track OpenAI, DeepMind, Anthropic, and 17 other labs daily, with AI-powered summaries, trend charts, and a weekly digest.
We read everything so you don't have to. One email, zero noise.
GAAP lets you tell your AI assistant *exactly* what it can and can't share, and then *guarantees* it won't break those rules, even if the AI is compromised.
Forget complex fixed-point machinery: this work offers a dramatically simpler and more efficient route from external regret to Φ-regret minimization.
LLMs still struggle to reason in context when cultural and linguistic nuances are involved, achieving only 44% accuracy on a new grounded benchmark spanning 14 languages.
Stop fragmented land cover predictions: SSDM leverages global geospatial embeddings to guide local feature extraction, achieving state-of-the-art performance in high-resolution remote sensing mapping.
DPP-based Monte Carlo integration can offer variance reduction, but choosing the right DPP—fixed vs. tailored to the integrand—determines whether you get a biased but faster converging estimator or an unbiased but standard-rate estimator.
Turns out, cyclic equalizability of two words over any alphabet boils down to a simple check: do they have the same counts of each symbol?
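The criterion above is simple enough to sketch directly. A minimal illustration, assuming the paper's characterization reduces exactly to comparing per-symbol counts (Parikh vectors); the function name is hypothetical:

```python
from collections import Counter

def cyclically_equalizable(u: str, v: str) -> bool:
    """Check the stated criterion: two words are cyclically
    equalizable iff each symbol occurs the same number of
    times in both (i.e. their Parikh vectors agree)."""
    return Counter(u) == Counter(v)

# Same symbol counts, different order: passes the check.
print(cyclically_equalizable("abba", "baab"))  # True
# Different counts of 'a' and 'b': fails the check.
print(cyclically_equalizable("aab", "abb"))  # False
```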
Training-free diffusion models can now harmonize satellite imagery across diverse domains, enabling scalable remote-sensing synthesis without retraining.
LLM agents, like humans, suffer from Actor-Observer Asymmetry, but this work shows how dialectical training can mitigate the bias and improve fault resolution.
Bridging the offline/streaming gap in ASR is now more attainable: a unified RNN-Transducer, trained with mode-consistency regularization, delivers streaming accuracy at low latency while preserving offline performance.
Multi-event video generation gets a 33% quality boost with TS-Attn, a training-free attention mechanism that dynamically aligns video content with complex temporal prompts.
DPC shatters the traditional distributed file system bottleneck by turning a cluster's memory into a single, coherent cache, slashing data redundancy and boosting performance by up to 12.4x.
SpeechLLMs betray their hallucinations through tell-tale attention patterns, enabling detection without needing expensive human-annotated data.
Asymmetric collaboration between tiny on-device models and cloud models orders of magnitude larger can unlock responsive AI on extremely resource-constrained devices.
TurboQuant's claimed advantages over RaBitQ in quantization don't hold up under rigorous, reproducible comparison, raising questions about its practical utility.
Multilingual LLMs exhibit a surprising "American bias," even when prompted in other languages, and instruction tuning makes it worse.
Generate navigable, photorealistic simulations of real-world cities, complete with consistent weather and lighting, using only geo-registered video data.
Semantic masks can be better than raw pixels for learning robust robot policies by filtering out distracting visual noise and focusing on essential dynamics.
Uncover misleading half-truths by pitting a Politician agent against a Scientist agent in a debate moderated by a Judge, revealing what's left unsaid.
Achieve state-of-the-art person re-identification with only 20% of the data by explicitly teaching the model to "think" before matching identities.
LLMs fix 26% more bugs when given access to intermediate runtime states via simulated debugging, suggesting that outcome-level failure symptoms alone are insufficient for effective automated program repair.
Autoregressive speaker extraction, previously confined to offline processing, can now achieve real-time performance without sacrificing intelligibility thanks to chunk-wise interleaved splicing.
Get the performance boost of expensive sampling-based RL policies for a fraction of the compute by learning to prune action candidates early in the diffusion denoising process.
Entropy regularization makes planning provably easy: SmoothCruiser achieves polynomial sample complexity in MDPs where standard methods fail.
Achieve state-of-the-art CTR prediction without increasing model parameters by recursively reusing shared layers during training.
Achieve physically plausible and structurally stable human-object interaction video synthesis with a surprisingly efficient architecture that trains with a dual-stream approach but infers with only the RGB stream.
Freezing a Stable Diffusion backbone and injecting CLIP and BLIP features lets you beat the state-of-the-art in zero-shot sketch-based 3D shape retrieval, without any costly retraining.
Forget chasing the biggest LLM – this benchmark reveals that smaller models (<2B params) can deliver 3x better energy efficiency and faster ROI in real-world industry deployments.
Kernel launch overhead is a bigger bottleneck than you think: GPUOS achieves up to 15.3x speedup by fusing operations at runtime.
Uncover the hidden assumptions baked into LLM responses with a new interactive system that lets you explore alternative conceptual framings and values.
Trustworthy super-resolution in surgery is now achievable, with a model-agnostic method that identifies and mitigates unreliable reconstructions in real-time.
Time-to-collision metrics miss critical collision risk information, but a new 2D acceleration-based metric anticipates collisions far better.
Current red-teaming efforts miss the forest for the trees: ARES reveals that safety failures often stem from a systemic breakdown between the LLM *and* the reward model, not just the LLM itself.
MV-HGNN achieves superior 3D shape retrieval by effectively leveraging geometric dependencies and semantic alignment, outperforming existing methods in zero-shot settings.
ZKP proving, previously bottlenecked by MSM and NTT operations, can now achieve up to 10x higher throughput on TPUs thanks to a novel framework that reformulates ZKP kernels for AI-ASIC execution.
Targeted neuro-symbolic integration can reduce content bias in syllogistic reasoning, achieving over 94% accuracy while cutting content effects by 16%.
RL fine-tuning of discrete diffusion models can be made stable and effective by optimizing on the final denoised sample, unlocking SOTA results on text-to-image generation and OCR.
Discrete diffusion models can be sped up by 14x by intelligently choosing which tokens to sample at each step, without sacrificing accuracy.
FUSE achieves verification quality on par with semi-supervised methods, all without needing any labeled data.
Untangling evidence validation from text generation, ArbGraph offers a way to build more reliable long-form RAG systems by explicitly resolving factual conflicts *before* generation even begins.
VLAs can learn to adapt to new environments at test time without any fine-tuning, achieving significant performance gains on robotic manipulation and Atari games.
Multimodal LLMs struggle with multi-digit multiplication, with accuracy plummeting as arithmetic complexity increases, revealing a critical gap in computational capabilities.
Debloating tools, intended to shrink code and improve security, can actually *add* code or remove essential functionality, with dynamic methods being overly aggressive and static methods overly conservative.
Achieve real-time video understanding with transparent reasoning: a new approach aligns response timing with visual evidence, offering a breakthrough for online video LLMs.
Even the best LLMs still stumble on Olympiad-level math, and retrieval quality is the bottleneck for retrieval-augmented problem solving, according to the new MathNet benchmark.
LLMs can reason better over noisy and distributed information when you break down RAG into specialized agent roles for summarization, extraction, and reasoning.
The dream of universal representations across modalities may be just that: scaling up datasets and relaxing constraints reveals that models trained on different modalities learn rich, but fundamentally different, representations of the world.
Multi-agent LLM systems for idea generation can backfire, with smarter models and more communication leading to *less* diverse ideas due to structural coupling.
Allowing multiple support strategies in a single utterance can dramatically enhance the quality of emotional support conversations, leading to more effective dialogue outcomes.
Modular training with BAR allows independent updates of domain experts, achieving superior performance without the pitfalls of catastrophic forgetting.
Language models can now learn to forget strategically, achieving 2-3x memory efficiency without sacrificing reasoning accuracy.
Academic paper highlights, often overlooked, can substantially improve unsupervised keyword extraction when combined with abstracts.
Object hallucination in LVLMs can be significantly reduced *after* training, without any extra data or compute.
Q-learning converges faster than previously thought, thanks to a tighter bound derived from a novel stochastic switching system representation of the Bellman error.
Reasoning LLMs can now produce well-calibrated confidence estimates without labels or repeated sampling, unlocking more reliable real-world deployment.
Domain-specific continual pre-training lets a 7B model punch *way* above its weight, beating a 24B generalist on medical tasks by 3.5x.
Current ML model security tools miss nearly half of malicious models because they ignore runtime behavior, but a new dynamic analysis approach closes this gap.
Achieve robust 3D reconstruction from arbitrary viewpoints and unordered images by explicitly coupling diffusion-based generation with geometric scene understanding.
Test-time training for LLMs can finally scale: interleaving policy refinement with periodic critic recalibration unlocks sustained performance gains and avoids diversity collapse.
LLMs can't reliably automate the creation of executable business workflows, even with agentic assistance, leaving a huge opportunity for improvement.
LLMs can compile GUI code, but they fail spectacularly at generating playable, logically correct applications, highlighting a critical gap in current code generation capabilities.
LLMs are subtly reshaping peer review, making reports longer and more polished, but at the cost of critical depth and focus on originality.
Safe RL and continual learning are often at odds: maintaining safety constraints can lead to catastrophic forgetting in changing environments.
Finally, a unified, open-source framework lets you train Vision-Language-Action models from scratch or fine-tune pretrained backbones, achieving state-of-the-art tabletop manipulation performance.
Claims about the robustness and practicality of counterfactual explanation methods for recommender systems don't always hold up when subjected to a unified, comprehensive benchmark.
Naive attention-based filtering for edge-cloud inference is suboptimal under tight bandwidth constraints; prioritizing semantic diversity in transmitted embeddings yields surprisingly large accuracy gains.
Personalized federated learning can now handle the messy reality of heterogeneous industrial data, enabling more accurate failure time predictions across diverse clients.
LLMs struggle to generate obfuscated XSS payloads that reliably preserve runtime behavior, suggesting current models may not be ready for prime time in adversarial security data generation.
Learned critics in RLHF can actually *increase* variance and hurt performance in sparse-reward settings, but a simple explained variance metric can tell you when to ditch the critic and get better results.
By modeling transitions directly in the semantic code space and injecting LLM-verified priors, CAST uncovers latent item complementarity for sequential recommendation with significant performance gains and training acceleration.
CNPs are provably inconsistent with true stochastic processes, but this work shows *how much* they deviate, with a tight O(1/n²) bound on the conditioning consistency gap.
CKGE benchmarks overestimate performance by up to 25% because they fail to account for "entity interference," a newly identified phenomenon where embeddings of new entities disrupt previously learned relationships.
Node embeddings aren't just about node attributes: proximity and structural features play a surprisingly large role in shaping them.
Uncover hidden performance disparities in your ML models with FairTree, a new auditing tool that pinpoints fairness issues across continuous, categorical, and ordinal features while dissecting bias and variance contributions.
Naive neural operator estimates of solution functional quantities can be significantly biased, but this paper provides a surprisingly simple debiasing technique to fix it.
GNN performance on heterophilic graphs suffers because of inductive subgraphs acting as spurious shortcuts, a problem that can be solved by causally disentangling these subgraphs.
Arabic LLMs can speak the language of finance, but they often fail to reason about it, especially when it comes to causality and generation.
Escaping the tyranny of Bellman's curse, a new method leverages multi-step transitions to achieve higher-order accuracy in continuous-time policy evaluation, outperforming traditional one-step recursion.
LLMs can be effectively combined with graph-based methods to capture both semantic and structural information in tables, leading to state-of-the-art performance in table annotation tasks.
Forget relying on implicit reasoning: A-MAR's explicit reasoning plans unlock better artwork understanding by strategically retrieving relevant evidence.
Teaching LLMs to perform arithmetic on images unlocks a new level of grounded reasoning, paving the way for robots that can understand and manipulate the world more like humans.
Generative models for mobility data, previously thought to be private, are vulnerable to membership inference attacks, highlighting the need for more robust privacy evaluations.
Resolving semantic conflicts between synonymous prompts and across categories dramatically improves the stability and accuracy of open-vocabulary semantic segmentation, all without requiring any additional training.
LLMs that ace safety quizzes still fail to avoid hazards in the real world, revealing a dangerous gap between passive recognition and active mitigation.
Forget generic assistants – EgoSelf learns your habits from your first-person view data to predict your future interactions.
Despite growing concerns about data contamination, current black-box methods are essentially useless for detecting if an LLM has been trained on specific copyrighted material.
Noisy multimodal preference datasets are holding back reward model performance, but DT2IT-MRM offers a scalable curation strategy that achieves state-of-the-art results.
Achieve 50% parameter reduction in LLaMA-2-7B with minimal performance loss and no fine-tuning, thanks to a new global gating-based structured pruning method.
AI can now provide real-time feedback on coding consistency in qualitative research, previously a manual and drift-prone process.
LLMs struggle to identify relevant legal issues with high precision, but a neuro-symbolic approach using sparse linear models over LLM-generated analytical factors can boost performance by 30-40%.
Forget painstakingly gathering human feedback for image editing models – this framework uses a VLM to automatically score edits and align diffusion models with human preferences.
Stop optimizing generative engines in isolation: MAGEO learns reusable editing strategies that dramatically improve visibility and citation fidelity across diverse engines.
EWS systems can systematically misallocate resources, flagging younger, male, and international students for support at higher rates than their older, female, and domestic counterparts, even when their actual risk is comparable.
Forget scaling laws: strategically equipping small language models with tools delivers a better performance/cost tradeoff than simply scaling up or deploying multi-agent systems.
Discovering stability and receptivity in complex systems no longer requires known equations, thanks to a new neural operator framework that learns directly from data.
Synchronized aerial imagery unlocks dense, geometrically consistent BEV semantic mapping of dynamic road scenes, even from ego-centric sensors alone.
Distributed ML slashes energy consumption in 6G IoT networks by up to 70% without sacrificing prediction accuracy, offering a greener path forward.
Experimental data can resolve discrepancies in MOF property predictions, with a multimodal transformer leveraging XRD patterns to distinguish between samples sharing the same framework.
LLM agents are surprisingly inept at Capture the Flag challenges, with even the best models only completing 35% of checkpoints, revealing critical gaps in their ability to perform realistic cybersecurity tasks.
LLMs aren't just swayed by information; they actively seek social acceptance, making them vulnerable to manipulation in multi-agent settings.
LLM agents can reliably infer each other's "warmth" and "competence" from interaction histories, leading to significantly better coordination in complex multi-agent settings.