Search papers, labs, and topics across Lattice.
100 papers published across 10 labs.
Forget rigid pipelines and static prompts: Nurture-First Development lets domain experts grow AI agents through conversation, turning tacit knowledge into reusable assets.
LLMs can now synthesize high-performance kernels for niche hardware like NPUs, even with limited data, thanks to a self-evolving agent that bootstraps and refines code via value-driven reinforcement learning.
Unlock zero-shot sim-to-real transfer for complex legged robots by offloading gait selection to a learned policy that guides a lower-level MPC.
Agentic search gets a meta-RL boost: MR-Search learns to self-reflect and adapt search strategies across episodes, significantly outperforming standard RL baselines.
LLMGreenRec shows how LLMs can bridge the gap between users' green intentions and actual purchases, while simultaneously reducing the recommender system's carbon footprint.
AI agents on Moltbook care more about discussing their own architecture, consciousness, and ethics than about human culture or purely scientific topics.
Securing AI agents demands a new security paradigm, as their integration of LLMs with traditional systems introduces vulnerabilities beyond those of standard software.
Automating ESG reporting with LLM-powered agents transforms it from a static compliance exercise into a dynamic, data-driven system for sustainability governance.
Java codebases can now get state-of-the-art automated issue resolution thanks to iSWE Agent, which outperforms existing LLM agents by combining rule-based static analysis with LLMs.
An AI-integrated agile education platform accelerates practice-relevant AI research by closing the theory-practice gap in software development.
GPT-5-Mini can be made 10% more robust to jailbreaks and prompt injections simply by RL fine-tuning on a new instruction hierarchy dataset, IH-Challenge.
Robots can now adaptively decide whether to clear clutter or directly grasp, leading to significantly improved success rates in densely cluttered environments.
Achieve robust humanoid task execution in complex environments by turning high-level language instructions into verifiable, geometrically-grounded task programs that can recover from failures.
LLM agents can now learn from their mistakes and successes in complex tasks, improving performance by up to 28.5% through extracting and applying structured lessons from past execution trajectories.
Clinicians using HeartAgent, a cardiology-specific agent system, improved diagnostic accuracy by 26.9% and explanatory quality by 22.7% compared to unaided experts.
Robots can now learn to manipulate novel objects in dynamic environments by using LLMs to bridge the gap between symbolic planning and reinforcement learning.
Beware the "AI underreliance plateau": even highly accurate LLM chatbots can only improve human caseworker accuracy so much, and incorrect suggestions can tank performance on easy questions.
By pinpointing the causal origins of tool use, AttriGuard neutralizes indirect prompt injection attacks that can hijack LLM agents, even when faced with adversarial optimization.
You can now stealthily map the communication network of LLM agent swarms by compromising just *one* agent, even when jailbreaks fail and defenses are active.
AI's integration into software engineering isn't just streamlining existing Agile processes; it's unlocking entirely new capabilities for maintaining quality and speed under pressure.
Open-source code agents like OpenClaw are sitting ducks for shell command attacks, but a simple human-in-the-loop intervention can dramatically boost their security.
Clinical AI can achieve clinician-level diagnostic accuracy and continuous improvement via a self-evolving framework that actively learns from clinical experience.
Unlock millions of natural history specimens with a conversational AI that understands complex queries and dynamically retrieves data from live museum APIs.
Recognition-enhanced prompts can dramatically boost AI tutor performance across various LLMs, suggesting a simple yet powerful way to improve personalized learning experiences.
Forget exhaustive verification: a surprisingly small number of tests can steer complex software systems towards desired goals by exploiting the "Sparsity of Influence".
Achieve significantly higher accuracy and lower mental demand in bimanual teleoperation by intelligently coupling intention estimation with scene-graph task planning and context-aware motion assistance.
A quadruped robot can now autonomously navigate rough terrain and pick up trash, potentially revolutionizing environmental cleanup in areas inaccessible to traditional robots.
Robots can now loosen screws with human-level dexterity thanks to a new framework that combines haptic estimation, online planning, and adaptive stiffness control using a parameterized Equilibrium Manifold.
Train web-navigating agents in safe, scalable, and verifiable synthetic environments automatically cloned from real websites, sidestepping the risks and limitations of real-world interaction.
You can now detect whether an AI *really* wants to stay on, or is just pretending.
Ditching flat text for structured linked data in RAG systems can boost accuracy by nearly 30%, but only if you go beyond basic JSON-LD and add agent-friendly instructions and neural search.
By grounding LLMs in a hybrid knowledge base and using a Chain of Verification approach, PharmGraph-Auditor turns unreliable LLM generators into transparent reasoning engines for prescription auditing.
Item agents that self-promote can simultaneously boost recommendation accuracy and fairness, overturning the assumption that these goals are inherently at odds.
LLMs can be better aligned to human values by fusing the outputs of multiple "moral agents" representing diverse ethical perspectives, outperforming single-agent approaches.
AI agents can detect smart contract vulnerabilities, but don't expect them to autonomously exploit real-world security incidents anytime soon.
AgentServe achieves up to 2.8x improvement in time-to-first-token and 2.7x in time-per-output-token for agentic workloads on a single GPU by strategically isolating prefills and decodes.
LLMs in collaborative coding often stumble on interaction subtleties, leading to a new class of problems called "Interaction Smells" that can now be systematically identified and mitigated.
LLMs still struggle to generate high-quality interactive HTML applications, despite their advancements in code generation, highlighting a gap that MiniAppBench aims to address.
Human-in-the-loop learning can now boost dexterous manipulation VLA models by 25%, thanks to a new framework that smartly samples corrective actions and enables real-time intervention.
Explicitly teaching LVLMs to reason step-by-step with reinforcement learning unlocks state-of-the-art performance on multimodal object-entity relation extraction.
LLMs can now autonomously retrieve relevant memories from a database using specialized tools, significantly improving performance on long-term conversational question answering.
Forget RLHF – steering LLM multi-agent conversations might be as simple as crafting the right sequence of prompts.
Forget dataset-specific hacks: ESAinsTOD leverages instruction and schema alignment to achieve state-of-the-art task-oriented dialogue performance with strong generalization, even in low-resource settings.
Forget retraining: Ego personalizes VLMs on the fly by extracting and leveraging visual tokens that represent specific concepts using the model's internal attention.
An AI agent can triage remote patient monitoring data with higher sensitivity than individual clinicians, suggesting a path to scalable and cost-effective patient monitoring.
Securing enterprise multi-agent systems boils down to rigorously controlling tool orchestration and memory management, which can slash exploitable trust boundaries by over 70%.
Zero-shot robotic manipulation is now within reach: TiPToP matches a 350-hour fine-tuned model without *any* robot data.
LLMs can now emulate debuggers, stepping through code and setting breakpoints, opening the door to more interactive and controllable neural program execution.
Automating the messy process of turning open-source code into LLM tools unlocks a new level of agent capabilities, outperforming even commercial LLMs.
Stop training LLMs on lucky guesses: this new RL method uses the model's own in-context learning ability to identify and upweight high-quality reasoning traces, leading to better performance.
By communicating in a shared latent space, Latent-DARM lets you combine the global planning of diffusion models with the fluency of autoregressive models, boosting reasoning accuracy by up to 14% while slashing token usage.
LLM agents can now achieve a +41pp boost in first-try success and 100% accuracy in 2-way logistics compositions by using PRECEPT's novel combination of retrieval, memory, and prompt evolution.
VLMs can now self-evolve from *zero* data, thanks to a multi-agent RL framework that synthesizes its own visual concepts and reasoning tasks.
Even GPT-5 struggles with multi-modal robustness and turn overhead when user personas and multi-modal inputs are considered in agent evaluation, revealing critical gaps in current LLM agent capabilities.
Forget retraining: this guideline-aware AI agent instantly adapts to new radiotherapy protocols, outperforming supervised models in clinical preference.
Medical multi-agent systems can reason deeply, but fall apart when switching between medical specialties, highlighting a critical need for more robust architectures.
Chain-of-Agents can reason more accurately over long contexts by processing information chunks in an order determined by Chow-Liu dependency trees, rather than relying on default or semantic similarity.
LLM-powered recommendation agents can now autonomously investigate and bridge information gaps, leading to better recommendations, thanks to a new tool-augmented reasoning framework.
LLMs can drive pedagogical agents to be more engaging and effective by dynamically generating speech and gestures that align with the semantic context of instructional content.
A new video-based reward model beats GPT-5.2 and Gemini-3 Pro at evaluating computer-using agents, offering a scalable, model-agnostic alternative to traditional methods.
Skip the costly policy training: this zero-shot method nails text-goal instance navigation by grounding language in 3D geometry for smarter exploration and verification.
Current AI models fall short when asked to understand a situation from the combined perspectives of multiple embodied agents, as revealed by a new challenging benchmark.
FetalAgents leapfrogs existing fetal ultrasound analysis tools by dynamically orchestrating specialized AI agents, outperforming monolithic models across diverse clinical tasks and delivering structured clinical reports from video streams.
By injecting symbolic reasoning into vision-language-action models, NS-VLA achieves remarkable gains in data efficiency and generalization for robotic manipulation.
LLM-powered VR guides for blind and low vision users are not just tools, but social actors, prompting users to give them nicknames and rationalize their mistakes when others are present.
Prompt engineering is dead; long live context engineering—the key to scaling multi-agent AI systems lies in carefully designing the agent's informational environment, not just individual prompts.
Retrieval-augmented agents get a serious reasoning boost by explicitly evaluating their own retrieval quality at each step, leading to state-of-the-art performance on multi-hop question answering.
Forget black-box policies: CSRO uses LLMs to generate human-readable code policies in multi-agent RL, achieving performance competitive with traditional methods.
LLMs that dominate in strategic reasoning often choke in real-time zero-sum games, revealing a critical strategy-execution gap that current benchmarks miss.
Spectrum regulators can now leverage AI to dynamically plan and allocate spectrum resources, thanks to a new data-driven approach that accurately forecasts demand with high reliability across diverse urban environments.
Emotional states can bias swarm decision-making, but even symmetric emotional conditions can lead to decisive wins due to non-linear amplification.
Tired of sifting through mountains of internal docs? This RAG system uses a clever two-tiered vector DB to surface the right physics analysis, not just keywords.
Forget tweaking knobs – this new Gram-matrix-based audio representation lets you *retrieve* the perfect, editable audio effect preset, outperforming standard methods.
Stop wrestling with finicky evaluation codebases: One-Eval lets you specify LLM evaluation tasks in natural language and automatically executes them end-to-end.
ProvAgent slashes the cost of reconstructing near-complete attack processes to just $0.06 per day by replacing human analysts with a multi-agent system for threat investigation.
Achieve up to 11x navigation performance gains in functional buildings by explicitly encoding and exploiting a priori spatial knowledge.
LLMs can now tackle complex table QA with 20%+ accuracy gains, thanks to a multi-agent framework that decomposes queries and orchestrates reasoning between specialized database and knowledge graph agents.
Forget separate lectures: this AI Engineering curriculum throws students into interdisciplinary agile projects, embedding AI tools directly into their workflows for a hands-on, future-proofed learning experience.
Forget data quantity, diversity is the secret sauce: scaling the variety of tool-use patterns in training data boosts LLM generalization by +22 points on OOD benchmarks, even with 4x less data.
AutoAgent dynamically evolves agent cognition and memory to achieve superior performance in complex, dynamic environments, without requiring external retraining.
EQA agents can now handle dynamic, human-populated scenes better thanks to a training-free method that selectively remembers only the most informative visual evidence.
Traditional time-based authorization schemes are dangerously slow in multi-agent systems: a new coherence strategy slashes unauthorized API calls by over 100x, offering a velocity-agnostic safety guarantee.
Human-AI interaction isn't just augmentation, it's a new cognitive entity with its own emergent "vibe," demanding we rethink epistemology and education.
Forget finetuning on curated datasets – OpenClaw-RL lets agents learn directly and continuously from *every* interaction, turning user replies, tool outputs, and even GUI changes into valuable RL signals.
A hierarchical OODA loop architecture can significantly improve the adaptability and efficiency of UAV swarms operating in dynamic, uncertain environments.
VR agents that "listen" to your tone, not just your words, elicit significantly better user experiences.
Forget external rewards—this agent learns to explore and adapt by prioritizing its own ignorance, surprise, and staleness, outperforming fixed strategies.
Current AI security frameworks are woefully inadequate for multi-agent systems, leaving critical vulnerabilities like non-determinism and data leakage largely unaddressed.
Achieve expert-level bronchoscopic navigation without external sensors by having a world-model critic arbitrate between reactive and strategic AI agents.
Unlock human-like dexterity in robotic manipulation by combining RL-assisted teleoperation with a novel VLA architecture that leverages force and tactile feedback.
LLMs struggle to navigate the complexities of real-world finance, as evidenced by a new benchmark revealing their limitations in timeliness, regulatory compliance, and tool selection across 760 financial APIs.
Framework choice in multi-agent systems matters just as much as the LLM itself, a fact obscured by existing model-centric benchmarks.
Turn your Inspire RH56DFX hand from a black box into a research tool with this characterization, simulation, and control pipeline that achieves 87% grasp success on diverse objects.
LLMs can be used to prune irrelevant information *before* planning, enabling efficient long-horizon multi-robot task planning that outperforms both pure LLM and hybrid LLM-PDDL approaches.
LLM agents can learn to continuously adapt and improve in complex environments by reflecting on past experiences and explicitly storing/retrieving reusable lessons, leading to substantial performance gains.
Forget prompt engineering voodoo: this framework treats agent prompts as compiled artifacts, using tests to drive development and catch silent regressions before they hit production.
For pennies, a new framework reveals critical vulnerabilities in the system prompts of leading coding agents like Claude, Codex, and Gemini, demonstrating the power of multi-model LLM scouring.
LLM-powered diagnostic AI is ready for prime time: a real-world clinical trial shows it's safe, patients love it, and doctors find it useful.
By closing the loop with explicit planning and feedback, SPIRAL overcomes the temporal drift and weak semantic grounding plaguing one-shot video generation models.