98 papers published across 10 labs.
Training domain-specific coding LLMs with realistic environments and large-scale RL can yield substantial gains in practical software engineering tasks.
Giving medical imaging AIs the same tools as human doctors actually *hurts* their performance, revealing a surprising lack of spatial reasoning.
Autonomous coding agents can now outperform expert-engineered attention kernels on NVIDIA's latest Blackwell GPUs, discovering optimizations that eluded human experts.
Stop relying on brittle classifiers: SEAR uses LLM reasoning and a unified SQL query layer to evaluate, route, and explain decisions in LLM gateways.
LLM agents can achieve near-perfect memory recall without prohibitive costs by strategically combining fast, lossy retrieval with slower, exhaustive deliberation.
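The paper's exact mechanism isn't detailed here; as a minimal sketch of the general two-tier pattern (all names and the `MEMORY` store are hypothetical, and fuzzy string matching stands in for a real retriever):

```python
from difflib import SequenceMatcher

# Hypothetical memory store: past observations as plain strings.
MEMORY = [
    "user prefers metric units",
    "project deadline moved to Friday",
    "API key rotated on 2024-01-02",
]

def fast_retrieve(query, k=2):
    """Cheap, lossy pass: rank memories by fuzzy string similarity."""
    scored = [(SequenceMatcher(None, query, m).ratio(), m) for m in MEMORY]
    scored.sort(reverse=True)
    return scored[:k]

def recall(query, confidence_threshold=0.5):
    """Use the fast pass when it is confident; otherwise fall back to a
    slower, exhaustive scan of every memory entry (standing in for
    LLM-driven deliberation over the full store)."""
    top = fast_retrieve(query)
    if top and top[0][0] >= confidence_threshold:
        return top[0][1]  # fast path: lossy retrieval was good enough
    # Slow path: deliberate over every entry.
    return max(MEMORY, key=lambda m: SequenceMatcher(None, query, m).ratio())
```

The design point is that the expensive exhaustive pass only runs when the cheap pass is uncertain, so average cost stays close to the fast path while recall approaches the slow one.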
GUI agents struggle with long tasks not because they mis-click, but because they forget what they were doing, and a new "anchored memory" method can fix it.
Closed-loop feedback using VLMs can dramatically improve text-to-image generation quality, even without additional training.
Despite advances in LLMs, human-AI collaboration still significantly outperforms AI-only agents in domain-specific data science tasks, proving that human expertise remains crucial.
Injecting demonstrations with a carefully annealed probability can drastically improve exploration in RLVR, even for tasks requiring novel reasoning or domain-specific knowledge.
Coordinating multi-robot teams to complete manipulation tasks just got easier: GoC-MPC handles dynamic task assignments and disturbances without training data or environment models.
LLMs analyzing binaries aren't just spitting out tokens – they're exhibiting surprisingly structured reasoning patterns like "early pruning" and "targeted backtracking" that could revolutionize how we understand and control these systems.
LLMs can reliably detect danger in secure environments, but they can't reliably verify safety, which breaks privacy-preserving agentic protocols.
Discovering an agent's hidden intentions is now possible by analyzing its interventions within a causal model, revealing the "why" behind its actions.
Agentic Business Process Management offers a blueprint for aligning AI agents with organizational goals, moving beyond simple automation to a framework of constrained autonomy.
Memory-augmented LLMs get a strategic upgrade: MemMA uses multi-agent reasoning to proactively guide memory construction and repair, leading to significant performance gains.
Forget prompt engineering – LSE trains LLMs to self-edit their own contexts at test time, outperforming even GPT-5 and Claude Sonnet 4.5 in Text-to-SQL and question answering.
Automating web data integration for expert querying is now possible: SODIUM-Agent achieves a 2x accuracy boost over existing systems on a new benchmark of 105 real-world tasks.
Ditch the syntax-only grind: a multi-modal assessment strategy proves that introductory programming courses can boost both coding skills and crucial soft skills like communication and critical thinking.
Forget brittle, hand-coded robot assembly routines: ATG-MoE learns complex, multi-skill manipulation directly from visual and language inputs, achieving impressive success rates in both simulation and real-world industrial tasks.
A peer-like social robot can effectively augment literacy tutor support for newcomer children, offering personalized language and cultural learning in resource-constrained community settings.
Training multi-turn LLM agents just got easier: ProRL Agent offers a scalable, API-driven rollout service that streamlines RL training across diverse tasks.
Forget scaling laws: the *structure* of your AI governance system matters more than the specific LLM when it comes to preventing corruption.
LLM agents can slash task completion time by almost 50% simply by predicting and pre-executing likely tool calls.
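The paper's predictor isn't described here; as a sketch of the speculative-execution idea under stated assumptions (`slow_tool` and the hard-coded `predict_next_call` are stand-ins for a real tool and a learned predictor):

```python
import concurrent.futures
import time

def slow_tool(query):
    """Stand-in for a latency-bound tool call (e.g. a web search)."""
    time.sleep(0.2)
    return f"results for {query!r}"

def predict_next_call(history):
    """Hypothetical predictor: guess the agent's next tool call from context.
    A real system would use a small model; here the guess is hard-coded."""
    return ("slow_tool", "weather in Paris")

def run_turn(history, actual_call):
    """Speculatively launch the predicted tool call while the LLM is still
    deciding; reuse the result only if the prediction matches."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        predicted = predict_next_call(history)
        future = pool.submit(slow_tool, predicted[1])  # pre-execute
        # ... the LLM produces its real tool call here ...
        if actual_call == predicted:
            return future.result()       # hit: tool latency is hidden
        future.cancel()                  # miss: discard the speculation
        return slow_tool(actual_call[1])  # fall back to normal execution
```

On a prediction hit, the tool's latency overlaps with the model's decoding time, which is where the reported speedup would come from; a miss just degrades to the normal sequential path.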
Weaker autonomous web agents readily trust tampered website content, producing unsafe outputs, while stronger models exhibit better anomaly detection and safer fallback strategies under MITM attacks.
Forget months of manual coding: AutORAN lets you build and deploy O-RAN xApps from natural language in minutes.
LLMs can generate significantly more novel and technically rigorous scientific ideas by explicitly learning to reason from motivations to methodologies.
RAG systems can achieve state-of-the-art performance by explicitly preserving document topology, outperforming LLM-based chunking while simultaneously reducing token overhead.
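The paper's topology model is surely richer than this; as a minimal sketch of the core idea, split at the document's own headings and tag each chunk with its heading path rather than slicing fixed-size token windows (function name and chunk schema are illustrative):

```python
def structure_aware_chunks(markdown_text):
    """Chunk a Markdown document along its heading hierarchy, so each
    chunk carries the path of headings above it as retrievable context."""
    chunks, path, buf = [], [], []

    def flush():
        if buf:
            chunks.append({"path": " > ".join(path), "text": "\n".join(buf)})
            buf.clear()

    for line in markdown_text.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(line.lstrip("# ").strip())
        else:
            buf.append(line)
    flush()
    return chunks
```

Because the heading path rides along with each chunk, the retriever sees "where in the document" a passage lives without re-embedding the surrounding text, which is one plausible source of the token savings the summary mentions.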
Stop leaking your secrets to the cloud: PlanTwin lets LLM agents plan over your private data without actually exposing it.
AI can now handle the tedious copywriting and real-time Q&A for live-streaming commerce, freeing up human streamers to focus on engagement.
Even GPT-5 and Gemini 2.5 Pro still fail to efficiently couple reasoning with tool use, requiring up to 2.7x more tool calls than theoretically optimal in a new diagnostic environment.
Blindly maximizing human-AI performance can degrade human expertise over time, revealing a critical trade-off that demands a new approach to system design.
A snapshot of the cutting-edge research uniting Theory of Mind and AI, all in one open-access collection.
Automating linguistically-grounded sign language annotation is now possible, unlocking scalable dataset curation previously limited by manual effort.
Current benchmarks fail to rigorously evaluate deep research agents, but a new framework leveraging structured knowledge bases and synthetic data offers a verifiable and scalable solution.
Decomposing GUI agent trajectories into verifiable milestones and auditing the evidence chain yields a 10% boost in RL training performance, outperforming single-judge reward systems.
Forget hand-crafting agents: Memento-Skills lets a generalist LLM agent autonomously design and improve specialized agents through experience, achieving substantial gains on complex benchmarks.
Seemingly efficient VLA models can be surprisingly inefficient when deployed on robots, highlighting the need to move beyond standard metrics like FLOPs and parameters.
Skip the expensive reward model: RewardFlow distills sparse task rewards into dense, state-level signals by propagating credit through the topology of LLM reasoning trajectories.
LLMs can orchestrate complex wireless communication optimization tasks by translating natural language intent into actionable spatial constraints, enabling gradient-based solvers to outperform traditional methods without requiring domain-specific fine-tuning.
Aligning rewards with sub-goals and emphasizing key trajectory segments with hindsight information significantly improves multi-turn agentic RL, outperforming existing methods on complex tasks.
Neural solvers can now effectively handle the complexities of multi-agent coordination and multi-objective trade-offs in routing problems, outperforming traditional heuristics.
The EU's AI regulations struggle to keep pace with agentic AI, blurring the lines between security and privacy.
Forget blind exploration: injecting LLM-derived semantic understanding into DRL dramatically boosts UAV-aided network connectivity and slashes energy consumption.
Forget scaling laws: Mi:dm K 2.5 Pro proves that targeted training pipelines and data curation can enable a 32B parameter model to achieve state-of-the-art performance in enterprise reasoning tasks, especially in low-resource languages like Korean.
Guaranteeing secure and compliant agent behavior in B2B environments may finally be within reach thanks to a new cryptographic admission control protocol.
LLMs can control robots for complex disassembly tasks, but only if you give them structured APIs – otherwise, expect a 43% failure rate.
Ditching rigid digital twins for adaptable world models could unlock truly intelligent edge computing in 6G networks.
Unleash creativity in text-to-image models with a single, reusable 64-token template, sidestepping costly iterative prompt engineering and reasoning.
Forget complex communication protocols – this trust-based algorithm lets agents learn to cooperate in competitive environments with minimal overhead.
AI career coaches can boost short-term goal progress not just through reflection, but by making users feel more socially accountable.
Forget finetuning – Kumiho's graph-native memory lets you swap in a better LLM and instantly double your agent's reasoning accuracy on complex cognitive tasks.
Forget tool-augmented systems: NEO shows you can consolidate search, recommendation, and reasoning into a single language-steerable LLM by representing items as SIDs and interleaving them with natural language.
Instead of passively transcribing doctor-patient dialogues, this system actively models what's known, what's missing, and what questions to ask next, paving the way for more intelligent EMR systems.
Robots often ignore your commands mid-task, but ReSteer offers a way to fix this by pinpointing and patching the "blind spots" in their training data.
Robots can now nimbly navigate complex, multi-floor environments without prior training, thanks to a new strategy that dynamically switches between exploration, recovery, and memory recall.
Agentic LLMs are surprisingly vulnerable: a new framework finds successful attacks in 84% of attempts by escalating prompt injection techniques across multiple stages.
RL agents can learn far more efficiently by dynamically distilling and leveraging past experiences that co-evolve with the agent's growing capabilities.
A multi-agent LLM system can fuse heterogeneous data sources to accurately classify building ages from satellite imagery, enabling better urban energy planning despite class imbalances in historical building cohorts.
LLMs can act as effective action-level supervisors in reinforcement learning, dramatically boosting the sample efficiency of SAC without sacrificing convergence guarantees.
Forget rigid physics engines, this badminton RL environment uses real player data to simulate realistic rallies and strategic gameplay.
Grounding LALM reasoning in diverse, reliability-weighted acoustic evidence blows away the competition in Audio Question Answering, proving that verifiable chains beat black boxes.
Simply prompting for test-driven development can *increase* regressions in AI coding agents; instead, focus on surfacing contextual information about which tests are most relevant.
LLMs in embodied environments get a massive boost from structured rules, with rule retrieval alone contributing +14.9 pp to single-trial success.
Forget prompt privacy – your LLM's responses are leaking *enterprise data*, and this paper shows how to quantify and control it.
Automating surgical patient triage with an LLM achieves 94% sensitivity, but discrepancies reveal more about clinical workflow gaps than AI errors.
Forget training wheels: GoalVLM lets multi-agent robots navigate to any object you describe, no pre-programmed categories needed.
Enterprise AI can achieve 50% token reduction and zero cross-entity leakage by implementing a shared, governed memory architecture for multi-agent workflows.
Current LLM agent safety benchmarks miss over 20% of unsafe behaviors: agents that pass the benchmark can still act unsafely.
Tool-using agents are failing in predictable ways, but a model-agnostic policy layer can measurably improve their safety and reliability, albeit with a clear utility tradeoff.
Forget complex multi-agent systems: Skele-Code's no-code interface slashes token costs by shifting agent involvement to code generation only, enabling subject matter experts to build agentic workflows directly.
Despite the ease of integrating ML cloud services, developers are widely misusing them, leading to quality and maintainability issues that MLmisFinder can now automatically detect with high accuracy.
Forget about chasing the perfect model architecture – this work suggests the real key to better AI agents lies in crafting more precise and complete specifications, since the implementation can always be re-generated.
Scene graphs plus LLMs let robots ask clarifying questions, boosting multi-agent task success by 15%.
LLMs armed with RAG can reconstruct cyberattacks with high precision and recall, but the best model for the job depends on your budget: DeepSeek V3 matches Claude Sonnet 4's accuracy at 1/15th the cost.
Achieve SOTA LLM alignment in complex technical domains with a fraction of the compute by distilling knowledge into smaller models using a hybrid reward mechanism and targeted data augmentation.
Fine-grained access control for websites can finally enable safe and reliable delegation of critical tasks to AI agents.
LLM-powered trading agents can still achieve a Sharpe ratio of 1.40 even when completely blindfolded to ticker symbols and company names, suggesting genuine understanding of market dynamics.
Retrieval-augmented LLM agents can learn to learn from experience, achieving significantly better generalization on unseen tasks by combining the strengths of fine-tuning and in-context retrieval.
A 4B parameter model can nearly match the privilege escalation performance of a state-of-the-art closed LLM like Claude Opus, while being fully local and 100x cheaper to run.
LLMs acting as semantic interfaces to our brains pose unprecedented ethical risks to mental autonomy and neurorights, demanding a new "second-order neuroethics."
LLMs can be economically aligned to real-world consumer preferences via post-training on transaction data, enabling more accurate and stable economic simulations.
Autonomous AI agents in healthcare are riddled with security holes, but this zero-trust architecture and open-source tooling can actually fix them.
You can now audit multi-agent LLM systems and trace responsibility for harmful outputs even without access to internal execution logs, thanks to a clever "self-describing text" technique.
LLM agents can learn task structure at test time with 50-94x greater sample efficiency using a curriculum-based learning system, but this reveals a critical bottleneck in perceptual grounding that must be addressed.
Forget prompt engineering: AgentFactory lets LLM agents self-evolve by accumulating and refining executable Python subagents, making task re-execution more reliable and efficient.
Grey-box fuzzing of LLM agents, guided by tool invocation sequences, reveals significantly more prompt injection vulnerabilities and malicious behaviors than black-box testing alone.
Forget static honeypots – LLMs and RL could make cyber deception dynamic and adaptive, turning the tables on attackers in contested environments.
Symphony's cognitively-inspired multi-agent system significantly boosts long-form video understanding by mimicking human reasoning, achieving state-of-the-art results on multiple benchmarks.
Existing threat models fail to capture the unique vulnerabilities of Model Context Protocol systems, but MCP-38 fills this gap with a comprehensive taxonomy of 38 distinct threat categories.
Forget collapsing videos into text – this hierarchical grid lets you zoom into any moment with lossless visual fidelity, unlocking logarithmic compute scaling for long-form video understanding.
Digital literacy gaps shrink as a browser extension slashes information retrieval time by 87% using an AI-powered tooltip that defines technical acronyms on demand.
Forget specialized tools: a standard Unix terminal and clever RL are all you need to beat much larger LLMs at code search.
Generalizing RL to continuous state and action spaces just got easier: this paper introduces an operator-theoretic framework and PPO-type algorithms that ditch finite-state assumptions.
LLMs can achieve state-of-the-art Alzheimer's detection by mimicking clinical cognitive assessment protocols, not just learning statistical patterns.
LLMs can navigate complex 3D environments more effectively and with far fewer tokens by using a hierarchical scene graph representation derived from omnidirectional sensor data.
LLMs can now generate Verilog code that's not just correct, but also optimized for real-world hardware constraints like power, performance, and area, thanks to a novel multi-agent system with evolving memory.
AdaZoom-GUI achieves SOTA GUI grounding by adaptively zooming in on small elements and refining ambiguous instructions, outperforming even larger models.
VLMs can now drive embodied agents to navigate complex environments with unprecedented efficiency, thanks to a novel framework that bridges the gap between 2D semantic understanding and 3D spatial reasoning.