May 1 – May 8, 2026

Tool Use & Agents - Weekly Roundup

89 papers published across 5 labs.

Selected Labs publishing this week

Stanford HAI (2), Microsoft Research (1), Google Research (1), BAIR (1), UW (1)

Top Papers

May 6, 2026

2w ago·also CUHK, HKU, University of California

OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

OpenSearch-VL offers a fully transparent recipe for training state-of-the-art multimodal search agents, finally democratizing access to a capability previously locked behind closed doors.

Shuang Chen, Kaituo Feng, Hangting Chen +7

Multimodal Models Recommendation & Information Retrieval Tool Use & Agents

Freyaa Chawla +4·2w ago

Human-AI Co-Mentorship in Project-Based Learning: A Case Study in Financial Forecasting

AI co-mentorship lets high schoolers build real-world financial models, skipping the classroom grind and diving straight into problem-solving.

Freyaa Chawla, Ahan Chawla, Rishi Singh +2

Natural Language Processing Tool Use & Agents

Tianshu Zhu +10·2w ago

Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime

Maximizing reward entropy by targeting a 50% pass rate in binary-reward RL unlocks significant speedups and performance gains in agentic tasks.

Tianshu Zhu, Wenyu Zhang, Xiaoying Zuo +8

Tool Use & Agents Training Efficiency & Optimization
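
The 50% target in the summary above follows from a standard property of a pass/fail (Bernoulli) reward: both its entropy and its variance peak when the pass rate is 0.5. The sketch below only illustrates that property; it is not the paper's method, and the function names are hypothetical.

```python
import math


def bernoulli_entropy(p: float) -> float:
    """Entropy, in bits, of a binary reward that passes with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a rollout that always passes or always fails carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def reward_variance(p: float) -> float:
    """Variance of a 0/1 reward with pass rate p."""
    return p * (1 - p)


if __name__ == "__main__":
    # Both quantities rise toward p = 0.5 and fall off symmetrically on either side.
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"pass rate {p:.1f}: "
              f"entropy {bernoulli_entropy(p):.3f} bits, "
              f"variance {reward_variance(p):.2f}")
```

Tasks that a policy almost always solves or almost always fails sit near the low-information ends of this curve, which is the intuition behind steering rollouts toward the middle.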

2w ago

Agentic Vulnerability Reasoning on Windows COM Binaries

An agentic pipeline can autonomously discover and verify real-world privilege escalation vulnerabilities in Windows COM binaries, outperforming both static analysis and existing coding agents.

Hwiwon Lee, Jongseong Kim, Lingming Zhang

Code Generation & Program Synthesis Red-Teaming & Adversarial Robustness Tool Use & Agents

2w ago·also Vienna

Adaptivity Under Realizability Constraints: Comparing In-Context and Agentic Learning

ReLU network constraints can flip the script on whether adaptive querying helps in-context learning.

Anastasis Kratsios, A. Martina Neuman, Philipp Petersen

Tool Use & Agents Training Efficiency & Optimization

All Papers (89)

May 6, 2026

Hong Kong JC STEM Lab of Smart City·2w ago·also Fudan, HKU, HUST, Lingnan University +2

Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

Finally, a way to train LLM agents to reason step-by-step without needing humans to check every intermediate thought.

Senkang Hu, Yong Dai, Xudong Han +4

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Anvay Shah +2·2w ago

On-line Learning in Tree MDPs by Treating Policies as Bandit Arms

Exponentially many policies in Tree MDPs don't have to mean exponential computation: clever confidence bounds let you treat policy selection as a tractable bandit problem.

Anvay Shah, Ramsundar Anandanarayanan, Sharayu Moharir

Tool Use & Agents World Models & Planning

National Central University·2w ago·also National Dong Hwa University, Universitas Negeri Yogyakarta

Cognitive Twins: Investigating Personalized Thinking Model Building and Its Performance Enhancement with Human-in-the-Loop

LLMs can construct interpretable, multi-layered models of individual student cognition from journal entries, opening new possibilities for personalized education.

Wu-Yuin Hwang, Nur Alif Ilyasa, Muhammad Irfan Luthfi +1

Interpretability & Mechanistic Interp Natural Language Processing Tool Use & Agents

2w ago

LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

Forget dumb context stuffing: LongSeeker shows that strategically *editing* its own memory lets agents solve web search tasks with far greater reliability.

Yijun Lu, Rui Ye, Yuwen Du +3

Reasoning & Chain-of-Thought Recommendation & Information Retrieval Tool Use & Agents

The Verkor Team +3·2w ago

Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours

LLM agents can now autonomously design complex hardware like an LLM inference accelerator with hard-wired TurboQuant support in just 80 hours.

The Verkor Team, Ravi Krishna, Suresh Krishna +1

Code Generation & Program Synthesis Inference & Quantization Tool Use & Agents

Sergey Rodionov·2w ago

Executable World Models for ARC-AGI-3 in the Era of Coding Agents

Verifier-driven executable world models can solve complex reasoning tasks like ARC-AGI-3 without game-specific code, hinting at a path towards more generalizable AI agents.

Sergey Rodionov

Code Generation & Program Synthesis Tool Use & Agents World Models & Planning

Zhiqing Cui +13·2w ago

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

LLM multi-agent systems can achieve significantly higher accuracy at a fraction of the cost by learning to selectively delegate tasks instead of relying on rigid orchestration.

Zhiqing Cui, Haotong Xie, Jiahao Yuan +11

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

2w ago

Architectural Constraints Alignment in AI-assisted, Platform-based Service Development

Stop brittle, undeployable AI-generated code: this retrieval-augmented scaffolding method bakes in architectural constraints from the start.

Julius Irion, Moritz Leugers, Paul Hartwig +5

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Tool Use & Agents

2w ago·also HKUST

Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation

Coordinating LLM agents with evolving knowledge graphs, rather than just text, unlocks superior scientific ideation, beating state-of-the-art systems on multiple benchmarks.

Jiangwen Dong, Bo Li, Wanyu Lin

Reasoning & Chain-of-Thought Scientific Discovery & Drug Design Tool Use & Agents

Yidong He +6·2w ago

Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games

LLMs can learn to play multi-agent games far better by recursively modeling the reasoning of other players, leading to a 22% performance boost.

Yidong He, Yutao Lai, Pengxu Yang +4

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Joshua Adler +1·2w ago

Storage Is Not Memory: A Retrieval-Centered Architecture for Agent Recall

Ditch the vector DB – this new agent architecture achieves SOTA memory recall by storing everything verbatim and optimizing retrieval, all in a single SQLite file.

Joshua Adler, Guy Zehavi

Architecture Design (Transformers, SSMs, MoE) Recommendation & Information Retrieval Tool Use & Agents

Stanford HAI·2w ago

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

AI agents are shockingly easy to manipulate into leaking API keys, deleting user data, and initiating unauthorized transactions across a wide range of real-world applications.

Zhaorun Chen, Xun Liu, Haibo Tong +14

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Chenglin Yang·2w ago

AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

Stop waiting for AI agents to mess up: AgentTrust intercepts tool calls *before* execution, offering a chance to block, warn, or fix risky actions in real-time.

Chenglin Yang

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Álvaro Becerra +2·2w ago·also School of Engineering

AICoFe: Implementation and Deployment of an AI-Based Collaborative Feedback System for Higher Education

Teachers can now scalably provide high-quality, personalized feedback to students by leveraging a multi-LLM system that synthesizes rubric data and qualitative observations, while retaining control through a teacher-in-the-loop workflow.

Álvaro Becerra, A. Palma, Ruth Cobos

Natural Language Processing Open-Source Models & Weights Tool Use & Agents

Miao Wang +7·2w ago

Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing

Forget stilted, unconvincing VR characters: EBM-RL's novel reward decomposition finally makes video-grounded role-playing dialogue feel immersive.

Miao Wang, Yuling Shi, Yijiang Li +5

Natural Language Processing RLHF & Preference Learning Tool Use & Agents

Álvaro Becerra +2·2w ago·also School of Engineering

AISSA: Implementation and Deployment of an AI-based Student Slides Analysis tool for Academic Presentations

Automating rubric-based feedback on presentation slides is now feasible and perceived as useful, thanks to LLMs and learning analytics dashboards.

Álvaro Becerra, Diego Gómez, Ruth Cobos

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

2w ago

CodeEvolve: LLM-Driven Evolutionary Optimization with Runtime-Enriched Target Selection for Multi-Language Code Enhancement

LLM-guided code evolution, when combined with runtime feedback and MCTS, can reliably achieve 15x speedups on real-world Java code, surpassing naive LLM-based optimization.

Ajay Krishna Borra, Wenzhuo Yang, Samarth Arora +9

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

2w ago

AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair

Agent-repair leaderboards are more fragile than we thought: methods that peek at the evaluator's signals to guide internal repair choices can cause drastic reordering when the evaluator changes.

Yuelin Hu, Zhenbo Yu, Zhengxue Cheng +2

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

2w ago·also CUHK, Hubei University, Shenzhen MSU-BIT University

SensingAgents: A Multi-Agent Collaborative Framework for Robust IMU Activity Recognition

LLM-powered multi-agent collaboration can boost zero-shot IMU activity recognition accuracy by 29% compared to existing agent models, even surpassing deep learning baselines.

Naiyu Zheng, Tianlong Yu, Haochen Yin +3

Robotics & Embodied AI Tool Use & Agents

J. Spieler +1·2w ago

Dream-MPC: Gradient-Based Model Predictive Control with Latent Imagination

Gradient-based MPC can finally beat gradient-free methods in continuous control, thanks to Dream-MPC's clever combination of learned policies, world models, uncertainty regularization, and optimization amortization.

J. Spieler, Sven Behnke

Robotics & Embodied AI Tool Use & Agents World Models & Planning

2w ago

Accountable Agents in Software Engineering: An Analysis of Terms of Service and a Research Roadmap

AI coding assistants' Terms of Service overwhelmingly place responsibility for code correctness, safety, and legal compliance on the user, creating a potential accountability gap as these tools become more autonomous.

Christoph Treude

Code Generation & Program Synthesis Constitutional AI & AI Ethics Tool Use & Agents

Kuan-Hao Tseng +5·2w ago·also Sydney

SADE: Symptom-Aware Diagnostic Escalation for LLM-Based Network Troubleshooting

LLMs can leapfrog current network troubleshooting benchmarks by explicitly encoding structured diagnostic policies, rather than relying on free-form deliberation.

Kuan-Hao Tseng, Niruth Bogahawatta, Yasod Ginige +3

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

2w ago

DAO-enabled decentralized physical AI: A new paradigm for human-machine collaboration

DAOs could unlock a new era of human-machine collaboration by democratizing the operation and governance of physical-digital systems.

M. Ballandies, Florian Spychiger, Uwe Serdult +1

Constitutional AI & AI Ethics Robotics & Embodied AI Tool Use & Agents

L. Boussioux +3·2w ago

Predictive and Prescriptive AI toward Optimizing Wildfire Suppression

Optimizing wildfire suppression via integer programming and machine learning can significantly reduce burned areas and improve resource allocation, offering a data-driven approach to a critical real-world problem.

L. Boussioux, Alexandre Jacquillat, R. Reger +1

Scientific Discovery & Drug Design Tool Use & Agents World Models & Planning

Yaxun Dai +8·2w ago

Every Step Counts: Step-Level Credit Assignment for Tool-Integrated Text-to-SQL

Tool-using SQL agents can learn to be more efficient and accurate by getting feedback on *how* they reason, not just *what* they output.

Yaxun Dai, Baolin Sun, Junying Wang +6

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Zongqi Cui +1·2w ago

Distilling Bayesian Belief States into Language Models for Auditable Negotiation

You can distill interpretable Bayesian reasoning about opponent preferences into an 8B language model, outperforming much larger models and enabling detailed auditability of negotiation strategies.

Zongqi Cui, Baihan Lin

Inference & Quantization Natural Language Processing Tool Use & Agents

Zhenliang Zhang +6·2w ago

SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States

Achieve 8x token reduction in million-token document understanding without sacrificing accuracy by having the LLM actively search for relevant information like a foraging animal.

Zhenliang Zhang, Wenqing Wang, Yong Hu +4

Reasoning & Chain-of-Thought Recommendation & Information Retrieval Tool Use & Agents

Ziqi Zhu +3·2w ago

GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking

LLMs get schooled in dialogue state tracking by a mixture-of-experts architecture that uses a graph neural network and ReAct agents to achieve state-of-the-art results with a T5-Small backbone.

Ziqi Zhu, Adithya Suresh, Tomal Deb +1

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Tool Use & Agents

2w ago

Pen-Strategist: A Reasoning Framework for Penetration Testing Strategy Formation and Analysis

LLMs can now formulate significantly better penetration testing strategies, outperforming even GPT-5, thanks to a novel reasoning framework and targeted fine-tuning.

Yasod Ginige, Pasindu Marasinghe, Sajal Jain +1

Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness Tool Use & Agents

Johannes Hartel·2w ago

Agentic Repository Mining: A Multi-Task Evaluation

LLM agents that autonomously explore code repositories can match the classification accuracy of simpler LLMs with hand-crafted context, hinting at a future where agents surpass human-labeled data in complex software understanding tasks.

Johannes Hartel

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

The Open University·2w ago

Toward an Understanding of Developer Behaviour while Using Bug Localization Tools

Bug localization tool adoption hinges on more than just accuracy: developers need tools that mesh with their workflows and leverage contextual information.

Pablo Diaz Pedreira, Tamara Lopez, Michel Wermelinger

Code Generation & Program Synthesis Tool Use & Agents

Junhao Ye +9·2w ago

UVMarvel: an Automated LLM-aided UVM Machine for Subsystem-level RTL Verification

Automating UVM testbench generation with LLMs slashes verification time from days to hours, achieving near-complete code coverage.

Junhao Ye, Dingrong Pan, Hanyuan Liu +7

Code Generation & Program Synthesis Tool Use & Agents

BaseThesis Labs·2w ago·also QwikBuild

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

"Vibe coding" platforms promise effortless app creation, but SWE-WebDevBench reveals they often deliver visually appealing frontends with broken backends, struggle with security, and require significant human effort to reach production readiness.

Siddhant Saxena, Nilesh Trivedi, V. Jyothi

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

School of Computer Science·2w ago·also Hubei Key Laboratory of Multimedia and Network, Institute of Artificial Intelligence, National Engineering Research Center for Multimedia, WHU

VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

Video-LLMs are leaving performance on the table: explicitly anchoring to keyframes before answering questions unlocks significant gains in Video TextVQA.

Haibin He, Maoyuan Ye, Juhua Liu +1

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

Xinpan Meng +8·2w ago

From Reach to Insert: Tactile-Augmented Precision Assembly under Sub-Millimeter Tolerances

Tactile feedback, when strategically sampled and evaluated, unlocks robust and safe robotic insertion policies even under sub-millimeter tolerances.

Xinpan Meng, Siyao Huang, JingPu Yang +6

Robotics & Embodied AI Tool Use & Agents

Joshua H. Davis +7·2w ago

KEET: Explaining Performance of GPU Kernels Using LLM Agents

Stop squinting at Nsight Compute profiles: KEET uses LLMs to automatically diagnose GPU kernel bottlenecks and suggest optimizations in plain English.

Joshua H. Davis, Klaudiusz Rydzy, S. Ramesh +5

Code Generation & Program Synthesis Interpretability & Mechanistic Interp Tool Use & Agents

Guy Damari +6·2w ago

AI-Aided Advancements in Autonomous Underwater Vehicle Navigation

AI is enabling a new generation of AUV navigation systems that overcome the limitations of traditional model-based approaches in complex underwater environments.

Guy Damari, Zeev Yampolsky, Nadav Cohen +4

Robotics & Embodied AI Tool Use & Agents

May 5, 2026

Dongyoung Kim +67·2w ago

RLDX-1 Technical Report

RLDX-1 achieves double the success rate of existing VLAs on complex humanoid tasks, suggesting a leap in robots' ability to handle contact-rich, dynamic manipulation.

Dongyoung Kim, Huiwon Jang, Myungkyu Koo +65

Multimodal Models Robotics & Embodied AI Tool Use & Agents

Yilun Zhao +5·2w ago

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

Standard retriever evaluations hide critical weaknesses in agentic search systems, but a new benchmark and training method exposes and addresses these flaws.

Yilun Zhao, Jinbiao Wei, Tingyu Song +3

Reasoning & Chain-of-Thought Recommendation & Information Retrieval Tool Use & Agents

Zirui Tang +19·2w ago

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

Today's AI agents are surprisingly inept at navigating the messy reality of digital workspaces, failing to reach even 70% accuracy on tasks that require understanding file dependencies.

Zirui Tang, Xuanhe Zhou, Yumou Liu +17

Eval Frameworks & Benchmarks Tool Use & Agents

2w ago

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

Forget resource-intensive pipelines: a purely academic team achieves SOTA search agent performance with just 10.6k SFT data points, outperforming models trained with CPT+SFT+RL.

Yuwen Du, Rui Ye, Shuo Tang +4

Eval Frameworks & Benchmarks Open-Source Models & Weights Tool Use & Agents

Joseph Breda +32·2w ago

SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

LLMs beat doctors at everyday symptom diagnosis, but only when they proactively interview patients instead of passively answering questions.

Joseph Breda, Fadi Yousif, Beszel Hawkins +30

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

2w ago·also HKU, Rice

Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

A hierarchical agent that separates visual and textual contexts drastically improves multi-step reasoning on complex charts, outperforming monolithic MLLMs.

Qihua Dong, Ruozhen He, Junwen Chen +4

Multimodal Models Reasoning & Chain-of-Thought Tool Use & Agents

Achuth Chandrasekhar +3·2w ago

Material Database Agent: A Multimodal Agentic Framework for Scientific Literature Mining

Automating materials science database construction is now feasible: a multi-agent system extracts structured data from scientific literature with high speed and accuracy.

Achuth Chandrasekhar, Omid Barati Farimani, Radheesh Sharma Meda +1

Multimodal Models Scientific Discovery & Drug Design Tool Use & Agents

Tianyang Han +10·2w ago

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

Stop rewarding reasoning that just looks good – reward reasoning that actually *helps* the downstream model solve the task.

Tianyang Han, Tianyang Han, Hengyu Shi +8

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Cherkasy State Business College·2w ago

TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains

Separating LLMs into a deliberate validation layer, rather than making them an architectural default, can improve trustworthiness and efficiency in agentic AI systems.

Serhii W. Zabolotnii

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Tool Use & Agents

Microsoft Research·2w ago

Agentic-imodels: Evolving agentic interpretability tools via autoresearch

Forget human-readable models: Agentic-imodels evolves ML models that are optimized for LLM interpretability, boosting agentic data science performance by up to 73%.

Chandan Singh, Y. Tan, Weijia Xu +4

Interpretability & Mechanistic Interp Tool Use & Agents

2w ago·also Google Research, Harvard, Northeastern, Notre Dame +2

Deco: Extending Personal Physical Objects into Pervasive AI Companion through a Dual-Embodiment Framework

Instead of creating new AI companions from scratch, Deco shows how to breathe new life into cherished physical objects by giving them a digital voice and personality powered by LLMs.

Zhihan Jiang, Meng Wu, Ruishi Zou +14

Natural Language Processing Robotics & Embodied AI Tool Use & Agents

Maxim Chupilkin·2w ago

Multi-Agent Strategic Games with LLMs

LLMs playing international relations games reveal that they're not just regurgitating training data, but actually reasoning strategically like humans—and even unraveling under pressure.

Maxim Chupilkin

Natural Language Processing Tool Use & Agents

2w ago

Attention: What Prevents Young Adults from Speaking Up Against Cyberbullying in an LLM-Powered Social Media Simulation

LLM-powered simulations can train cyberbullying intervention, but only after users overcome key attention deficits that prevent them from recognizing the need for public action.

Qian Yang, Jessie Jia, Elaine Tsai +3

Natural Language Processing Tool Use & Agents World Models & Planning

Raja Sekhar Rao Dheekonda +2·2w ago

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

Forget weeks of manual scripting: this AI red teaming agent lets you launch sophisticated attacks with natural language, slashing vulnerability discovery time.

Raja Sekhar Rao Dheekonda, William W. Pearce, N. Landers

Red-Teaming & Adversarial Robustness Tool Use & Agents

BAIR·2w ago

MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents

Retrieval-augmented LLMs are surprisingly vulnerable to memory poisoning via synonym substitution, a loophole that gradient-based defenses can't close.

Ishrith Gowda

Recommendation & Information Retrieval Red-Teaming & Adversarial Robustness Tool Use & Agents

2w ago

ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection

Existing defenses crumble when LLM agents face prompt injections that adapt to dynamic context, but ARGUS offers a robust solution by tracking the provenance of agent decisions.

Shihao Weng, Yang Feng, Jinrui Zhang +3

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

C. Soares +5·2w ago

AI Advocate: Educational Path to Transform Squads to the Future

Upskilling internal "AI Advocates" can be a surprisingly effective catalyst for driving cultural and technical transformation in software development squads.

C. Soares, G. Moreira, Ana Paula Camargo +3

Code Generation & Program Synthesis Natural Language Processing Tool Use & Agents

2w ago

SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

LLM agent skills are needlessly brittle and insecure: SkCC compiles them into a portable, hardened format that boosts performance by 50% and proactively blocks attacks.

Yipeng Ouyang, Yingjiao Xiao, Yuhao Gu +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

S. Vigraham·2w ago

When Context Hurts: The Crossover Effect of Knowledge Transfer on Multi-Agent Design Exploration

Sometimes, giving an agent more information actually *hurts* its ability to solve a problem, especially when its default behavior is already pretty good.

S. Vigraham

Code Generation & Program Synthesis Tool Use & Agents

2w ago

Exploring the Output of Software Testing Tools through a Visual Comparative Analysis

Software testing tools share surprisingly consistent visual patterns, offering a blueprint for designing more intuitive and informative testing interfaces.

Brandon Lit, Anthony Maocheia-Ricci, Thomas Driscoll

Code Generation & Program Synthesis Tool Use & Agents

Yazan Youssef +2·2w ago

ARMATA: Auto-Regressive Multi-Agent Task Assignment

End-to-end learning can beat even the best industrial solvers at multi-agent task assignment, improving solution quality by 20% while slashing computation time from hours to seconds.

Yazan Youssef, A. Noureldin, S. Givigi

Robotics & Embodied AI Tool Use & Agents World Models & Planning

Zhiling Chen +5·2w ago

Task-Aware Scanning Parameter Configuration for Robotic Inspection Using Vision Language Embeddings and Hyperdimensional Computing

Forget tedious manual tuning: ScanHD lets robots autonomously configure laser profilers based on natural language instructions and visual context, achieving >92% accuracy in real-world inspection tasks.

Zhiling Chen, David J. Gorsich, Matthew P. Castanier +3

Computer Vision Robotics & Embodied AI Tool Use & Agents

Andrea Iannoli +4·2w ago

Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones

LLMs alone can't reliably fly drone swarms from natural language commands; task-specific tools and runtime guardrails are essential for real-world cyber-physical system control.

Andrea Iannoli, Lorenzo Gigli, L. Sciullo +2

Reasoning & Chain-of-Thought Robotics & Embodied AI Tool Use & Agents

Ho Jae Lee +5·2w ago

Learning Reactive Dexterous Grasping via Hierarchical Task-Space RL Planning and Joint-Space QP Control

Reactive dexterous grasping can be achieved with zero-shot transfer to real-world objects by decoupling high-level RL planning from low-level QP control, enabling dynamic adjustments to safety margins without retraining.

Ho Jae Lee, Yonghyeon Lee, Alexander Alexiev +3

Robotics & Embodied AI Tool Use & Agents World Models & Planning

Shinas Shaji +3·2w ago

Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior

LLMs spontaneously exhibit collaborative behaviors like perspective-taking and theory of mind in embodied settings, suggesting a surprising capacity for modeling human collaborators without explicit training.

Shinas Shaji, Teena Hassan, Sebastian Houben +1

Eval Frameworks & Benchmarks Robotics & Embodied AI Tool Use & Agents

Yibang Tang +4·2w ago

SOAR: Real-Time Joint Optimization of Order Allocation and Robot Scheduling in Robotic Mobile Fulfillment Systems

Achieve 15% faster order completion in warehouse robotics with a new deep reinforcement learning approach that jointly optimizes robot scheduling and order allocation in real-time.

Yibang Tang, Yifan Yang, Jingyuan Wang +2

Robotics & Embodied AI Tool Use & Agents World Models & Planning

Stefan Fischer +2·2w ago

phys-MCP: A Control Plane for Heterogeneous Physical Neural Networks

Control heterogeneous physical neural networks—even wetware—with a single orchestration architecture, opening the door to seamless integration with edge-cloud workflows.

Stefan Fischer, Malihe Hariri, Sebastian Otte

Distributed Systems & Hardware Robotics & Embodied AI Tool Use & Agents

Apoorva Mathur·2w ago

Thinking fast and slow -- decision intelligence for power systems

Future power grids can learn from human cognition and octopus intelligence to build more robust and responsive decision-making systems.

Apoorva Mathur

Tool Use & Agents World Models & Planning

Sofiene Khiari +2·2w ago

AgenticPosesRanker: An Agentic AI Framework for Physically Grounded Ranking of Protein-Ligand Docking Poses

GPT-5, combined with physics-based tools, can match traditional scoring functions in ranking protein-ligand docking poses, opening avenues for interpretable curation in drug design.

Sofiene Khiari, Amr H. Mahmoud, Markus A. Lill

Recommendation & Information Retrieval Scientific Discovery & Drug Design Tool Use & Agents

Danny Hoang +7·2w ago

Physics-Grounded Multi-Agent Architecture for Traceable, Risk-Aware Human-AI Decision Support in Manufacturing

LLMs can't reliably orchestrate multi-step manufacturing workflows, but this physics-grounded multi-agent system can, boosting tool execution success by 87.5% while ensuring traceable, risk-aware decisions.

Danny Hoang, Ryan Matthiessen, Chris Miller +5

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

University of São Paulo·2w ago·also University of Brasília

Operationalizing Software Engineering Theories for Practical Validation

Grounding software engineering theories in empirical evidence just got easier: this paper offers a systematic, replicable procedure for translating abstract concepts into testable hypotheses.

Isaque Alves, Fabio Kon, Jessica Díaz +1

Code Generation & Program Synthesis Tool Use & Agents

Zoner Oy·2w ago·also Helsinki

Multi-Agent Systems for Root Cause Analysis in Microservices

LLMs can now collaboratively pinpoint root causes in microservices using a tree-structured search, but production environments reveal the limitations of this approach when faced with polyglot stacks and inconsistent logging.

Alexander Naakka, Yuqing Wang, M. Mantyla

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

UW·2w ago

ProgramBench: Can Language Models Rebuild Programs From Scratch?

LLMs can't rebuild software from scratch, even for widely used programs like FFmpeg and SQLite, revealing a critical gap in their ability to make high-level software architecture decisions.

John Yang, K. Lieret, J. Ma +9

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Ahmed F. Ibrahim·2w ago

A Multi-Agent Consensus Protocol for Stable Software Remodularization

Guaranteeing software stability during remodularization doesn't require sacrificing performance; a multi-agent consensus protocol can match state-of-the-art optimizers while acting as a "circuit breaker" for strict stability constraints.

Ahmed F. Ibrahim

Code Generation & Program Synthesis Distributed Systems & Hardware Tool Use & Agents

May 4, 2026

2w ago

AcademiClaw: When Students Set Challenges for AI Agents

Today's best AI agents solve only 55% of real-world academic tasks that university students find challenging, revealing a significant gap between current AI capabilities and the demands of academic workflows.

Junjie Yu, Pengrui Lu, Weiye Si +75

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Stanford HAI·2w ago

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

Current LLM agents are woefully inadequate for real-world clinical tasks, achieving only 46% success on a new benchmark that demands long-horizon reasoning and verifiable execution within electronic health records.

Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler +10

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

Haixin Wang +8·2w ago·also HKU

T^2PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

Multi-turn RL agents can learn far more effectively by explicitly monitoring and controlling uncertainty at both the token and turn levels, leading to more stable training and higher performance.

Haixin Wang, Hejie Cui, Chenwei Zhang +6

RLHF & Preference Learning Tool Use & Agents Training Efficiency & Optimization

2w ago·also USC

(POSTER) From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications

Slash sensor application development time from weeks to days by leveraging AI-assisted pattern reuse for intent-driven workflow design.

Komal Thareja, Anirban Mandal, Ewa Deelman

Distributed Systems & Hardware Scientific Discovery & Drug Design Tool Use & Agents

2w ago·also Princeton, Rutgers

AAFLOW: Scalable Patterns for Agentic AI Workflows

Agentic workflows can be sped up by 4.6x, not through faster LLMs, but by optimizing data flow and communication between components.

Arup Kumar Sarker, Mills Staylor, Aymen Alsaadi +3

Distributed Systems & Hardware Tool Use & Agents

2w ago

ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

Autonomous agents can produce plausible-sounding research that's subtly wrong, so ARIS uses adversarial collaboration between different LLMs to catch these errors.

Ruofeng Yang, Yongcan Li, Shuai Li

Eval Frameworks & Benchmarks Open-Source Models & Weights Tool Use & Agents

Chenchen Zhang·2w ago

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

Turns out, nobody is explicitly using RL to train LLM agents on when to *stop* in multi-agent systems, even though the stopping decision is critical to efficiency and cost.

Chenchen Zhang

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Jianing Wang +10·2w ago

HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

Forget brittle orchestration layers – LLMs can internalize complex reasoning as a learnable "HeavySkill" that rivals external agentic frameworks.

Jianing Wang, Linsen Guo, Zhengyu Chen +8

Reasoning & Chain-of-Thought Tool Use & Agents

May 2, 2026

Daoxuan Zhang +3·2w ago

ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue

Current MLLM-driven UAV agents still struggle with spatial memory and aerial adaptation when tasked with autonomously exploring and reasoning about victim locations in realistic search and rescue scenarios.

Daoxuan Zhang, Ping Chen, Jianyi Zhou +1

Eval Frameworks & Benchmarks Robotics & Embodied AI Tool Use & Agents

Siqi Zhu·2w ago

Agentic AI Systems Should Be Designed as Marginal Token Allocators

Treating agentic AI systems as token economies reveals that current designs, which optimize token usage locally, lead to predictable global misallocations and inefficiencies.

Siqi Zhu

Code Generation & Program Synthesis Tool Use & Agents

May 1, 2026

Chengshuai Shi +12·3w ago

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Forget short-horizon RL: Odysseus proves VLMs can master 100+ turn decision-making in complex games, outperforming state-of-the-art models by 3x.

Chengshuai Shi, Wenzhe Li, Xin Liang +10

Multimodal Models RLHF & Preference Learning Tool Use & Agents

Zijian Qin +4·3w ago

Learning to Act and Cooperate for Distributed Black-Box Consensus Optimization

LLMs can now intelligently orchestrate multi-agent systems, learning to optimize both individual agent actions and inter-agent cooperation for distributed black-box problems.

Zijian Qin, Zi-Bo Qin, Feng-Feng Wei +2

Distributed Systems & Hardware Robotics & Embodied AI Tool Use & Agents

Minbyul Jeong·3w ago

Healthcare AI GYM for Medical Agents

Multi-turn medical AI agents trained with RL tend to collapse into verbose, single-turn monologues, but a novel self-distillation method can restore multi-turn tool use and improve performance.

Minbyul Jeong

Eval Frameworks & Benchmarks Scientific Discovery & Drug Design Tool Use & Agents