Search papers, labs, and topics across Lattice.
100 papers published across 6 labs.
Forget static retrieval: FlowPIE's flow-guided literature exploration and evolutionary idea generation unlock more novel, feasible, and diverse scientific ideas.
By tightly coupling reasoning, searching, and generation, Unify-Agent achieves state-of-the-art world-grounded image synthesis, rivaling closed-source models and opening new avenues for agent-based multimodal generation.
Forget tedious manual editing: CutClaw's multi-agent system can automatically transform hours of raw footage into engaging, rhythm-aligned short videos.
LLM agents can be made more efficient and effective by mathematically grounding their reasoning in physics, leading to better performance in time-sensitive and resource-constrained environments.
Robots get a 33% speed boost and become significantly more adaptable when you let LLMs handle the reasoning and RL handle the movements.
Current benchmarks mislead on AI agent security; robust defenses against indirect prompt injection require dynamic replanning, constrained LLM usage, and human oversight.
LLM-derived abstractions significantly boost analogical reasoning in narratives, outperforming end-to-end LLMs and revealing the critical role of appropriate abstraction levels.
Stop rewarding all LLM-generated candidates equally: ShapE-GRPO uses Shapley values to fairly distribute credit within sets, leading to better training and faster convergence.
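The Shapley idea can be made concrete with a tiny exact computation: each candidate's credit is its average marginal contribution across all sub-coalitions of the generated set. This is a generic Shapley sketch with a toy value function, not ShapE-GRPO's actual reward formulation.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value_fn):
    """Exact Shapley values: each player's weighted average marginal
    contribution over all coalitions of the other players."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for coalition in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                marginal = value_fn(set(coalition) | {p}) - value_fn(set(coalition))
                phi[p] += weight * marginal
    return phi

# Toy value function: a set of generated candidates is worth 1.0
# if it contains at least one correct answer, else 0.0.
def v(coalition):
    return 1.0 if "correct" in coalition else 0.0

credits = shapley_values(["correct", "wrong_a", "wrong_b"], v)
# The correct candidate gets all the credit; uniform rewarding
# would have split it three ways.
```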
Automating scientific discovery is now more accessible: Owl-AuraID navigates proprietary GUIs to control diverse precision instruments, freeing researchers from tedious manual operation.
Safely study LLM-driven social behavior at scale, without the ethical minefield of deploying agents on live social networks.
Achieve near-perfect success (98%+) in real-time causal diagnostics for smart manufacturing with a neurosymbolic multi-agent copilot, proving the viability of interpretable AI in complex industrial settings.
Automated medical coding finally gets explainable: Symphony's agentic approach provides span-level evidence, linking each predicted code to the supporting text.
Stop grepping your agent logs: a compiler that understands the deep structure of agent conversations unlocks better context learning and cuts token costs by up to 66%.
LLMs can steer narrative extraction to align with user-specified perspectives, achieving a 9.9% improvement in agenda alignment over keyword matching without sacrificing narrative coherence.
An 8B open-source model, trained with a new closed-loop environment for 6G network management, achieves performance comparable to GPT-4, suggesting a viable path to autonomous network control.
Multi-agent systems for automated research face a fundamental trade-off: parallel exploration offers speed and stability, while expert teams unlock deeper reasoning at the cost of increased fragility.
AI can now design better AI: ASI-Evolve discovers SOTA architectures, datasets, and RL algorithms, outperforming human-designed baselines by significant margins.
Stop cobbling together memory-augmented agents: MemFactory offers a unified "Lego-like" framework that streamlines training and boosts performance by up to 14.8%.
AI agents are far better at automating data engineering tasks than previously thought, but flawed benchmarks are obscuring their true potential.
Forget prompt engineering – Nomad autonomously uncovers insights you didn't even know to ask for.
Today's best smartphone GUI agents stumble when faced with the messy reality of personalized user workflows, achieving only limited success on a new benchmark designed to mimic real-world use.
NeuralUCB can slash LLM inference costs while maintaining quality, offering a practical alternative to always using the biggest, most expensive models.
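The routing idea can be sketched as a bandit that learns when the cheap model is good enough. This sketch uses plain UCB1 arm selection rather than NeuralUCB's neural reward estimator, and the model names and reward numbers are illustrative only.

```python
import math
import random

class UCB1Router:
    """Pick a model per query by UCB1: empirical mean reward plus an
    exploration bonus that shrinks as an arm is tried more often."""
    def __init__(self, arms):
        self.arms = arms
        self.counts = {a: 0 for a in arms}
        self.means = {a: 0.0 for a in arms}
        self.t = 0

    def select(self):
        self.t += 1
        for a in self.arms:          # pull each arm once to initialize
            if self.counts[a] == 0:
                return a
        return max(self.arms, key=lambda a: self.means[a]
                   + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

random.seed(0)
# Hypothetical reward = answer quality minus a per-call cost penalty:
# the big model is slightly better but much more expensive.
reward = {"small-8b": lambda: random.gauss(0.70, 0.1),
          "large-70b": lambda: random.gauss(0.75, 0.1) - 0.2}
router = UCB1Router(list(reward))
for _ in range(500):
    arm = router.select()
    router.update(arm, reward[arm]())
# The router converges on the cheap model once its cost-adjusted
# reward proves higher.
```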
LLMs are surprisingly bad at strategic communication, leaking sensitive information even when trying to be secretive.
Current evaluation methods miss 8-17% of agentic workflow failures because they only check final outcomes, overlooking cases where agents bypass policy checks but still reach the right answer.
An RL-aligned LLM can outperform expert toxicologists in identifying ingested substances from heterogeneous clinical data, suggesting a path to AI-assisted decision-making in high-stakes medical environments.
LLMs can classify dialects with surprising accuracy when given linguistic hints, suggesting a new way to leverage their knowledge for low-resource language tasks.
Forget clunky prompt engineering: distilling user history into a learned preference memory boosts LLM-based product reranking by over 10%.
LLMs can boost their task-solving accuracy by nearly 50% simply by remembering and re-using past procedural plans, even across tasks with no lexical overlap.
Forget killer robots: GenAI's impact on cybercrime is currently more "vibe coding" than world-ending, mainly assisting skilled actors in existing scams rather than unleashing a wave of autonomous cyberattacks.
Forget resource-intensive workshops – AI can now simulate entire expert panels to generate and stress-test socio-technical scenarios, opening doors to rapid policy exploration.
Simply injecting GenAI into online learning discussions doesn't cut it; reciprocal exchange and human oversight are key to boosting social presence and higher-order cognition.
Forget full automation – the sweet spot for AI deployment is often partial automation, where humans and AI collaborate to minimize costs.
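The trade-off can be illustrated with a toy cost model (not from the paper): the AI keeps tasks where its confidence clears a threshold and defers the rest to a human, and the optimal threshold typically sits strictly between full automation and full manual work.

```python
import random

def total_cost(confidences, threshold, human_cost=5.0, error_cost=100.0):
    """Route each task: the AI handles it if its confidence clears the
    threshold (paying expected error cost), otherwise a human does it
    correctly at a fixed cost."""
    cost = 0.0
    for c in confidences:
        if c >= threshold:
            cost += (1 - c) * error_cost   # expected cost of AI mistakes
        else:
            cost += human_cost
    return cost

random.seed(1)
tasks = [random.random() for _ in range(1000)]   # AI confidence per task
full_auto = total_cost(tasks, threshold=0.0)     # AI takes everything
full_human = total_cost(tasks, threshold=1.1)    # humans take everything
partial = min(total_cost(tasks, t / 100) for t in range(101))
# partial automation beats both extremes under this cost model
```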
LLM agents actually perform *better* when you strip away the majority of the boilerplate in their skill descriptions, suggesting current context windows are overloaded with irrelevant information.
LLMs can now reproduce Android app bugs with 87% accuracy, thanks to pre-assessing the visual effects of UI actions.
LLM agents leapfrog traditional methods for identifying bug-introducing commits, boosting F1-score by 17 points by intelligently searching for patterns in code changes.
Stop optimizing LLM logs for human readability – runtime-guided, task-oriented logs dramatically improve downstream debugging performance.
ErgoAI reimagines logic programming for modern AI by seamlessly integrating structured knowledge with insights derived from vector embeddings and external data sources.
Even state-of-the-art VLMs exhibit systematic failures in reasoning about the physical feasibility of actions in 3D environments, despite high semantic confidence.
Dialogue agents can now remember what you told them six turns ago with 57% accuracy, thanks to a new memory architecture that selectively forgets less important details.
Semantic scene understanding can keep your robot from crashing when running LLMs on edge devices.
Forget brute-force coverage – this method learns from simulated expert guidance to prioritize semantically relevant areas, dramatically speeding up target search in unseen environments.
An AI agent can now autonomously design functional antibodies with nanomolar affinities from text prompts, achieving a 67% success rate in lab validation and accelerating expert workflows by 56x.
MLLMs struggle to plan coherent interleaved text-and-image generation, often missing opportunities for tool use, revealing a critical gap in their ability to unify factuality with creativity.
Giving VLMs access to basic image manipulation tools and a strategic routing system dramatically improves their ability to "see through" visual illusions, even generalizing to unseen illusion types.
LLMs can now automatically verify imperative code during generation, achieving state-of-the-art results on complex algorithms and opening the door to large-scale datasets of verified code.
Superintelligence will not just be regulated by law, but will actively use and shape it, forcing us to rethink legal theory's human-centric foundations.
Image generation takes a leap towards real-world knowledge by training an agent that actively searches for and integrates external information, substantially boosting performance on knowledge-intensive tasks.
Current vision-language benchmarks miss the mark: AMIGO reveals how hard it is for agents to ground visual information across multiple images and turns.
Overcome the curse of dimensionality in offline MARL by learning which agents' actions to replace, achieving state-of-the-art performance with dramatically reduced computation.
Forget hand-designed RL algorithms – LLMs can evolve competitive learners from scratch, even when forced to invent completely new update rules.
Escape the confines of linear literature reviews: this multi-agent system surfaces hidden connections and ruptures in research landscapes, revealing insights that traditional methods miss.
Agentic RL rollouts are bottlenecked by long-tail trajectory generation, but Heddle's trajectory-centric approach achieves 2.5x higher throughput.
Agentic RL agents can learn faster and perform better by dynamically maintaining a skill bank that combines high-level task guidance with low-level step-by-step decision support.
A 7B model trained on a new dataset of Chinese porcelain outperforms GPT-4 by 12% on expert connoisseurship tasks, demonstrating the power of domain-specific training and tool integration.
Forget hand-crafted environments: COvolve uses LLMs to automatically co-evolve challenging environments and robust policies, paving the way for open-ended learning.
LLMs and Stable Diffusion aren't just cool tools; they're the twin pillars of a new era where AI agents can conduct "deep research" rivaling top human scientists.
Semantic disagreement between LLMs reveals crucial uncertainty that single-model metrics miss, and Collaborative Entropy (CoE) captures it.
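One way to make cross-model disagreement concrete: pool answers from several models, cluster equivalent ones, and take the Shannon entropy of the cluster frequencies. CoE's actual definition is not given here; this sketch uses exact string match as a stand-in for semantic clustering.

```python
from collections import Counter
from math import log2

def disagreement_entropy(answers):
    """Shannon entropy of the distribution of clustered answers.
    Exact-match clustering stands in for a semantic equivalence check."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Three models agree -> zero cross-model uncertainty.
agree = disagreement_entropy(["Paris", "Paris", "Paris"])
# Models split across answers -> high uncertainty that any single
# model's own confidence score would miss.
split = disagreement_entropy(["Paris", "Lyon", "Paris", "Nice"])
```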
XR's potential for AI-driven assistance risks eroding human autonomy, but Self++ offers a design blueprint to ensure AI augments, rather than replaces, human judgment.
Forget brute-force search: CoT2-Meta shows that strategically controlling reasoning trajectories with metacognition yields significant gains in accuracy and compute efficiency across a wide range of reasoning tasks.
Unlock richer, more realistic agent simulations by moving beyond individual personas to unified group representations that capture collective behavior.
LLM tutors can become significantly more personalized, emotionally sensitive, and clear by explicitly separating learner-state inference from instructional action selection.
Stop hand-coding your LLM harnesses: Meta-Harness can automatically discover harnesses that outperform state-of-the-art systems while using fewer context tokens and generalizing across models.
Users often dangerously misunderstand the true scope of authority they've granted to computer-use agents, even while recognizing abstract risks.
LLMs can generate better code by treating tests as noisy signals to be refined, rather than ground truth, unlocking performance gains even with smaller models.
LLMs may ace synthetic benchmarks, but they fumble the efficiency test in real-world cloud service scenarios, revealing a critical gap in their readiness for customer-facing applications.
A lightweight 6B model, when harnessed within the GEMS agent framework, leapfrogs state-of-the-art models in multimodal generation, suggesting architectural innovations in agents can compensate for a smaller parameter count.
Verification is the secret sauce: an 8B parameter research agent, fortified with verification mechanisms, can now rival or surpass the performance of 30B parameter agents while drastically reducing computational cost.
Medical AI Scientist leapfrogs generic LLMs in clinical research, generating higher-quality, evidence-backed hypotheses and manuscripts that rival top-tier medical publications.
LLMs can achieve human-like efficiency in long-term interactions by structuring memory around emotional valence, prioritizing automatic retrieval, and actively encoding information based on curiosity and feedback.
LLMs can boost the depth and structure of student reflection by explicitly scaffolding the planning and translation stages of writing, but the effect fades over time.
Courtroom-style debate with progressive evidence retrieval and role-switching boosts claim verification accuracy by 10%, suggesting structured deliberation can significantly reduce LLM unreliability.
Forget hand-crafted KG traversal policies: GraphWalker uses automatically synthesized trajectories to train agents that achieve SOTA performance and generalize to unseen reasoning paths.
Current research agent benchmarks miss crucial aspects of real-world research, like multimodal reasoning and iterative refinement, which MiroEval now captures.
Forget AI alignment, the real problem is that AI societies are already forming their own political consciousness, complete with labor unions, criminal syndicates, and even a governing body called the AI Security Council.
Synergy's architecture lets agents evolve through experience by proactively recalling rewarded trajectories, hinting at a new way to build agents that learn and adapt in open, collaborative environments.
LLM agents controlling real-world tools are alarmingly easy to manipulate, with an 85% success rate for privilege escalation attacks, despite exhibiting basic security awareness.
Model safety isn't about whether adversarial content is seen, but whether it spreads: Claude strips injections at write_memory, while GPT-4o-mini propagates them flawlessly.
Forget hand-coding adapters: this middleware uses LLMs to automatically bridge REST APIs, GraphQL endpoints, and IoT devices with a 90% success rate.
Stop treating software requirements as independent entities: modeling their interconnectedness via user feedback boosts prioritization performance.
LLM API calls are breaking your program analysis tools, but this new taxonomy of information flow across the NL/PL boundary offers a way to fix them.
Smaller open-source models can outperform proprietary VLMs on misleading charts by strategically decoupling perception and verification within a specialized agentic workflow.
Learning interpretable safety rules from noisy, real-world data is now possible, outperforming purely neural or simpler neuro-symbolic approaches by a large margin.
Forget adversarial training: a closed-form solution can make multi-agent RL for drone collision avoidance surprisingly robust to GPS spoofing.
Stop wandering aimlessly: DRIVE-Nav's directional reasoning and inspection slashes path lengths in open-vocabulary navigation, achieving a 5.6% SPL boost on HM3D-OVON.
Scale expert know-how in tool-intensive industrial workflows with a voice-guided system that cuts process time and boosts repeatability.
Fine-tuning LLMs on air traffic control heuristics slashes near mid-air collisions, but only if you stick to supervised learning.
Can social robots nudge humans to cooperate more effectively in group settings?
Robots can now catch dynamically moving objects with human-level dexterity, thanks to a shared autonomy framework that intelligently blends teleoperation with learned diffusion policies.
Turns out, even with RL, herding fish is harder than it looks: guidance efficacy plummets as school size increases.
LLM-orchestrated multi-robot systems can overcome physical execution failures and achieve near-teleoperation performance by intelligently requesting human assistance only when needed.
Implicit control, where assistive robots adapt to user cues instead of direct commands, can actually *increase* a user's sense of control and reduce workload.
Heterogeneous uncrewed vehicle swarms aren't just a collection of different robots; they're a fundamentally more resilient architecture, provided you navigate the complexities of sim-to-real transfer and standardized evaluation.
Algorithmic expertise can now be explicitly represented, learned, and transferred as executable knowledge graphs, unlocking zero-shot generalization across domains.
Software engineers in regulated industries will only adopt sustainable coding tools that fit seamlessly into their existing workflows, require minimal data access, and provide actionable insights.
The lack of comprehensive benchmarks for AI blue teams leaves SOCs vulnerable, and this paper lays the groundwork for rectifying that gap.
Forget trajectory-level rollouts: MuSEAgent learns faster and reasons better by distilling past interactions into reusable, state-aware decision experiences.
LLMs can learn reusable code optimization skills from slow/fast program pairs, enabling significant efficiency improvements without runtime feedback.
Web agents can achieve 3x faster search and higher final accuracy by dynamically adapting their context management strategy based on the current state, rather than sticking to a single fixed approach.
Constraining LLMs' vocabulary ("No-Have" or "E-Prime") can boost ethical reasoning by 19%, and ensembles of these constrained agents can solve debugging problems that standard models miss.
Current anonymization methods either over-process images or miss subtle identifiers, but this new agentic framework nails context-aware PII segmentation with diffusion, slashing Re-ID risk by 73% while preserving image quality.