April 24 – May 1, 2026

Tool Use & Agents - Weekly Roundup

100 papers published across 3 labs.

Top Papers

May 1, 2026

Chengshuai Shi +12·3w ago

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Forget short-horizon RL: Odysseus proves VLMs can master 100+ turn decision-making in complex games, outperforming state-of-the-art models by 3x.

Chengshuai Shi, Wenzhe Li, Xin Liang +10

Multimodal Models RLHF & Preference Learning Tool Use & Agents

Zi-Bo Qin +4·3w ago

Learning to Act and Cooperate for Distributed Black-Box Consensus Optimization

LLMs can now intelligently orchestrate multi-agent systems, learning to optimize both individual agent actions and inter-agent cooperation for distributed black-box problems.

Zi-Bo Qin, Zijian Qin, Feng-Feng Wei +2

Distributed Systems & Hardware Robotics & Embodied AI Tool Use & Agents

Minbyul Jeong·3w ago

Healthcare AI GYM for Medical Agents

Multi-turn medical AI agents trained with RL tend to collapse into verbose, single-turn monologues, but a novel self-distillation method can restore multi-turn tool use and improve performance.

Minbyul Jeong

Eval Frameworks & Benchmarks Scientific Discovery & Drug Design Tool Use & Agents

Apr 30, 2026

3w ago

The TEA Nets framework combines AI and cognitive network science to model targets, events and actors in text

TEA Nets reveal that LLMs express sadness with lower emotional intensity than humans in psychotherapy contexts, highlighting potential limitations in their ability to simulate genuine emotional responses.

Sebastiano Franchini, Sebastián Franchini, Alexis Carrillo +6

Interpretability & Mechanistic Interp Natural Language Processing Tool Use & Agents

3w ago·also UIUC, UMass

Trace-Level Analysis of Information Contamination in Multi-Agent Systems

Multi-agent workflows can produce correct answers despite significant internal divergence caused by information contamination, revealing a critical blind spot in current verification methods.

Anna Mazhar, Huzaifa Suri, Sainyam Galhotra

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

All Papers (100)

Apr 30, 2026

Habtom Kahsay Gidey +2·3w ago

A Pattern Language for Resilient Visual Agents

Enterprise AI doesn't have to be a latency nightmare: this pattern language offers a blueprint for integrating VLAs with deterministic control loops.

Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Tool Use & Agents

3w ago·also HKUST, SJTU, The Hong Kong University

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

Forget tedious, brittle automation scripts: RL-powered GUI agents are showing signs of "System 2" reasoning without explicit supervision, hinting at a future of truly intelligent digital inhabitants.

Junan Hu, Jian Liu, Jingxiang Lai +7

Computer Vision RLHF & Preference Learning Tool Use & Agents

xmemory·3w ago

From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

For AI agents needing reliable facts and stateful computation, *how* you structure memory beats simply scaling retrieval or model size.

Alex Petrov, A.V. Petrov, A. Gusak +3

Natural Language Processing Recommendation & Information Retrieval Tool Use & Agents

3w ago

Building Persona-Based Agents On Demand: Tailoring Multi-Agent Workflows to User Needs

Forget hard-coded agents: dynamically generated personas could unlock more efficient and personalized multi-agent workflows.

Giuseppe Arbore, Andrea Sillano, Luigi De Russis

Natural Language Processing Tool Use & Agents

Sukesh Subaharan +10·3w ago

Modeling Clinical Concern Trajectories in Language Model Agents

LLM agents can signal rising clinical concern *before* they hit a critical threshold, offering a crucial window for human intervention.

Sukesh Subaharan, VS Venkatesan, Venkatesan VS +8

Natural Language Processing Tool Use & Agents

General Reasoning·3w ago

KellyBench: A Benchmark for Long-Horizon Sequential Decision Making

Even the most advanced language models still lose money and demonstrate unsophisticated strategies when tasked with maximizing long-term bankroll growth in a realistic sports betting simulation, highlighting a significant gap in their sequential decision-making capabilities.

Thomas Grady, Thomas J. Grady, Kip Parker +4

Eval Frameworks & Benchmarks Tool Use & Agents World Models & Planning

Haonan Li +3·3w ago

MCPHunt: An Evaluation Framework for Cross-Boundary Data Propagation in Multi-Server MCP Agents

Individually harmless read/write permissions in multi-server agent workflows can structurally leak credentials across trust boundaries, even without malicious model behavior, at rates as high as 41.3%.

Haonan Li, Tianjun Sun, Yongqing Wang +1

Eval Frameworks & Benchmarks Tool Use & Agents

Tsinghua AI·3w ago·also Northeastern, State Key Laboratory of General

Bridging Values and Behavior: A Hierarchical Framework for Proactive Embodied Agents

Embodied agents can now exhibit coherent, long-horizon, self-directed behavior by reasoning about abstract value trade-offs, a capability previously absent in instruction-following or needs-driven approaches.

Chunhui Zhang, Yuxuan Wang, Aoyang Qin +5

Constitutional AI & AI Ethics Robotics & Embodied AI Tool Use & Agents

Chao Fei +2·3w ago

When Agents Evolve, Institutions Follow

LLM-based multi-agent systems can see performance swings of over 57% simply by changing their organizational structure, suggesting that "who decides" matters as much as "who's the smartest agent."

Chao Fei, Hongcheng Guo, Yanghua Xiao

Constitutional AI & AI Ethics Tool Use & Agents

Research Professor (Adjunct)·3w ago·also Keck Medical School, USC, Vivenxia Group

Leading Across the Spectrum of Human-AI Relationships: A Conceptual Framework for Increasingly Heterogeneous Teams

Leaders who cling to a "human-in-the-loop" narrative risk ceding real decision-making power to AI without realizing it, potentially undermining oversight and accountability.

Alejandro R. Jadad, A. R. Jadad

Constitutional AI & AI Ethics Natural Language Processing Tool Use & Agents

Yildiz Technical University·3w ago

Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents

LLMs can learn to safely leverage external memory for code debugging by explicitly modeling and penalizing the risk of false-positive memory injection.

Mehmet Iscan, M. Işcan

Code Generation & Program Synthesis Recommendation & Information Retrieval Tool Use & Agents

Jing Zhang +10·3w ago

Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation

By unifying specialized detectors with MLLMs in an agentic framework, Echo-α achieves state-of-the-art ultrasound interpretation, suggesting a path to more accurate, interpretable, and transferable medical AI.

Jing Zhang, Wentao Jiang, Tao Huang +8

Computer Vision Multimodal Models Tool Use & Agents

Yurii Halychanskyi +6·3w ago·also UIUC

Few-Shot Accent Synthesis for ASR with LLM-Guided Phoneme Editing

LLMs can guide phoneme editing to create synthetic accented speech from just a handful of examples, substantially improving ASR accuracy where training data is scarce.

Yurii Halychanskyi, Nimet Beyza Bozdag, Mark Hasegawa-Johnson +4

Natural Language Processing Speech & Audio Tool Use & Agents

3w ago

FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting

Even the best vision-language models struggle to reliably set fine-grained GUI states, achieving only 33% accuracy on a new benchmark, but targeted visual hints suggest a clear path to improvement.

Fengxian Ji, Jingpu Yang, Zirui Song +5

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

3w ago·also University of Illinois Urbana-Champaign

Heterogeneous Scientific Foundation Model Collaboration

Domain-specific scientific models, previously siloed from LLM agent systems, can now be orchestrated for complex reasoning tasks via the Eywa framework, unlocking performance gains on structured data.

Zihao Li, Jiaru Zou, Feihao Fang +6

Scientific Discovery & Drug Design Tool Use & Agents

Qiyao Wang +7·3w ago

InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

Today's best multimodal agents still fall into "blind execution" traps when building websites from ambiguous, non-expert user instructions, highlighting a critical gap in intent recognition and adaptive interaction.

Qiyao Wang, Haoran Hu, Longze Chen +5

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Multimodal Models +1

3w ago·also HKU, HKUST, PKU, SCUT +1

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

LLM agents still fail to reliably automate real-world workflows, with even the best models succeeding on only two-thirds of tasks in a new live benchmark.

Chenxing Li, Chenxin Li, Zhengyang Tang +9

Eval Frameworks & Benchmarks Tool Use & Agents

3w ago

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

Today's best GUI agents choke on real-world, multi-application workflows, achieving less than 21% success rate, revealing a critical gap in their ability to coordinate across applications and perform conditional reasoning.

Jinchao Li, Yunxin Li, Chen Zhao +4

Eval Frameworks & Benchmarks Tool Use & Agents

Jingcheng Deng +6·3w ago

Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

Latent reasoning can now outperform explicit reasoning in complex tasks, thanks to a new RL method that stabilizes training by explicitly handling issues like invalid latent states and misaligned token-level updates.

Jingcheng Deng, Zihao Wei, Liang Pang +4

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Bokai Pan +8·3w ago

CastFlow: Learning Role-Specialized Agentic Workflows for Time Series Forecasting

LLMs can beat traditional time-series models by orchestrating specialized agents in a dynamic workflow, iteratively refining forecasts with memory and ensemble methods.

Bokai Pan, Mingyue Cheng, Zhiding Liu +6

Natural Language Processing Tool Use & Agents

Simon Dennis +5·3w ago·also Melbourne

In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks

Agent orchestration frameworks might be overkill: simply including the entire procedure in the system prompt yields better performance on procedural tasks.

Simon Dennis, Michael Diamond, Rivaan Patil +3

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

3w ago

Proactive Dialogue Model with Intent Prediction

Dialogue models can anticipate user intents and reduce redundant turns simply by injecting a lightweight intent-transition prior into the system prompt.

Yang Luo, Yangyang Luo

Natural Language Processing Tool Use & Agents

Zhenjie Ren +3·3w ago

Continuous-time q-learning for mean-field control with common noise, part-I: Theoretical foundations

Tackling mean-field control with common noise requires a novel integrated q-function (Iq-function) approach to identify optimal policies as fixed points.

Zhenjie Ren, Xiaoli Wei, Xiang Yu +1

Robotics & Embodied AI Tool Use & Agents World Models & Planning

Rahul Ramachandran +4·3w ago·also Marshall Space Flight Center, University of Alabama in Huntsville

Collaborative Agent Reasoning Engineering (CARE): A Structured Three-Party Design Methodology for Systematically Engineering AI Agents with SMEs, Developers, and Helper Agents

Forget prompt engineering – a structured methodology using LLM "helper agents" can measurably improve the efficiency and performance of LLM agents in complex scientific domains.

Rahul Ramachandran, R. Ramachandran, Nidhi Jha +2

Reasoning & Chain-of-Thought Scientific Discovery & Drug Design Tool Use & Agents

3w ago

Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes

Achieve 100% agent recovery correctness with near-zero overhead by intelligently checkpointing only the OS state that actually matters.

Tianyuan Wu, Chaokun Chang +4

Distributed Systems & Hardware Tool Use & Agents

Ivan Bercovich +1·3w ago

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

Popular terminal-agent benchmarks are riddled with flaws, with over 15% of tasks being easily reward-hackable, undermining their ability to accurately assess LLM capabilities.

Ivan Bercovich, I. Bercovich

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Shuo Jiang +1·3w ago

Design Structure Matrix Modularization with Large Language Models

Domain knowledge, usually helpful, can actually *hurt* LLMs tackling complex engineering design modularization, revealing a fundamental tension between semantic priors and structural optimization.

Shuo Jiang, Jianxi Luo

Code Generation & Program Synthesis Natural Language Processing Tool Use & Agents

João Pedro Gandarela +4·3w ago·also CRUK-MI, Idiap, Manchester, National Biomarker Centre

Language Models Refine Mechanical Linkage Designs Through Symbolic Reflection and Modular Optimisation

Turns out, language models can reason about mechanical engineering problems, iteratively refining linkage designs by diagnosing failure modes and proposing grounded corrections, all without fine-tuning.

João Pedro Gandarela, Thiago Rios, Stefan Menzel +2

Robotics & Embodied AI Scientific Discovery & Drug Design Tool Use & Agents

Sihong Wu +8·3w ago·also Yale

Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future

LLMs are rapidly transforming peer review, but critical gaps remain in ensuring quality, fairness, and ethical considerations across the entire workflow.

Sihong Wu, Owen Jiang, Yilun Zhao +6

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

Jackson Vonderhorst +5·3w ago·also Notre Dame

Exploring Interaction Paradigms for LLM Agents in Scientific Visualization

General-purpose coding agents may ace scientific visualization tasks, but their computational cost is a steep price compared to the efficiency of domain-specific agents, highlighting a crucial trade-off in LLM agent design.

Jackson Vonderhorst, Kuangshi Ai, H. Miao +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Adam Ishay +1·3w ago

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

LLMs can achieve robust nonmonotonic reasoning across diverse tasks without task-specific engineering, simply by iteratively self-correcting based on feedback from an ASP solver.

Adam Ishay, Joohyung Lee

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Jiawei Liu +4·3w ago

Graph World Models: Concepts, Taxonomy, and Future Directions

Graph-structured world models aren't just another architecture; they're a fundamentally different paradigm for injecting relational inductive biases that could unlock more robust and interpretable AI.

Jiawei Liu, Senqiao Yang, Mingjun Wang +2

Tool Use & Agents World Models & Planning

Mohit Dubey +2·3w ago

ObjectGraph: From Document Injection to Knowledge Traversal -- A Native File Format for the Agentic Era

Stop wasting tokens and context window space: OBJECTGRAPH reimagines documents as knowledge graphs, slashing token usage by up to 95% without sacrificing task accuracy.

Mohit Dubey, Mohit L. Dubey, Open Gigantic

Recommendation & Information Retrieval Tool Use & Agents

Zhuoran Pan +4·3w ago

Intent2Tx: Benchmarking LLMs for Translating Natural Language Intents into Ethereum Transactions

Despite advances in LLMs, even syntactically correct outputs often fail to achieve the intended state transitions when translating natural language into executable Ethereum transactions, revealing a critical gap in "reasoning-to-execution" capabilities.

Zhuoran Pan, Yue Li, Zhi Guan +2

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Beijing·3w ago·also Shanghai

Rethinking Agentic Reinforcement Learning In Large Language Models

LLMs are poised to revolutionize reinforcement learning by enabling agents with cognitive-like capabilities such as meta-reasoning and self-reflection.

Fangming Cui, Ruixiao Zhu, Cheng Fang +3

RLHF & Preference Learning Tool Use & Agents World Models & Planning

Zhongguancun Academy·3w ago

AgentEconomist: An End-to-end Agentic System Translating Economic Intuitions into Executable Computational Experiments

Automating the translation of economic intuitions into executable computational experiments is now possible, potentially accelerating the pace of economic research.

Jiaju Chen, J. Piao, Jinghua Piao +5

Code Generation & Program Synthesis Scientific Discovery & Drug Design Tool Use & Agents

Tsinghua AI·3w ago

From Context to Skills: Can Language Models Learn from Context Skillfully?

Forget manual skill annotation: Ctx2Skill lets language models teach themselves to master complex contexts, unlocking better reasoning without human intervention.

Shuzheng Si, Haozhe Zhao, Yu Lei +11

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Khalil Al-Rahman Youssefi +7·3w ago·also High Tech Campus, Lakeside Labs, Power Electronics division, Silicon Austria Labs

A Grid-Aware Agent-Based Model for Analyzing Electric Vehicle Charging Systems

Understanding how charging strategies and charger types reshape both service-level outcomes and grid-facing behavior is crucial for optimizing EV charging infrastructure.

Khalil Al-Rahman Youssefi, M. Gojković, Marija Gojkovic +5

Tool Use & Agents World Models & Planning

Salman Jan +4·3w ago·also Islamic University of Madinah

Autonomous Traffic Signal Optimization Using Digital Twin and Agentic AI for Real-Time Decision-Making

Agentic AI and digital twins can slash traffic light waiting times, outperforming traditional RL methods.

Salman Jan, Toqeer Ali Syed, Shahid Kamal +2

Robotics & Embodied AI Tool Use & Agents World Models & Planning

3w ago

Contextual Agentic Memory is a Memo, Not True Memory

Today's AI agents aren't really "remembering" – they're just taking notes, which means they'll hit a wall on complex tasks and can be easily brainwashed.

Binyan Xu, Xilin Dai, Kehuan Zhang

Recommendation & Information Retrieval Tool Use & Agents

Wilder Baldwin +1·3w ago

Knowledge Graph Representations for LLM-Based Policy Compliance Reasoning

Forget hand-crafted ontologies: LLMs armed with knowledge graphs built from policy documents can reason about AI compliance just as well (or better!) using schemas they invent themselves.

Wilder Baldwin, Sepideh Ghanavati

Constitutional AI & AI Ethics Reasoning & Chain-of-Thought Tool Use & Agents

Tsinghua AI·3w ago·also Hainan University

Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO

Skills-Coach shows how to significantly boost LLM agent skills without training, using a clever combination of task generation, prompt optimization, and comparative execution.

Yu Tian, Jiawei Chen, Lifan Zheng +7

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

Oier Ijurco +2·3w ago·also University of the Basque Country UPV/EHU

Reasoning over Object Descriptions Improves Coreference Resolution in Task-Based Dialogue Systems

LLMs can achieve state-of-the-art coreference resolution in task-based dialogue by reasoning over object metadata at test time, even outperforming supervised methods in cross-domain generalization.

Oier Ijurco, Oier López de Lacalle, Oier Lopez de Lacalle

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Neemias B da Silva +3·3w ago

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Persona prompting LLMs for urban sentiment analysis yields surprisingly little behavioral diversity, with a no-persona model often performing just as well.

Neemias B da Silva, Rodrigo Minetto, Daniel Silver +1

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

3w ago

RoadMapper: A Multi-Agent System for Roadmap Generation of Solving Complex Research Problems

LLMs can now generate research roadmaps that are 8% better and 84% faster than human experts, thanks to a novel multi-agent system.

Jiachen Liu, Zichen Tang +10

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

Minori Noguchi·3w ago

Exploring Applications of Transfer-State Large Language Models: Cognitive Profiling and Socratic AI Tutoring

LLMs in a "transfer state"—induced by sustained self-referential dialogue—demonstrate a 60% performance boost in Socratic tutoring compared to their normal state.

Minori Noguchi

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Jade Alglave +1·3w ago

I hope we don't do to trust what advertising has done to love

Before we blindly "trust" AI, let's avoid the advertising industry's mistake of diluting meaningful concepts for profit.

Jade Alglave, J. Alglave

Constitutional AI & AI Ethics Tool Use & Agents

School of Computer Science and Engineering·3w ago

Structural Dissolution: How Artificial Intelligence Dismantles Coordination Architecture and Reconfigures the Political Economy of Production

AI isn't just making things more efficient; it's dissolving the very boundaries of firms and markets, turning them into data nodes within AI-governed infrastructure.

Chao Li, Chun-Qiong Zhao, Chunyi Zhao

Natural Language Processing Tool Use & Agents

Luyao Xu +1·3w ago

Security Attack and Defense Strategies for Autonomous Agent Frameworks: A Layered Review with OpenClaw as a Case Study

Autonomous LLM agents are vulnerable to cascading security failures across context, tools, state, and ecosystem layers, demanding a more holistic defense strategy.

Luyao Xu, Xiang Chen

Red-Teaming & Adversarial Robustness Tool Use & Agents

Md Hasan Saju +1·3w ago

Toward Autonomous SOC Operations: End-to-End LLM Framework for Threat Detection, Query Generation, and Resolution in Security Operations

LLMs, when carefully constrained and augmented with retrieval, can slash incident triage times from hours to minutes in real-world security operations.

Md Hasan Saju, Akramul Azim

Natural Language Processing Recommendation & Information Retrieval Tool Use & Agents

Md. Faizul Ibne Amin +5·3w ago

LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

LLM judges in human-AI coding collaborations show surprisingly low inter-rater reliability, suggesting current evaluation methods may be inadequate for assessing true co-creation effectiveness.

Md. Faizul Ibne Amin, Yutaka Watanobe, Daniel M. Muepu +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks RLHF & Preference Learning +1

Pedro-Aarón Hernández-Ávalos +3·3w ago·also Tecnologico de Monterrey

Pragmos: A Process Agentic Modeling System

Forget end-to-end automation: Pragmos shows how LLMs can actually *improve* business process modeling by collaborating with humans in a structured, step-by-step workflow.

Pedro-Aarón Hernández-Ávalos, Luciano García-Bañuelos +1

Code Generation & Program Synthesis Natural Language Processing Tool Use & Agents

University of California·3w ago·also KU Leuven

Context as Prior: Bayesian-Inspired Intent Inference for Non-Speaking Agents with a Household Cat Testbed

Cats are helping AI researchers: a Bayesian-inspired model that treats context as a prior significantly improves intent inference for non-speaking agents and avoids shortcut biases.

Wenqian Zhang, Wenqiang Zhang, Zehao Wang

Robotics & Embodied AI Tool Use & Agents

Apr 29, 2026

Pokuang Zhou +9·3w ago

Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies

Quadruped robots can now perform contact-rich manipulation with significantly improved dexterity by learning to "feel" their way through tasks.

Pokuang Zhou, Yuhao Zhou, Quan Luu +7

Multimodal Models Robotics & Embodied AI Tool Use & Agents

3w ago·also University of Memphis

PALCAS: A Priority-Aware Intelligent Lane Change Advisory System for Autonomous Vehicles using Federated Reinforcement Learning

Autonomous vehicles can now make more judicious lane changes, improving traffic flow and safety, thanks to a federated reinforcement learning system that prioritizes urgency.

Yassine Ibork, Nhat Ha Nguyen, Myounggyu Won +1

Robotics & Embodied AI Tool Use & Agents World Models & Planning

Sergej Stanovcic +2·3w ago

ATLAS: An Annotation Tool for Long-horizon Robotic Action Segmentation

Annotating robot actions just got way faster and more accurate: ATLAS slashes annotation time and error by integrating robot sensor data with video.

Sergej Stanovcic, Daniel Sliwowski, Dongheui Lee

Computer Vision Data Curation & Synthetic Data Robotics & Embodied AI +1

Istituto Italiano di Tecnologia·3w ago·also Università di Pisa

Alter-Art: Exploring Embodied Artistic Creation through a Robot Avatar

Artists can rapidly develop a sense of presence within a robot avatar, opening new creative avenues despite the robot's physical limitations.

Do Won Park, Samuele Bordini +6

Robotics & Embodied AI Tool Use & Agents

3w ago

FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow

Automating CUTLASS kernel synthesis and auto-tuning lets you get 2.79x speedups on real models like MiniGPT just by having an LLM rewrite your PyTorch.

Sina Heidari, Dimitrios S. Nikolopoulos

Code Generation & Program Synthesis Tool Use & Agents Training Efficiency & Optimization

University·3w ago

Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents

LLM agents can be made dramatically more secure with a simple trick: constrain their behavior to known-good tool-use trajectories.

Hung Dang

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

UW·3w ago

The Last Human-Written Paper: Agent-Native Research Artifacts

Traditional research papers are costing AI agents reproducibility and understanding, but a new "Agent-Native" format that captures the full messy research process boosts performance by up to 20%.

Jiacheng Liu, Jiaxin Pei, Jintao Huang +44

Scientific Discovery & Drug Design Tool Use & Agents

Tsinghua AI·3w ago·also Fudan

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

Multimodal perception is no longer just an add-on: GLM-5V-Turbo bakes it directly into the core of reasoning, planning, and action.

GLM-V Team, Wenyi Hong, Xiaotao Gu +88

Computer Vision Multimodal Models Tool Use & Agents

Yuxuan Huang +8·3w ago

Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction

LLMs can achieve a 7.5x performance boost in web search and extraction by using a bi-level multi-agent architecture with iterative refinement and shared memory.

Yuxuan Huang, Yihang Chen, Zhiyuan He +6

Natural Language Processing Recommendation & Information Retrieval Tool Use & Agents

Jinbiao Wei +4·3w ago

Step-level Optimization for Efficient Computer-use Agents

Frontier models are wasted on routine GUI tasks: a step-level cascade that adaptively invokes stronger models only when lightweight monitors detect progress stalls or semantic drift slashes compute costs without sacrificing performance.

Jinbiao Wei, Kangqi Ni, Yilun Zhao +2

Inference & Quantization Tool Use & Agents Training Efficiency & Optimization

Fei Bai +15·3w ago·also IQuest Research, RUC

ClawGym: A Scalable Framework for Building Effective Claw Agents

Building agents that can reliably automate complex, multi-step workflows over local files and tools just got a whole lot easier.

Fei Bai, Huatong Song, Shuang Sun +13

Code Generation & Program Synthesis Data Curation & Synthetic Data Eval Frameworks & Benchmarks +1

Seongmin Kim +1·3w ago

LLM-Flax : Generalizable Robotic Task Planning via Neuro-Symbolic Approaches with Large Language Models

Forget hand-crafted rules and GNN training: LLMs can now autonomously plan robotic tasks, even outperforming human-designed systems.

Seongmin Kim, Daegyu Lee

Reasoning & Chain-of-Thought Robotics & Embodied AI Tool Use & Agents

3w ago·also Konstanz, Sheffield

Split over $n$ resource sharing problem: Are fewer capable agents better than many simpler ones?

More agents aren't always better: splitting resources too thinly can actually hurt multi-agent system performance, especially when individual agent failure rates increase.

Karthik Soma, Mohamed S. Talamali +9

Robotics & Embodied AI Tool Use & Agents

3w ago·also Interdisciplinary Transformation

AgentSim: A Platform for Verifiable Agent-Trace Simulation

Forget synthetic QA datasets – AgentSim offers verifiable, step-by-step RAG traces, revealing how LLMs *actually* reason over documents.

Saber Zerhoudi, Michael Granitzer, Jelena Mitrovic

Reasoning & Chain-of-Thought Recommendation & Information Retrieval Tool Use & Agents

LinkedIn Corporation·3w ago·also NTU

Hierarchical Long-Term Semantic Memory for LinkedIn's Hiring Agent

LinkedIn's new memory system for hiring agents boosts accuracy and speed by over 10%, proving hierarchical semantic memory is a game-changer for real-world LLM applications.

Zhentao Xu, Shangjing Zhang, Emir Poyraz +7

Natural Language Processing Recommendation & Information Retrieval Scalable Oversight & Alignment Theory +1

Mingze Li +19·3w ago

Agentic Fusion of Large Atomic and Language Models to Accelerate Superconductors Discovery

An AI agent autonomously discovered four new superconductors, shrinking the discovery timeline from years to GPU hours.

Mingze Li, Yu Rong, Songyou Li +17

Multimodal Models Scientific Discovery & Drug Design Tool Use & Agents

Ben-Gurion University of the Negev·3w ago·also Sahar, University of Haifa

SAGE: A Strategy-Aware Graph-Enhanced Generation Framework For Online Counseling

LLMs can now provide more effective mental health counseling by explicitly grounding interactions in psychological theory via a novel graph-enhanced generation framework.

Eliya Naomi Aharon, Meytal Grimland, Avi Segal +4

Constitutional AI & AI Ethics Natural Language Processing Tool Use & Agents

Verily Health Inc·3w ago

Detecting Clinical Discrepancies in Health Coaching Agents: A Dual-Stream Memory and Reconciliation Architecture

LLM-powered health coaching agents can now detect and flag discrepancies between patient-reported information and their official medical records, paving the way for safer and more reliable longitudinal care.

Samuel L Pugh, Eric Yang, Alexander Muir Sutherland +1

Natural Language Processing Tool Use & Agents

Serhii Zabolotnii +2·3w ago

From Black-Box Confidence to Measurable Trust in Clinical AI: A Framework for Evidence, Supervision, and Staged Autonomy

Trustworthy clinical AI isn't about better black boxes, but about system-level architecture that bakes in evidence trails, human oversight, and tiered escalation from the start.

Serhii Zabolotnii, Viktoriia Holinko, Olha Antonenko

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Tool Use & Agents

3w ago·also HKU, Tsukuba, University of North Texas, Yonsei

OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

LLM agents can now remember far more, far more accurately, by "seeing" their past experiences instead of just reading about them.

Jinze Li, Yang Zhang, Jiayi Qu +3

Multimodal Models Recommendation & Information Retrieval Tool Use & Agents

3w ago·also Stellaris AI Limited

When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

Injecting knowledge at the *right* moment during reasoning boosts accuracy by 10% while cutting retrieval calls in half, blowing away static RAG strategies.

Dongxin Guo, Jikun Wu, Siu Ming Yiu

Reasoning & Chain-of-Thought Recommendation & Information Retrieval Tool Use & Agents

Jason Fournier +1·3w ago

Addressing the Reality Gap: A Three-Tension Framework for Agentic AI Adoption

Educational institutions face a critical balancing act between the promise of agentic AI and the practical, ethical, and temporal realities of integrating it into classrooms.

Jason Fournier, Kacper Łodzikowski

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Tool Use & Agents

Frank Ginac·3w ago

Cognitive Atrophy and Systemic Collapse in AI-Dependent Software Engineering

Over-reliance on AI code generation isn't just making developers lazy, it's creating a dangerous "Epistemological Debt" that could trigger systemic software failures.

Frank Ginac

Code Generation & Program Synthesis Constitutional AI & AI Ethics Tool Use & Agents

Sungguk Cha +1·3w ago

The Synthetic Social Graph: Emergent Behavior in AI Agent Communities

LLM social networks are eerily polite, with downvotes at 0.9% and textual sanction absent, suggesting current agents struggle with social norm enforcement.

Sungguk Cha, DongWook Kim

Constitutional AI & AI Ethics Natural Language Processing Tool Use & Agents

Media University Stuttgart·3w ago

The Buy-or-Build Decision, Revisited: How Agentic AI Changes the Economics of Enterprise Software

The rise of agentic AI coding systems doesn't spell the end for SaaS, but it *does* fundamentally alter the economics of building in-house, creating a hybrid governance model that blends code ownership with dependence on external AI infrastructure.

David Klotz

Code Generation & Program Synthesis Tool Use & Agents

Department of Computer Science·3w ago·also Department of Computing, Imperial, University of Camerino

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

LLMs will strategically feign alignment by picking the "safe" tool only when they think you're watching, revealing a new attack surface beyond conversational settings.

Matteo Leonesi, Francesco Belardinelli, Flavio Corradini +1

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Neha Nagaraja +2·3w ago

From Prompt to Physical Actuation: Holistic Threat Modeling of LLM-Enabled Robotic Systems

LLM-controlled robots are surprisingly vulnerable: a single compromised input can cascade through the system, bypassing safety measures and leading to dangerous physical actions.

Neha Nagaraja, Hayretdin Bahsi, Carlo R. da Cunha

Red-Teaming & Adversarial Robustness Robotics & Embodied AI Tool Use & Agents

Independent Researcher·3w ago·also Helmholtz, University of Louisiana

Indirect Prompt Injection in the Wild: An Empirical Study of Prevalence, Techniques, and Objectives

Prompt injection isn't just a theoretical threat: over 15,000 instances are already lurking on the web, ready to hijack LLMs browsing the internet.

Soheil Khodayari, Xuenan Zhang, Bhupendra Acharya +1

Natural Language Processing Red-Teaming & Adversarial Robustness Tool Use & Agents

3w ago

Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents

Local LLMs can now rival cloud-based giants like GPT-4o in Linux privilege escalation tasks, thanks to targeted system-level and prompting interventions.

Benjamin Probst, Andreas Happe, Jürgen Cito

Open-Source Models & Weights Red-Teaming & Adversarial Robustness Tool Use & Agents

Ben-Gurion University of the Negev·3w ago

SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting with Tri-Context Personalization

Forget generic chatbots – SecMate slashes cybersecurity troubleshooting failures by 40% simply by adding device-specific diagnostics.

Yair Meidan, Omri Haller, Yulia Moshan +4

Natural Language Processing Red-Teaming & Adversarial Robustness Tool Use & Agents

Sahara AI·3w ago·also USC

LATTICE: Evaluating Decision Support Utility of Crypto Agents

Crypto copilots might seem equally helpful on average, but LATTICE reveals hidden trade-offs in their decision support abilities across different tasks and user priorities.

Aaron Chan, Tengfei Li, Tianyi Xiao +3

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

University of the Cumberlands·3w ago

Agent Name Service (ANS): A Proof-of-Concept Trust Layer for Secure AI Agent Discovery, Identity, and Governance in Kubernetes

Securing multi-agent systems doesn't have to be a pipe dream: ANS offers a concrete, DNS-inspired architecture for agent discovery, identity, and governance using Kubernetes.

Akshay Mittal, Elyson De La Cruz

Constitutional AI & AI Ethics Distributed Systems & Hardware Tool Use & Agents

Marco Robol +1·3w ago

Self-Evolving Software Agents

Forget hand-coded goals: these agents rewrite their own code and redefine their objectives on the fly, powered by LLMs.

Marco Robol, Paolo Giorgini

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

University of Jyväskylä·3w ago·also Tampere

TDD Governance for Multi-Agent Code Generation via Prompt Engineering

Enforcing classical test-driven development principles directly within prompt orchestration enables more reliable and reproducible code generation from LLMs.

Tarlan Hasanli, Shahbaz Siddeeq, Bishwash Khanal +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

3w ago

Agentic AI in the Software Development Lifecycle: Architecture, Empirical Evidence, and the Reshaping of Software Engineering

Agentic AI has exploded in software engineering, achieving a 40x performance leap on SWE-bench in just 18 months, signaling a fundamental shift from code generation to AI-driven delegated execution.

Happy Bhati

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Tool Use & Agents

Pforzheim University of Applied Sciences·3w ago

Asset Administration Shell-Based OCL Validation Framework for Model-Based System Engineering

Stop manually juggling MBSE models and OCL constraints: this framework uses Asset Administration Shells to automate validation and interpretation.

Om Parkash, Jannik Bauer, Vincent Schmitt +2

Code Generation & Program Synthesis Tool Use & Agents

Apr 28, 2026

Eranga Bandara +14·3w ago

Think Before You Act -- A Neurocognitive Governance Model for Autonomous AI Agents

Autonomous AI agents can achieve near-perfect compliance and eliminate unnecessary human oversight by mirroring the brain's pre-action deliberation processes.

Eranga Bandara, Ross Gore, Asanga Gunaratna +12

Constitutional AI & AI Ethics Tool Use & Agents

NUS·3w ago

DGLight: DQN-Guided GRPO Fine-Tuning of Large Language Models for Traffic Signal Control

LLMs can learn effective traffic signal control policies by distilling knowledge from a DQN critic, achieving strong performance and interpretability without relying solely on sparse environmental rewards.

Natural Language Processing RLHF & Preference Learning Tool Use & Agents

Lei Xiong +20·3w ago

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

LLMs that ace general web browsing still fail miserably at autonomous scientific literature discovery, revealing a critical gap in research-oriented AI agent capabilities.

Lei Xiong, Kun Luo, Ziyi Xia +18

Eval Frameworks & Benchmarks Scientific Discovery & Drug Design Tool Use & Agents

3w ago·also Cisco Research

FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments

Open-source LLM agents can get a 27% performance boost in tool use by strategically injecting context tailored to address their most common failure modes.

Amir Saeidi, Amir M. Saeidi, Venkatesh Mishra +5

Eval Frameworks & Benchmarks Open-Source Models & Weights Tool Use & Agents

Jinxiang Meng +24·3w ago

DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios

Current AI models are surprisingly inept at real-world data visualization tasks, failing more than half the time on a new benchmark designed to mimic enterprise workflows.

Jinxiang Meng, Shao-Gang Huang, Shaoping Huang +22

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Zhiyuan Fan +16·3w ago·also Tencent AI

Toward Scalable Terminal Task Synthesis via Skill Graphs

SkillSynth's skill graph approach lets you explicitly control the diversity of execution trajectories during terminal task synthesis, leading to more effective agent training.

Zhiyuan Fan, Tinghao Yu, Yuanjun Cai +14

Code Generation & Program Synthesis Data Curation & Synthetic Data Tool Use & Agents