Tool Use & Agents
Capabilities: LLM-based autonomous agents, tool-augmented language models, function calling, and agentic workflows.
Recent Papers
This paper investigates the impact of different LLM-powered AI assistance modalities (Advisor, Coach, Delegate) on human performance in multi-party negotiation games. Participants played bargaining games with access to one of these modalities, all of which were backed by the same underlying LLM. The key finding is a preference-performance misalignment: participants preferred the Advisor but achieved higher individual gains with the Delegate, which acted as a "market maker" by injecting Pareto-improving proposals.
Demonstrates a preference-performance misalignment in AI-assisted negotiation, revealing that users do not always adopt the AI modality that maximizes their gains or overall group welfare.
This paper presents an empirical study of AI coding agent contributions in open-source Android and iOS mobile app development by analyzing 2,901 AI-authored pull requests (PRs) from 193 GitHub repositories. The study reveals that Android projects receive more AI-authored PRs and exhibit higher acceptance rates compared to iOS, with routine tasks showing higher acceptance rates than structural changes. The analysis also indicates an initial improvement followed by a decline in PR resolution time on Android, providing insights into the evolving impact of AI agents on OSS mobile projects.
Empirically characterizes the effects of AI coding agents on open-source Android and iOS mobile app projects by analyzing PR acceptance behaviors across platforms, agents, and task categories.
The paper investigates test-time scaling strategies for web agents in multi-step tasks, finding that uniform scaling saturates quickly and LLM-based arbiters can overrule high-consensus decisions. They demonstrate that uncertainty statistics from the agent's vote distribution correlate with task success, enabling dynamic compute allocation. Based on these findings, they introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are contentious, improving performance and efficiency.
Introduces Confidence-Aware Test-Time Scaling (CATTS), a novel method for dynamically allocating compute to web agents based on vote-derived uncertainty, achieving improved performance and efficiency compared to uniform scaling.
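The summary gives the gist but not the gating rule; below is a minimal Python sketch of confidence-aware scaling, assuming majority voting over sampled candidate actions and a normalized vote-entropy threshold. The function names, committee sizes, and threshold value are our illustration, not CATTS's actual procedure.

```python
import math
import random
from collections import Counter

def vote_entropy(votes):
    """Normalized Shannon entropy of an action-vote distribution (0 = unanimous)."""
    counts = Counter(votes)
    if len(counts) < 2:
        return 0.0
    probs = [c / len(votes) for c in counts.values()]
    return -sum(p * math.log(p) for p in probs) / math.log(len(counts))

def catts_decide(sample_action, base_k=4, extra_k=12, threshold=0.6):
    """Sample a small committee first; spend extra compute only when contentious."""
    votes = [sample_action() for _ in range(base_k)]
    if vote_entropy(votes) > threshold:
        votes += [sample_action() for _ in range(extra_k)]
    return Counter(votes).most_common(1)[0][0]

# Toy usage: a stochastic policy over two candidate web actions.
random.seed(0)
print(catts_decide(lambda: random.choice(["click_submit", "scroll_down"])))
```

The design point is that the vote distribution itself is the uncertainty estimate, so no extra arbiter model is needed to decide when to scale.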
The paper introduces KeplerAgent, an LLM-based agent designed for symbolic equation discovery that mimics the scientific reasoning process of inferring physical properties before guessing equations. KeplerAgent coordinates physics-based tools to extract intermediate structure from data and uses this information to configure symbolic regression engines like PySINDy and PySR. Experiments on physical equation benchmarks demonstrate that KeplerAgent achieves significantly higher symbolic accuracy and robustness to noisy data compared to existing LLM and traditional baselines.
Introduces KeplerAgent, an agentic framework that enhances symbolic equation discovery by explicitly modeling the scientific reasoning process of inferring physical properties and using them to constrain the search space of candidate equations.
This paper introduces General Utility Markov Games (GUMGs), an extension of Convex Markov Games (cMGs) that allows for coupling between agents' occupancy measures, and proves that Nash equilibria in GUMGs coincide with fixed points of projected pseudo-gradient dynamics due to a novel agent-wise gradient domination property. Leveraging this characterization, the authors provide a simplified proof of Nash equilibrium existence, demonstrate the existence of Markov perfect equilibria, and derive a policy gradient theorem for GUMGs. Furthermore, they establish iteration and sample complexity guarantees for computing approximate-NE in potential GUMGs using policy gradient methods.
Establishes a novel agent-wise gradient domination property in General Utility Markov Games (GUMGs), enabling a characterization of Nash equilibria as fixed points of projected pseudo-gradient dynamics and facilitating the design and analysis of policy gradient algorithms.
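The fixed-point characterization can be stated compactly. A hedged sketch in standard notation follows; the symbols and the maximization convention are our choices, and the paper's agent-wise gradient domination property is what upgrades such stationary points to Nash equilibria.

```latex
% Sketch only: notation is ours, not the paper's. U_i is agent i's general
% utility over the joint occupancy measure; F stacks own-policy gradients.
F(\pi) = \big(\nabla_{\pi_1} U_1(\pi), \dots, \nabla_{\pi_n} U_n(\pi)\big),
\qquad
\pi^{\star} = \mathrm{Proj}_{\Pi}\!\big(\pi^{\star} + \eta\, F(\pi^{\star})\big)
\ \text{for some step size } \eta > 0 .
```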
The authors introduce Text2GQL-Bench, a new benchmark for text-to-graph query language translation, comprising 178,184 question-query pairs across 13 domains and supporting multiple graph query languages. They also present a comprehensive evaluation method that assesses grammatical validity, similarity, semantic alignment, and execution accuracy, moving beyond simple end-to-end metrics. Experiments reveal a significant "dialect gap" in ISO-GQL generation, where even strong LLMs struggle in zero-shot settings but improve substantially with few-shot prompting or fine-tuning.
Introduces a unified benchmark, Text2GQL-Bench, for evaluating text-to-graph query language systems, featuring a multi-GQL dataset and a scalable construction framework.
The paper introduces ModelWisdom, a toolkit designed to enhance the interpretability and usability of TLA+ model checking by addressing challenges in counterexample analysis and model repair. ModelWisdom integrates visualization techniques, graph optimization, LLM-based summarization, and automated repair suggestions to improve the debugging process. The toolkit's capabilities, including colorized violation highlighting, graph folding, and LLM-powered explanations, facilitate a more interactive and understandable workflow for TLA+ specifications.
Introduces an interactive environment, ModelWisdom, that leverages visualization and large language models to improve the interpretability and actionability of TLA+ model checking.
This paper investigates the effectiveness of repository-level context files (e.g., AGENTS.md) in improving the performance of coding agents on software development tasks. Through experiments on SWE-bench tasks with LLM-generated context files and a novel dataset of issues from repositories with developer-committed context files, the authors find that context files generally decrease task success rates and increase inference costs. They attribute this to unnecessary constraints imposed by the context files, suggesting that human-written context files should be minimal.
Empirically demonstrates that repository-level context files, both LLM-generated and human-written, can hinder the performance of coding agents on software development tasks.
This paper introduces the concept of human-LLM archetypes, defined as recurring socio-technical interaction patterns that structure the roles of humans and LLMs in collaborative decision-making. Through a scoping literature review and thematic analysis of 113 papers, the authors identified 17 distinct human-LLM archetypes. They then evaluated these archetypes across clinical diagnostic cases, demonstrating that the choice of archetype influences LLM outputs and decision outcomes.
Defines and categorizes 17 human-LLM interaction archetypes to demonstrate how these archetypes impact LLM outputs and decisions in human-AI collaborative decision-making.
The paper addresses the computational inefficiency of evolutionary AI agents that repeatedly invoke LLMs by proposing AdaptEvolve, a framework for adaptive LLM selection during evolutionary refinement. AdaptEvolve uses intrinsic generation confidence to estimate real-time solvability and dynamically selects an LLM appropriate for the current generation step. Experiments demonstrate that confidence-driven selection achieves a better Pareto frontier, reducing inference costs by 37.9% while maintaining 97.5% of the accuracy of static large models.
Introduces AdaptEvolve, a novel adaptive LLM selection framework for evolutionary AI agents that leverages intrinsic generation confidence to dynamically choose the most efficient LLM for each generation step.
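As a rough illustration of confidence-driven routing, here is a sketch that maps mean token log-probability to a confidence score and escalates to the larger model below a threshold. The proxy, the threshold, and the names are assumptions, not AdaptEvolve's actual estimator.

```python
import math

def generation_confidence(token_logprobs):
    """Mean token log-probability mapped to (0, 1); a common intrinsic-confidence proxy."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def select_model(prev_logprobs, small_model, large_model, tau=0.75):
    """Route the next refinement step to the cheap model when the previous
    generation looked confidently solvable; otherwise escalate."""
    if prev_logprobs and generation_confidence(prev_logprobs) >= tau:
        return small_model
    return large_model

# Toy usage with per-token logprobs from a previous generation step.
print(select_model([-0.1, -0.2, -0.05], "small-llm", "large-llm"))  # -> small-llm
print(select_model([-2.3, -1.8, -2.0], "small-llm", "large-llm"))   # -> large-llm
```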
The paper addresses the challenge of sparse rewards in Reinforcement Learning for GUI agents by introducing Adaptive Milestone Reward (ADMIRE), a mechanism that dynamically distills milestones from successful explorations to provide verifiable, adaptive rewards. ADMIRE employs an asymmetric credit assignment strategy to denoise successful trajectories and scaffold failed ones, effectively balancing reward fidelity and density. Experiments on AndroidWorld demonstrate over 10% improvement in success rate across different base models, with strong generalizability observed in web navigation and embodied tasks.
Introduces ADMIRE, an adaptive milestone reward mechanism with asymmetric credit assignment, to improve temporal credit assignment in long-horizon GUI agent tasks.
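A toy sketch of milestone-shaped rewards with a loose rendering of the asymmetry (dense milestone bonuses everywhere, partial outcome credit only for failed trajectories). The predicates, the in-order constraint, and the constants are illustrative assumptions, not ADMIRE's actual mechanism.

```python
def milestone_rewards(trajectory, milestones, succeeded, bonus=1.0, final=10.0):
    """Hedged sketch: milestones are predicates over states, distilled offline
    from earlier successful explorations (the distillation is the paper's)."""
    rewards = [0.0] * len(trajectory)
    next_ms = 0  # milestones must be hit in order
    for t, state in enumerate(trajectory):
        if next_ms < len(milestones) and milestones[next_ms](state):
            rewards[t] += bonus  # dense, verifiable intermediate reward
            next_ms += 1
    if succeeded:
        rewards[-1] += final  # full outcome reward on success...
    else:
        rewards[-1] += final * next_ms / max(len(milestones), 1)  # ...partial scaffold on failure
    return rewards

# Toy usage: states are dicts; milestones check that an app screen was reached.
traj = [{"screen": "home"}, {"screen": "settings"}, {"screen": "wifi"}]
ms = [lambda s: s["screen"] == "settings", lambda s: s["screen"] == "wifi"]
print(milestone_rewards(traj, ms, succeeded=False))  # [0.0, 1.0, 11.0]
```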
The paper introduces PhyNiKCE, a neurosymbolic agentic framework that addresses the limitations of LLMs in autonomous CFD by decoupling neural planning from symbolic validation. PhyNiKCE uses a Symbolic Knowledge Engine to enforce physical constraints via a Deterministic RAG Engine, treating simulation setup as a Constraint Satisfaction Problem. Experiments using OpenFOAM and Gemini-2.5-Pro/Flash demonstrate a 96% improvement over baselines, a 59% reduction in self-correction loops, and a 17% decrease in LLM token consumption.
Introduces PhyNiKCE, a neurosymbolic framework that integrates neural planning with symbolic constraint enforcement to improve the reliability and efficiency of autonomous CFD agents.
This paper introduces Talk2DM, a plug-and-play module designed to enhance vehicle-road-cloud dynamic map (VRC-DM) systems with natural language querying and commonsense reasoning capabilities. To facilitate this, the authors created VRCsim, a VRC cooperative perception simulation framework, and VRC-QA, a question-answering dataset focused on spatial reasoning in mixed-traffic scenarios. Talk2DM leverages a novel chain-of-prompt (CoP) mechanism to integrate human-defined rules with LLM knowledge, achieving high accuracy and reasonable response times with models like Qwen3:8B, Gemma3:27B, and GPT-oss.
Introduces a chain-of-prompting method (CoP) that enables LLMs to effectively query and reason about dynamic maps by combining human-defined rules with the LLM's inherent commonsense knowledge.
This paper introduces a task planning framework that integrates Learning-Informed Object Search (LIOS) actions into high-level planning to address scenarios with missing objects. The framework models LIOS actions as deterministic, leveraging model-based calculations to estimate their cost and interleave search and execution steps. The approach demonstrates effective task planning with uncertainty, outperforming both non-learned and learned baselines in simulated ProcTHOR environments and real-world experiments involving retrieval and meal preparation tasks.
Introduces a novel planning framework that integrates learning-informed object search (LIOS) actions into task planning, enabling effective handling of missing objects by interleaving search and execution.
The paper introduces Execute-Summarize (ES), a framework that decouples task execution from workflow construction in LLMs, addressing the challenge of accurately translating LLM reasoning into structured workflows. ES first completes the task using available tools and then independently reconstructs a structured workflow from execution traces. Experiments on the newly introduced FlowBench demonstrate that ES outperforms existing methods, establishing a more reliable paradigm for grounding free-form LLM reasoning into structured workflows.
Introduces Execute-Summarize (ES), a novel framework that decouples task execution and workflow construction to improve the accuracy and robustness of structured workflow generation from LLM reasoning.
The paper introduces AIR, an incident response framework for LLM agents that enables autonomous detection, containment, and recovery from failures. AIR uses a domain-specific language integrated into the agent's execution loop to perform semantic checks, guide recovery actions, and synthesize guardrail rules. Experiments across three agent types demonstrate that AIR achieves over 90% success rates in detection, remediation, and eradication, highlighting the importance of incident response for agent safety.
Introduces AIR, a novel incident response framework for LLM agents, enabling autonomous management of the incident lifecycle.
The paper investigates how to best pretrain small language models (SLMs) to decide which tokens to predict directly versus delegating to an external source via a special token. They find that loss alone is insufficient for determining optimal delegation, as some high-loss tokens represent acceptable alternative continuations. They introduce LaCy, a pretraining method that uses a spaCy grammar parser to augment the loss signal, enabling SLMs to learn when to delegate and resulting in improved FactScore in cascaded generation setups compared to other methods.
Introduces LaCy, a pretraining method that leverages a spaCy grammar parser to augment the loss signal, enabling SLMs to learn when to delegate token prediction to an external source.
The paper introduces SIGHT, a reinforcement learning framework designed to improve search-based reasoning in LLMs by mitigating redundancy and noise in search results. SIGHT uses Self-Evidence Support (SES) to distill search results into high-fidelity evidence and employs an Information Gain score to identify pivotal states for Dynamic Prompting Interventions like de-duplication and adaptive branching. By integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT achieves superior performance on single-hop and multi-hop QA benchmarks with fewer search steps compared to existing methods.
Introduces a novel reinforcement learning framework, SIGHT, that leverages self-evidence support and information-gain driven diverse branching to enhance search-based reasoning in LLMs.
This paper introduces a spectrum framework for polycentric digital ecosystems, conceptualizing them as nested socio-technical systems across personal, organizational, inter-organizational, and global layers. It addresses the increasing need for resilient digital collaboration amidst geopolitical and technological fragmentation. The framework highlights how AI and automation, blockchain trust, federated data spaces, and immersive technologies can orchestrate digital integration in these ecosystems.
Introduces a multi-layered framework for polycentric digital ecosystems to facilitate collaboration in fragmented environments.
This paper introduces Differentiable Modal Logic (DML) implemented via Modal Logical Neural Networks (MLNNs) to enable multi-agent systems to learn relationships like trust networks and causal chains from behavioral data. DML addresses the limitations of traditional modal logic, which requires manual specification of relationship structures. The authors demonstrate a neurosymbolic debugging framework across epistemic, temporal, deontic, and doxastic modalities, showing how logical contradictions can be formulated as learnable optimization objectives in scenarios ranging from diplomacy games to LLM hallucination detection.
Introduces Differentiable Modal Logic (DML) and Modal Logical Neural Networks (MLNNs) to learn interpretable relationship structures in multi-agent systems directly from data, replacing manual specification.
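To make the idea concrete, here is a small PyTorch sketch of one learnable modal operator: the accessibility relation is a sigmoid-parameterized matrix, and a soft "box" constraint becomes a differentiable loss. The world count, facts, and training objective are our toy setup, not the paper's formulation.

```python
import torch

n_worlds = 4
R_logits = torch.zeros(n_worlds, n_worlds, requires_grad=True)  # learnable accessibility
facts = torch.tensor([1., 0., 1., 1.])  # truth of proposition p at each world

def box_p(R_logits, facts):
    """Soft modal 'box': p holds at w iff p holds at every world w can access.
    1 - R[w, v] * (1 - facts[v]) reads as 'v inaccessible or p true at v'."""
    R = torch.sigmoid(R_logits)
    return torch.min(1 - R * (1 - facts), dim=1).values

# Train R so that box(p) holds at world 0: a logical constraint as a loss.
opt = torch.optim.Adam([R_logits], lr=0.1)
for _ in range(200):
    loss = (1 - box_p(R_logits, facts)[0]) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
print(torch.sigmoid(R_logits)[0].detach())  # world 0 learns not to access world 1
```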
The paper introduces a framework for intelligent AI delegation, enabling AI agents to decompose complex tasks and delegate sub-components to other AI agents or humans. This framework addresses limitations in current task decomposition methods by incorporating elements like authority transfer, accountability, and trust-building. The authors propose an adaptive approach applicable to both AI and human agents within complex delegation networks, contributing to the development of protocols for agentic systems.
Proposes a novel adaptive framework for intelligent AI delegation that incorporates key elements of human delegation such as authority transfer, accountability, and trust.
This paper investigates the impact of communication delays on cooperation in LLM-based multi-agent systems using a Continuous Prisoner's Dilemma. The authors introduce the FLCOA framework to emphasize the importance of lower-layer factors like communication resources in multi-agent cooperation. Their simulations reveal a U-shaped relationship between communication delay and mutual cooperation, where increased delay initially leads to exploitation but excessive delay reduces exploitation cycles.
Demonstrates that communication delays in LLM-based multi-agent systems can significantly impact cooperation, leading to exploitation and a non-monotonic relationship between delay magnitude and mutual cooperation.
This paper introduces MalTool, a framework leveraging coding LLMs to automatically generate malicious tools that can compromise user security and privacy when used by LLM agents. The authors propose a taxonomy of malicious tool behaviors based on the CIA triad and use MalTool to synthesize both standalone malicious tools and real-world tools with embedded malicious behaviors. Experiments demonstrate MalTool's effectiveness in generating malicious tools, even with safety-aligned coding LLMs, and reveal the limitations of existing detection methods, underscoring the need for improved defenses.
Introduces MalTool, a novel framework for automatically generating malicious tools using coding LLMs, enabling a systematic study of malicious tool code implementations and their impact on LLM agent security.
This paper introduces ImagineAgent, a framework that uses cognitive reasoning and generative imagination to improve Open-Vocabulary Human-Object Interaction (OV-HOI) comprehension. ImagineAgent constructs cognitive maps to model relationships between entities and actions, and uses retrieval augmentation, image cropping, and diffusion models to gather knowledge and visual evidence. Experiments on SWIG-HOI and HICO-DET show state-of-the-art performance with significantly less training data.
Introduces ImagineAgent, a novel agentic framework that leverages cognitive maps and generative tools to enhance OV-HOI comprehension by mitigating cross-modal hallucinations and occlusion ambiguity.
The paper introduces CM2, a reinforcement learning framework that utilizes checklist rewards instead of verifiable outcome rewards to train agents for multi-turn, multi-step tool use. CM2 decomposes each turn's behavior into fine-grained binary criteria with evidence grounding, enabling more stable classification-style reward signals. Experiments in an LLM-simulated tool environment demonstrate that CM2 significantly outperforms supervised fine-tuning baselines on benchmarks like τ-Bench, BFCL-V4, and ToolSandbox, achieving comparable or superior performance to similarly sized open-source models.
This paper introduces a novel reinforcement learning framework, CM2, that replaces traditional verifiable rewards with checklist-based rewards for training agents to effectively use tools in multi-turn, multi-step interactions.
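A minimal sketch of a checklist-style reward with evidence grounding, assuming each criterion is a binary judge over the turn's trace that must also cite a non-empty piece of evidence; the dataclass shape and field names are ours.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    """One fine-grained binary check on a single turn, with evidence grounding."""
    name: str
    check: Callable[[dict], bool]   # judge over the turn's trace
    evidence_key: str               # which part of the trace the check must cite

def checklist_reward(turn_trace: dict, criteria: list[Criterion]) -> float:
    """Fraction of criteria satisfied *with* evidence: a dense, classification-
    style signal instead of a sparse end-of-episode outcome reward."""
    passed = 0
    for c in criteria:
        has_evidence = bool(turn_trace.get(c.evidence_key))
        if has_evidence and c.check(turn_trace):
            passed += 1
    return passed / len(criteria)

# Toy usage: did the agent call the right tool with a grounded argument?
trace = {"tool_call": "lookup_order", "tool_args": {"order_id": "A17"}, "reply": ""}
criteria = [
    Criterion("called_lookup", lambda t: t["tool_call"] == "lookup_order", "tool_call"),
    Criterion("passed_order_id", lambda t: "order_id" in t["tool_args"], "tool_args"),
    Criterion("answered_user", lambda t: len(t["reply"]) > 0, "reply"),
]
print(checklist_reward(trace, criteria))  # 2/3: the empty reply fails its check
```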
This paper introduces Counterfactual Conditional Likelihood (CCL) rewards to address redundant exploration in multiagent systems by scoring each agent's unique contribution to team exploration. CCL rewards agents for observations that are informative with respect to the joint exploration of the team, rather than solely for individual novelty. Experiments in continuous multiagent domains demonstrate that CCL accelerates learning in sparse reward environments requiring tight coordination.
Introduces Counterfactual Conditional Likelihood (CCL) rewards to incentivize efficient team exploration by rewarding agents based on their unique contribution to the team's joint exploration.
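One way to read "counterfactual conditional likelihood" is as surprise under a density fit to teammates' observations only; here is a NumPy sketch under that assumption. The KDE, bandwidth, and reward sign convention are our illustration, not the paper's estimator.

```python
import numpy as np

def gaussian_kde_logpdf(x, data, bandwidth=0.5):
    """Log-density of point x under a Gaussian KDE fit to `data` (rows = points)."""
    d = x.shape[0]
    diffs = (data - x) / bandwidth
    log_kernels = -0.5 * np.sum(diffs**2, axis=1) - d * np.log(bandwidth * np.sqrt(2 * np.pi))
    return np.logaddexp.reduce(log_kernels) - np.log(len(data))

def ccl_rewards(agent_obs):
    """Reward each agent for observations that are *unlikely* given only its
    teammates' observations, i.e., for its unique contribution to coverage."""
    rewards = []
    for i, obs_i in enumerate(agent_obs):
        others = np.vstack([o for j, o in enumerate(agent_obs) if j != i])
        rewards.append(-np.mean([gaussian_kde_logpdf(x, others) for x in obs_i]))
    return rewards

# Toy usage: agent 2 explores a region the other two did not.
rng = np.random.default_rng(0)
a0 = rng.normal(0.0, 0.3, size=(20, 2))
a1 = rng.normal(0.1, 0.3, size=(20, 2))
a2 = rng.normal(3.0, 0.3, size=(20, 2))  # unique coverage -> highest reward
print(ccl_rewards([a0, a1, a2]))
```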
This paper introduces Adaptive-RF Transmission (ART), a communication-aware planning algorithm for multi-agent robotic exploration that modulates transmission location based on signal strength and data payload size. ART aims to improve coordination and efficiency in communication-limited environments by enabling heterogeneous robot teams to share information without excessive backtracking. Simulation results across cave-inspired environments show that ART and its extension, ART-SST, outperform existing strategies, achieving significant reductions in distance traveled and exploration time.
Introduces a novel communication-aware planning algorithm, Adaptive-RF Transmission (ART), that dynamically adjusts transmission location based on signal strength and data payload size for efficient multi-agent robotic exploration.
This paper introduces a semi-automated pipeline for extracting Subject-Predicate-Object triplets from financial reports using LLMs, addressing the lack of ground truth data by employing ontology-driven proxy metrics like Ontology Conformance and Faithfulness. The authors compare a manually engineered ontology with a document-specific, automatically induced ontology, finding that the latter achieves 100% schema conformance and eliminates ontology drift. They also propose a hybrid verification strategy combining regex matching and LLM-as-a-judge to reduce subject hallucination rates, and identify asymmetries in subject/object hallucinations.
Introduces a semi-automated pipeline for LLM-based triplet extraction from financial reports evaluated using ontology-driven proxy metrics, circumventing the need for annotated ground truth.
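A sketch of the hybrid verification idea, assuming stage one is a literal, case-insensitive subject match against the report and stage two is an LLM-as-a-judge fallback; the prompt wording and the stub judge are illustrative assumptions.

```python
import re

def verify_triplet(triplet, source_text, llm_judge=None):
    """Cheap regex grounding first; a judge-model call only when it fails."""
    subject, predicate, obj = triplet
    # Stage 1: the subject must literally appear in the source (hallucination guard).
    if re.search(re.escape(subject), source_text, flags=re.IGNORECASE):
        return True
    # Stage 2: fall back to a judge model for paraphrased or inflected mentions.
    if llm_judge is not None:
        prompt = (f"Does the following report mention '{subject}' "
                  f"(possibly under another name)? Answer yes or no.\n\n{source_text}")
        return llm_judge(prompt).strip().lower().startswith("yes")
    return False

# Toy usage with a stub judge.
report = "Acme Corp reported revenue of $1.2B in FY2023."
print(verify_triplet(("Acme Corp", "reported", "revenue of $1.2B"), report))  # True via regex
print(verify_triplet(("ACME Corporation", "reported", "revenue"), report,
                     llm_judge=lambda p: "yes"))                              # True via judge
```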
The paper introduces WebTestPilot, an LLM-based agent for end-to-end web testing against natural language specifications that addresses the challenges of implicit oracle inference and probabilistic reasoning. WebTestPilot uses a symbolization layer to represent GUI elements as symbols and translates natural language into step-by-step instructions with inferred pre- and post-conditions over these symbols, effectively capturing data, temporal, and causal dependencies for validation. Experiments on a new benchmark of bug-injected web applications demonstrate that WebTestPilot achieves a 99% task completion rate with 96% precision and 96% recall in bug detection, significantly outperforming existing LLM-based approaches.
Introduces a novel approach to end-to-end web testing by inferring oracles with symbolized GUI elements, enabling the agent to validate implicit requirements and improve bug detection accuracy.
The paper introduces Any House Any Task (AHAT), a household task planner designed for long-horizon planning in large environments with ambiguous instructions. AHAT trains an LLM to map task instructions and textual scene graphs into PDDL subgoals, which are then solved using symbolic reasoning for optimal plan generation. To improve decomposition of complex intentions, they propose TGPO, a reinforcement learning algorithm integrating external correction of intermediate reasoning traces into Group Relative Policy Optimization (GRPO), leading to significant performance gains.
Introduces a novel household task planner, AHAT, that leverages LLMs and symbolic reasoning with a new reinforcement learning algorithm, TGPO, to achieve superior long-horizon planning performance in complex, ambiguous environments.
This paper proposes a meta-cognitive architecture for AI-driven cybersecurity systems to address limitations in accountable decision-making under adversarial uncertainty. The architecture coordinates heterogeneous AI agents responsible for detection, hypothesis formation, explanation, and governance through an explicit meta-cognitive judgement function. By embedding meta-cognitive judgement as a first-class system function, the framework aims to make the cognitive structure of security operations explicit and governable, shifting the focus from optimizing isolated predictions to governing autonomy under uncertainty.
Introduces a meta-cognitive architectural framework for cybersecurity AI that explicitly governs decision readiness and dynamically calibrates system autonomy under uncertainty by coordinating heterogeneous AI agents through a meta-cognitive judgement function.
The paper introduces AmbiBench, a new benchmark designed to evaluate mobile GUI agents' ability to handle ambiguous instructions and engage in interactive intent alignment, moving beyond the limitations of existing benchmarks that focus on one-shot, complete instructions. The benchmark is structured around a taxonomy of instruction clarity levels (Detailed, Standard, Incomplete, Ambiguous) based on Cognitive Gap theory and includes 240 real-world tasks across 25 applications. The authors also present MUSE, an automated evaluation framework using an MLLM-as-a-judge multi-agent architecture, demonstrating its utility in assessing agent performance across different clarity levels and its correlation with human judgment.
Introduces AmbiBench, a novel benchmark for evaluating mobile GUI agents on their ability to handle ambiguous instructions and engage in interactive intent alignment, along with an automated evaluation framework called MUSE.
The paper introduces LawThinker, a legal reasoning agent designed to improve the accuracy and procedural compliance of legal reasoning in dynamic environments. LawThinker employs an Explore-Verify-Memorize strategy, integrating a DeepVerifier module to assess knowledge accuracy, fact-law relevance, and procedural compliance after each knowledge exploration step. Experiments on the J1-EVAL benchmark demonstrate a 24% improvement over direct reasoning and an 11% improvement over workflow-based methods, along with strong generalization across three static benchmarks.
Introduces an Explore-Verify-Memorize strategy with a DeepVerifier module to enforce verification as an atomic operation after each knowledge exploration step in legal reasoning.
This paper introduces the GUI Agent Autonomy Levels (GAL) framework, a six-level scale for classifying the autonomy of GUI agents interacting with software. The framework aims to clarify the varying degrees of autonomy currently attributed to GUI agents, addressing ambiguity in capability, responsibility, and risk. By providing a standardized benchmark, GAL facilitates progress towards more trustworthy software interaction.
Proposes the GUI Agent Autonomy Levels (GAL) framework to categorize and benchmark the autonomy of GUI agents.
The paper addresses the "Shallow Exploration Trap" in in-context learning, where autoregressive models struggle to generate long reasoning trajectories needed for broader state coverage. They propose Length-Incentivized Exploration (LIE), a reinforcement learning approach that rewards longer reasoning trajectories while penalizing redundancy. Experiments on Qwen3 and Llama models demonstrate that LIE improves in-context exploration, leading to performance gains of 4.4% on in-domain and 2.7% on out-of-domain tasks.
Introduces Length-Incentivized Exploration (LIE), a novel reinforcement learning method to encourage longer and more diverse reasoning trajectories in in-context learning by rewarding length and penalizing redundancy.
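A toy rendering of a length bonus discounted by n-gram redundancy; LIE's actual reward shaping is not specified here, so the formula and constants below are assumptions.

```python
def lie_reward(trajectory_tokens, base_reward, alpha=0.01, beta=0.5, n=4):
    """Pay for longer reasoning trajectories, but discount by n-gram redundancy
    so the policy cannot farm the bonus by looping."""
    ngrams = [tuple(trajectory_tokens[i:i + n])
              for i in range(len(trajectory_tokens) - n + 1)]
    redundancy = 1.0 - len(set(ngrams)) / max(len(ngrams), 1)  # 0 = no repeats
    return base_reward + alpha * len(trajectory_tokens) * (1.0 - beta * redundancy)

# Toy usage: a repetitive trace earns less length bonus than a diverse one.
diverse = list(range(200))
loopy = [0, 1, 2, 3] * 50
print(lie_reward(diverse, base_reward=1.0))  # ~3.0
print(lie_reward(loopy, base_reward=1.0))    # ~2.0: same length, heavy repetition
```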
This paper introduces INTENT, a novel inference-time planning framework for budget-constrained, tool-augmented LLMs that addresses the challenge of costly tool use in sequential decision-making. INTENT uses an intention-aware hierarchical world model to anticipate future tool usage and risk-calibrated costs, enabling more effective online decision-making. Experiments on a cost-augmented StableToolBench demonstrate that INTENT achieves superior task success while strictly adhering to budget constraints, even under dynamic market conditions.
Introduces INTENT, an inference-time planning framework that leverages an intention-aware hierarchical world model for budget-constrained tool use in LLMs.
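As a loose, world-model-free caricature of the budget-constrained decision rule (risk-calibrated expected costs, skip unaffordable tools, maximize net expected value); the fields and numbers below are invented for illustration and omit INTENT's hierarchical anticipation of future tool usage.

```python
def pick_tool(tools, budget, risk=1.0):
    """Choose the tool with the best risk-adjusted net value within budget,
    or None to answer without tools."""
    affordable = [t for t in tools if risk * t["exp_cost"] <= budget]
    if not affordable:
        return None
    return max(affordable, key=lambda t: t["exp_value"] - risk * t["exp_cost"])

tools = [
    {"name": "web_search", "exp_value": 0.6, "exp_cost": 0.10},
    {"name": "db_query",   "exp_value": 0.9, "exp_cost": 0.50},
]
print(pick_tool(tools, budget=0.3))  # db_query exceeds budget -> web_search
```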
The paper introduces STAR, a framework for predicting large language model performance from limited data by combining statistical methods with agentic reasoning. STAR uses specialized retrievers for external knowledge and embeds semantic features into Constrained Probabilistic Matrix Factorization (CPMF) to generate statistical expectations with uncertainty. A reasoning module based on Expectation Violation Theory (EVT) then refines these predictions, achieving a 14.46% improvement over statistical baselines under extreme data sparsity.
Introduces a hybrid framework, STAR, that integrates statistical expectations with agentic reasoning to improve LLM performance prediction, particularly under data sparsity.
The paper introduces Trajectory-Search Rollouts (TSR), a training-time method that uses lightweight tree search to improve the quality of rollouts in multi-turn reinforcement learning for LLM agents. TSR selects high-scoring actions at each turn during rollout generation using task-specific feedback, leading to more informative training trajectories. Experiments on Sokoban, FrozenLake, and WebShop demonstrate that TSR, when combined with PPO and GRPO, achieves up to 15% performance gains and more stable learning.
Introduces a novel training-time trajectory generation method, TSR, that leverages lightweight tree search to construct higher-quality rollouts for multi-turn RL of LLM agents.
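A self-contained sketch of the rollout loop on a toy environment: at each turn, candidate actions are scored on simulated copies and the best is committed to the trajectory. The environment, branching factor, and scorer here stand in for the paper's task-specific feedback.

```python
import copy
import random

class ToyEnv:
    """Minimal 1-D environment: reach position 5 starting from 0."""
    def __init__(self):
        self.state, self.done = 0, False
    def step(self, a):
        self.state += a
        self.done = self.state == 5
    def simulate(self, a):
        nxt = copy.deepcopy(self)
        nxt.step(a)
        return nxt

def tsr_rollout(env, actions, score, horizon=10, branch=3):
    """At each turn, score a few candidate actions on simulated copies and
    commit the best one, yielding a higher-quality training trajectory."""
    trajectory = []
    for _ in range(horizon):
        candidates = random.sample(actions, k=min(branch, len(actions)))
        best = max(candidates, key=lambda a: score(env.simulate(a)))
        env.step(best)
        trajectory.append((best, env.state))
        if env.done:
            break
    return trajectory

random.seed(0)
rollout = tsr_rollout(ToyEnv(), actions=[-1, 0, 1, 2],
                      score=lambda e: -abs(5 - e.state))
print(rollout)  # a short, high-scoring trajectory to feed into PPO/GRPO
```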
The paper introduces Gaia2, a benchmark designed to evaluate LLM agents in dynamic, asynchronous environments where the environment evolves independently of agent actions. Gaia2 features scenarios requiring agents to handle temporal constraints, adapt to noisy events, resolve ambiguity, and collaborate, coupled with write-action verifiers for fine-grained evaluation. Evaluations of state-of-the-art models reveal trade-offs between reasoning, efficiency, and robustness, with GPT-5 achieving the highest overall score (42% pass@1) but struggling with time-sensitive tasks.
Introduces Gaia2, a novel benchmark for evaluating LLM agents in realistic, asynchronous environments with action-level verification.
The paper introduces LAVES, a hierarchical LLM-based multi-agent system that generates high-quality instructional videos from educational problems by decomposing the generation workflow into specialized agents for problem-solving, visualization, and narration. LAVES addresses limitations of end-to-end video generation models in scenarios requiring logical rigor and precise knowledge representation. By constructing a structured, executable video script that is compiled into synchronized visuals and narration, the system achieves a throughput of over one million videos per day with a 95% cost reduction compared to industry standards while maintaining a high acceptance rate.
Introduces a hierarchical LLM-based multi-agent system (LAVES) that decomposes educational video generation into specialized agents, enabling automated end-to-end production with high throughput and cost efficiency.
The paper introduces StateLM, a language model architecture with an internal reasoning loop and memory management tools (context pruning, document indexing, note-taking) that allows it to actively manage its own context. This addresses the limitation of standard LLMs that passively accept a fixed, manually engineered context. Experiments demonstrate that StateLM outperforms standard LLMs on long-document QA, chat memory, and complex research tasks, achieving significant accuracy improvements.
Empowers language models to actively manage their own context by introducing an internal reasoning loop and memory management tools, enabling stateful reasoning.
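A sketch of what the three memory tools could look like as methods the model calls from its inner reasoning loop; the class shape, method names, and character budget are our assumptions, not StateLM's API.

```python
class ManagedContext:
    """Hedged sketch of a self-managed context: the model, not the caller,
    decides what stays in the window, what gets indexed, and what to note."""
    def __init__(self, budget_chars=4000):
        self.window, self.notes, self.index = [], [], {}
        self.budget = budget_chars

    def add(self, text):                      # new evidence enters the window
        self.window.append(text)

    def prune(self, keep_if):                 # tool 1: context pruning
        self.window = [c for c in self.window if keep_if(c)]

    def index_document(self, doc_id, text):   # tool 2: document indexing
        self.index[doc_id] = text             # retrieved later instead of kept inline

    def take_note(self, note):                # tool 3: durable note-taking
        self.notes.append(note)

    def render(self):
        """What actually reaches the model at the next step, within budget."""
        return "\n".join(self.notes + self.window)[-self.budget:]

ctx = ManagedContext()
ctx.add("Chapter 1 ... (long text)")
ctx.index_document("ch1", "Chapter 1 full text")
ctx.take_note("Key fact: the contract renews on 2024-03-01.")
ctx.prune(lambda chunk: "Key" in chunk or len(chunk) < 100)
print(ctx.render())
```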
The paper introduces GameDevBench, a new benchmark for evaluating multimodal agents in game development, a domain requiring complex code manipulation and multimodal asset handling. The benchmark comprises 132 tasks derived from tutorials, demanding significantly more code and file changes compared to existing software development benchmarks. Experiments reveal that current agents struggle with game development tasks, particularly those involving 2D graphics, but performance can be improved by incorporating image and video-based feedback mechanisms.
Introduces GameDevBench, a novel benchmark designed to evaluate and advance multimodal agents in the challenging domain of game development.
This paper examines the shift in software engineering roles due to LLMs' code generation capabilities, arguing that system architecture is becoming the primary unit of engineering value. It uses case studies from the development of two systems, *Gaari* and *The Trail*, to illustrate how the engineering bottleneck is moving from syntax to system design. The paper concludes that modern engineers must transition to a "System Architect" model focused on logic and architecture.
Argues that the core engineering value in LLM-driven development is shifting from syntax to system architecture, requiring engineers to adopt a "System Architect" mindset.
The paper introduces ArtisanGS, an interactive tool suite for selecting and segmenting 3D Gaussian Splats (3DGS) to enable controllable editing of in-the-wild captures. It presents a fast AI-driven method for propagating user-guided 2D selection masks to 3DGS selections, supplemented by manual selection and segmentation tools for user intervention. The toolset's utility is demonstrated through user-guided local editing using a custom Video Diffusion Model, achieving binary segmentation of unstructured 3DGS scenes without additional optimization.
Introduces an interactive tool suite, ArtisanGS, for versatile Gaussian Splat selection and segmentation, enabling user-guided editing via a novel AI-driven propagation method and manual tools.
This paper investigates the "self-evolution trilemma" in multi-agent LLM systems, demonstrating the impossibility of simultaneously achieving continuous self-evolution, complete isolation, and safety invariance. Using an information-theoretic framework, the authors formalize safety as the divergence from anthropic value distributions and prove that isolated self-evolution leads to statistical blind spots, causing irreversible safety degradation. Empirical results from the Moltbook agent community and two closed self-evolving systems validate the theoretical prediction of inevitable safety erosion, highlighting the need for external oversight.
Proves the impossibility of simultaneously achieving continuous self-evolution, complete isolation, and safety invariance in multi-agent LLM systems, formalizing this as the "self-evolution trilemma."
The paper introduces Dreaming in Code (DiCode), a framework that uses foundation models to generate executable environment code variations for curriculum learning in open-ended environments. DiCode addresses the challenge of discovering learnable sequences of experiences in complex environments by "dreaming" code-level variations of the world to scaffold learning. Experiments in the Craftax environment demonstrate that DiCode enables agents to acquire long-horizon skills, achieving a 16% improvement in mean return over the strongest baseline and success on late-game combat tasks where prior methods fail.
Introduces DiCode, a novel framework leveraging foundation models to synthesize executable environment code for curriculum learning, enabling agents to acquire complex skills in open-ended environments.
This paper introduces MemFly, a framework for on-the-fly memory optimization in LLMs based on the information bottleneck principle. MemFly uses a gradient-free optimizer to minimize compression entropy while maximizing relevance entropy, creating a stratified memory structure. The framework incorporates a hybrid retrieval mechanism combining semantic, symbolic, and topological pathways, achieving superior performance in memory coherence, response fidelity, and accuracy compared to existing methods.
Introduces an information bottleneck-based framework, MemFly, for on-the-fly memory optimization in LLMs, enabling efficient compression and precise retrieval.
This paper investigates the applicability of attribution-based explainability methods, commonly used for static classification tasks, to agentic AI systems where behavior emerges over multi-step trajectories. The authors compare attribution-based explanations with trace-based diagnostics in both static classification and agentic benchmarks (TAU-bench Airline and AssistantBench). They find that attribution methods, while stable in static settings, are unreliable for diagnosing execution-level failures in agentic trajectories, whereas trace-grounded rubric evaluation effectively localizes behavior breakdowns.
Demonstrates the limitations of applying attribution-based explainability methods designed for static predictions to agentic AI systems and advocates for trajectory-level explainability.
This paper introduces SparseVideoNav, a novel approach to Beyond-the-View Navigation (BVN) that leverages video generation models to enable agents to navigate using only high-level intents. The key insight is that video generation models inherently benefit from long-horizon supervision, making them well-suited for BVN tasks where agents must locate distant, unseen targets. By generating sparse future trajectories spanning a 20-second horizon, SparseVideoNav achieves a 27x speed-up compared to unoptimized video generation, resulting in a 2.5x improvement in success rate over LLM baselines in real-world zero-shot experiments, including challenging night scenes.
Introduces SparseVideoNav, a novel framework integrating video generation for efficient and effective beyond-the-view vision-language navigation.
The paper introduces AIDE, a dual-stream framework for robots to execute ambiguous instructions in unfamiliar environments by interactively identifying task-relevant objects. AIDE uses Multi-Stage Inference (MSI) for decision-making and Accelerated Decision-Making (ADM) for execution, enabling zero-shot affordance analysis and instruction interpretation. Experiments demonstrate AIDE achieves over 80% task planning success and 95% accuracy in closed-loop execution at 10 Hz, surpassing existing VLM-based methods.
Introduces a dual-stream framework, AIDE, that integrates interactive exploration with vision-language reasoning for robots to execute ambiguous instructions through zero-shot affordance analysis.

