Tool Use & Agents
Capabilities: LLM-based autonomous agents, tool-augmented language models, function calling, and agentic workflows.
Recent Papers
This paper investigates the impact of different LLM-powered AI assistance modalities (Advisor, Coach, Delegate) on human performance in multi-party negotiation games. Participants played bargaining games with access to one of these modalities, all of which were backed by the same underlying LLM. The key finding is a preference-performance misalignment: participants preferred the Advisor but achieved higher individual gains with the Delegate, which acted as a "market maker" by injecting Pareto-improving proposals.
Demonstrates a preference-performance misalignment in AI-assisted negotiation, revealing that users do not always adopt the AI modality that maximizes their gains or overall group welfare.
This paper presents an empirical study of AI coding agent contributions in open-source Android and iOS mobile app development by analyzing 2,901 AI-authored pull requests (PRs) from 193 GitHub repositories. The study reveals that Android projects receive more AI-authored PRs and exhibit higher acceptance rates compared to iOS, with routine tasks showing higher acceptance rates than structural changes. The analysis also indicates an initial improvement followed by a decline in PR resolution time on Android, providing insights into the evolving impact of AI agents on OSS mobile projects.
Empirically characterizes the effects of AI coding agents on open-source Android and iOS mobile app projects by analyzing PR acceptance behaviors across platforms, agents, and task categories.
The paper investigates test-time scaling strategies for web agents in multi-step tasks, finding that uniform scaling saturates quickly and LLM-based arbiters can overrule high-consensus decisions. They demonstrate that uncertainty statistics from the agent's vote distribution correlate with task success, enabling dynamic compute allocation. Based on these findings, they introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are contentious, improving performance and efficiency.
Introduces Confidence-Aware Test-Time Scaling (CATTS), a novel method for dynamically allocating compute to web agents based on vote-derived uncertainty, achieving improved performance and efficiency compared to uniform scaling.
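The summary gives the gist but not the gating rule; below is a minimal Python sketch of confidence-aware scaling, assuming majority voting over sampled candidate actions and a normalized vote-entropy threshold. The function names, committee sizes, and threshold value are our illustration, not CATTS's actual procedure.

```python
import math
import random
from collections import Counter

def vote_entropy(votes):
    """Normalized Shannon entropy of an action-vote distribution (0 = unanimous)."""
    counts = Counter(votes)
    if len(counts) < 2:
        return 0.0
    probs = [c / len(votes) for c in counts.values()]
    return -sum(p * math.log(p) for p in probs) / math.log(len(counts))

def catts_decide(sample_action, base_k=4, extra_k=12, threshold=0.6):
    """Sample a small committee first; spend extra compute only when contentious."""
    votes = [sample_action() for _ in range(base_k)]
    if vote_entropy(votes) > threshold:
        votes += [sample_action() for _ in range(extra_k)]
    return Counter(votes).most_common(1)[0][0]

# Toy usage: a stochastic policy over two candidate web actions.
random.seed(0)
print(catts_decide(lambda: random.choice(["click_submit", "scroll_down"])))
```

The design point is that the vote distribution itself is the uncertainty estimate, so no extra arbiter model is needed to decide when to scale.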
The paper introduces KeplerAgent, an LLM-based agent designed for symbolic equation discovery that mimics the scientific reasoning process of inferring physical properties before guessing equations. KeplerAgent coordinates physics-based tools to extract intermediate structure from data and uses this information to configure symbolic regression engines like PySINDy and PySR. Experiments on physical equation benchmarks demonstrate that KeplerAgent achieves significantly higher symbolic accuracy and robustness to noisy data compared to existing LLM and traditional baselines.
Introduces KeplerAgent, an agentic framework that enhances symbolic equation discovery by explicitly modeling the scientific reasoning process of inferring physical properties and using them to constrain the search space of candidate equations.
This paper introduces General Utility Markov Games (GUMGs), an extension of Convex Markov Games (cMGs) that allows for coupling between agents' occupancy measures, and proves that Nash equilibria in GUMGs coincide with fixed points of projected pseudo-gradient dynamics due to a novel agent-wise gradient domination property. Leveraging this characterization, the authors provide a simplified proof of Nash equilibrium existence, demonstrate the existence of Markov perfect equilibria, and derive a policy gradient theorem for GUMGs. Furthermore, they establish iteration and sample complexity guarantees for computing approximate-NE in potential GUMGs using policy gradient methods.
Establishes a novel agent-wise gradient domination property in General Utility Markov Games (GUMGs), enabling a characterization of Nash equilibria as fixed points of projected pseudo-gradient dynamics and facilitating the design and analysis of policy gradient algorithms.
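The fixed-point characterization can be stated compactly. A hedged sketch in standard notation follows; the symbols and the maximization convention are our choices, and the paper's agent-wise gradient domination property is what upgrades such stationary points to Nash equilibria.

```latex
% Sketch only: notation is ours, not the paper's. U_i is agent i's general
% utility over the joint occupancy measure; F stacks own-policy gradients.
F(\pi) = \big(\nabla_{\pi_1} U_1(\pi), \dots, \nabla_{\pi_n} U_n(\pi)\big),
\qquad
\pi^{\star} = \mathrm{Proj}_{\Pi}\!\big(\pi^{\star} + \eta\, F(\pi^{\star})\big)
\ \text{for some step size } \eta > 0 .
```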
The authors introduce Text2GQL-Bench, a new benchmark for text-to-graph query language translation, comprising 178,184 question-query pairs across 13 domains and supporting multiple graph query languages. They also present a comprehensive evaluation method that assesses grammatical validity, similarity, semantic alignment, and execution accuracy, moving beyond simple end-to-end metrics. Experiments reveal a significant "dialect gap" in ISO-GQL generation, where even strong LLMs struggle in zero-shot settings but improve substantially with few-shot prompting or fine-tuning.
Introduces a unified benchmark, Text2GQL-Bench, for evaluating text-to-graph query language systems, featuring a multi-GQL dataset and a scalable construction framework.
The paper introduces ModelWisdom, a toolkit designed to enhance the interpretability and usability of TLA+ model checking by addressing challenges in counterexample analysis and model repair. ModelWisdom integrates visualization techniques, graph optimization, LLM-based summarization, and automated repair suggestions to improve the debugging process. The toolkit's capabilities, including colorized violation highlighting, graph folding, and LLM-powered explanations, facilitate a more interactive and understandable workflow for TLA+ specifications.
Introduces an interactive environment, ModelWisdom, that leverages visualization and large language models to improve the interpretability and actionability of TLA+ model checking.
This paper investigates the effectiveness of repository-level context files (e.g., AGENTS.md) in improving the performance of coding agents on software development tasks. Through experiments on SWE-bench tasks with LLM-generated context files and a novel dataset of issues from repositories with developer-committed context files, the authors find that context files generally decrease task success rates and increase inference costs. They attribute this to unnecessary constraints imposed by the context files, suggesting that human-written context files should be minimal.
Empirically demonstrates that repository-level context files, both LLM-generated and human-written, can hinder the performance of coding agents on software development tasks.
This paper introduces the concept of human-LLM archetypes, defined as recurring socio-technical interaction patterns that structure the roles of humans and LLMs in collaborative decision-making. Through a scoping literature review and thematic analysis of 113 papers, the authors identified 17 distinct human-LLM archetypes. They then evaluated these archetypes across clinical diagnostic cases, demonstrating that the choice of archetype influences LLM outputs and decision outcomes.
Defines and categorizes 17 human-LLM interaction archetypes to demonstrate how these archetypes impact LLM outputs and decisions in human-AI collaborative decision-making.
The paper addresses the computational inefficiency of evolutionary AI agents that repeatedly invoke LLMs by proposing AdaptEvolve, a framework for adaptive LLM selection during evolutionary refinement. AdaptEvolve uses intrinsic generation confidence to estimate real-time solvability and dynamically selects an LLM appropriate for the current generation step. Experiments demonstrate that confidence-driven selection achieves a better Pareto frontier, reducing inference costs by 37.9% while maintaining 97.5% of the accuracy of static large models.
Introduces AdaptEvolve, a novel adaptive LLM selection framework for evolutionary AI agents that leverages intrinsic generation confidence to dynamically choose the most efficient LLM for each generation step.
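As a rough illustration of confidence-driven routing, here is a sketch that maps mean token log-probability to a confidence score and escalates to the larger model below a threshold. The proxy, the threshold, and the names are assumptions, not AdaptEvolve's actual estimator.

```python
import math

def generation_confidence(token_logprobs):
    """Mean token log-probability mapped to (0, 1); a common intrinsic-confidence proxy."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def select_model(prev_logprobs, small_model, large_model, tau=0.75):
    """Route the next refinement step to the cheap model when the previous
    generation looked confidently solvable; otherwise escalate."""
    if prev_logprobs and generation_confidence(prev_logprobs) >= tau:
        return small_model
    return large_model

# Toy usage with per-token logprobs from a previous generation step.
print(select_model([-0.1, -0.2, -0.05], "small-llm", "large-llm"))  # -> small-llm
print(select_model([-2.3, -1.8, -2.0], "small-llm", "large-llm"))   # -> large-llm
```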
The paper addresses the challenge of sparse rewards in Reinforcement Learning for GUI agents by introducing Adaptive Milestone Reward (ADMIRE), a mechanism that dynamically distills milestones from successful explorations to provide verifiable, adaptive rewards. ADMIRE employs an asymmetric credit assignment strategy to denoise successful trajectories and scaffold failed ones, effectively balancing reward fidelity and density. Experiments on AndroidWorld demonstrate over 10% improvement in success rate across different base models, with strong generalizability observed in web navigation and embodied tasks.
Introduces ADMIRE, an adaptive milestone reward mechanism with asymmetric credit assignment, to improve temporal credit assignment in long-horizon GUI agent tasks.
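A toy sketch of milestone-shaped rewards with a loose rendering of the asymmetry (dense milestone bonuses everywhere, partial outcome credit only for failed trajectories). The predicates, the in-order constraint, and the constants are illustrative assumptions, not ADMIRE's actual mechanism.

```python
def milestone_rewards(trajectory, milestones, succeeded, bonus=1.0, final=10.0):
    """Hedged sketch: milestones are predicates over states, distilled offline
    from earlier successful explorations (the distillation is the paper's)."""
    rewards = [0.0] * len(trajectory)
    next_ms = 0  # milestones must be hit in order
    for t, state in enumerate(trajectory):
        if next_ms < len(milestones) and milestones[next_ms](state):
            rewards[t] += bonus  # dense, verifiable intermediate reward
            next_ms += 1
    if succeeded:
        rewards[-1] += final  # full outcome reward on success...
    else:
        rewards[-1] += final * next_ms / max(len(milestones), 1)  # ...partial scaffold on failure
    return rewards

# Toy usage: states are dicts; milestones check that an app screen was reached.
traj = [{"screen": "home"}, {"screen": "settings"}, {"screen": "wifi"}]
ms = [lambda s: s["screen"] == "settings", lambda s: s["screen"] == "wifi"]
print(milestone_rewards(traj, ms, succeeded=False))  # [0.0, 1.0, 11.0]
```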
The paper introduces PhyNiKCE, a neurosymbolic agentic framework that addresses the limitations of LLMs in autonomous CFD by decoupling neural planning from symbolic validation. PhyNiKCE uses a Symbolic Knowledge Engine to enforce physical constraints via a Deterministic RAG Engine, treating simulation setup as a Constraint Satisfaction Problem. Experiments using OpenFOAM and Gemini-2.5-Pro/Flash demonstrate a 96% improvement over baselines, a 59% reduction in self-correction loops, and a 17% decrease in LLM token consumption.
Introduces PhyNiKCE, a neurosymbolic framework that integrates neural planning with symbolic constraint enforcement to improve the reliability and efficiency of autonomous CFD agents.
This paper introduces Talk2DM, a plug-and-play module designed to enhance vehicle-road-cloud dynamic map (VRC-DM) systems with natural language querying and commonsense reasoning capabilities. To facilitate this, the authors created VRCsim, a VRC cooperative perception simulation framework, and VRC-QA, a question-answering dataset focused on spatial reasoning in mixed-traffic scenarios. Talk2DM leverages a novel chain-of-prompt (CoP) mechanism to integrate human-defined rules with LLM knowledge, achieving high accuracy and reasonable response times with models like Qwen3:8B, Gemma3:27B, and GPT-oss.
Introduces a chain-of-prompting method (CoP) that enables LLMs to effectively query and reason about dynamic maps by combining human-defined rules with the LLM's inherent commonsense knowledge.
This paper introduces a task planning framework that integrates Learning-Informed Object Search (LIOS) actions into high-level planning to address scenarios with missing objects. The framework models LIOS actions as deterministic, leveraging model-based calculations to estimate their cost and interleave search and execution steps. The approach demonstrates effective task planning with uncertainty, outperforming both non-learned and learned baselines in simulated ProcTHOR environments and real-world experiments involving retrieval and meal preparation tasks.
Introduces a novel planning framework that integrates learning-informed object search (LIOS) actions into task planning, enabling effective handling of missing objects by interleaving search and execution.
The paper introduces Execute-Summarize (ES), a framework that decouples task execution from workflow construction in LLMs, addressing the challenge of accurately translating LLM reasoning into structured workflows. ES first completes the task using available tools and then independently reconstructs a structured workflow from execution traces. Experiments on the newly introduced FlowBench demonstrate that ES outperforms existing methods, establishing a more reliable paradigm for grounding free-form LLM reasoning into structured workflows.
Introduces Execute-Summarize (ES), a novel framework that decouples task execution and workflow construction to improve the accuracy and robustness of structured workflow generation from LLM reasoning.
The paper introduces AIR, an incident response framework for LLM agents that enables autonomous detection, containment, and recovery from failures. AIR uses a domain-specific language integrated into the agent's execution loop to perform semantic checks, guide recovery actions, and synthesize guardrail rules. Experiments across three agent types demonstrate that AIR achieves over 90% success rates in detection, remediation, and eradication, highlighting the importance of incident response for agent safety.
Introduces AIR, a novel incident response framework for LLM agents, enabling autonomous management of the incident lifecycle.
The paper investigates how to best pretrain small language models (SLMs) to decide which tokens to predict directly versus delegating to an external source via a special token. They find that loss alone is insufficient for determining optimal delegation, as some high-loss tokens represent acceptable alternative continuations. They introduce LaCy, a pretraining method that uses a spaCy grammar parser to augment the loss signal, enabling SLMs to learn when to delegate and resulting in improved FactScore in cascaded generation setups compared to other methods.
Introduces LaCy, a pretraining method that leverages a spaCy grammar parser to augment the loss signal, enabling SLMs to learn when to delegate token prediction to an external source.
The paper introduces SIGHT, a reinforcement learning framework designed to improve search-based reasoning in LLMs by mitigating redundancy and noise in search results. SIGHT uses Self-Evidence Support (SES) to distill search results into high-fidelity evidence and employs an Information Gain score to identify pivotal states for Dynamic Prompting Interventions like de-duplication and adaptive branching. By integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT achieves superior performance on single-hop and multi-hop QA benchmarks with fewer search steps compared to existing methods.
Introduces a novel reinforcement learning framework, SIGHT, that leverages self-evidence support and information-gain driven diverse branching to enhance search-based reasoning in LLMs.
This paper introduces a spectrum framework for polycentric digital ecosystems, conceptualizing them as nested socio-technical systems across personal, organizational, inter-organizational, and global layers. It addresses the increasing need for resilient digital collaboration amidst geopolitical and technological fragmentation. The framework highlights how AI and automation, blockchain trust, federated data spaces, and immersive technologies can orchestrate digital integration in these ecosystems.
Introduces a multi-layered framework for polycentric digital ecosystems to facilitate collaboration in fragmented environments.
This paper introduces Differentiable Modal Logic (DML) implemented via Modal Logical Neural Networks (MLNNs) to enable multi-agent systems to learn relationships like trust networks and causal chains from behavioral data. DML addresses the limitations of traditional modal logic, which requires manual specification of relationship structures. The authors demonstrate a neurosymbolic debugging framework across epistemic, temporal, deontic, and doxastic modalities, showing how logical contradictions can be formulated as learnable optimization objectives in scenarios ranging from diplomacy games to LLM hallucination detection.
Introduces Differentiable Modal Logic (DML) and Modal Logical Neural Networks (MLNNs) to learn interpretable relationship structures in multi-agent systems directly from data, replacing manual specification.
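To make the idea concrete, here is a small PyTorch sketch of one learnable modal operator: the accessibility relation is a sigmoid-parameterized matrix, and a soft "box" constraint becomes a differentiable loss. The world count, facts, and training objective are our toy setup, not the paper's formulation.

```python
import torch

n_worlds = 4
R_logits = torch.zeros(n_worlds, n_worlds, requires_grad=True)  # learnable accessibility
facts = torch.tensor([1., 0., 1., 1.])  # truth of proposition p at each world

def box_p(R_logits, facts):
    """Soft modal 'box': p holds at w iff p holds at every world w can access.
    1 - R[w, v] * (1 - facts[v]) reads as 'v inaccessible or p true at v'."""
    R = torch.sigmoid(R_logits)
    return torch.min(1 - R * (1 - facts), dim=1).values

# Train R so that box(p) holds at world 0: a logical constraint as a loss.
opt = torch.optim.Adam([R_logits], lr=0.1)
for _ in range(200):
    loss = (1 - box_p(R_logits, facts)[0]) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
print(torch.sigmoid(R_logits)[0].detach())  # world 0 learns not to access world 1
```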
The paper introduces a framework for intelligent AI delegation, enabling AI agents to decompose complex tasks and delegate sub-components to other AI agents or humans. This framework addresses limitations in current task decomposition methods by incorporating elements like authority transfer, accountability, and trust-building. The authors propose an adaptive approach applicable to both AI and human agents within complex delegation networks, contributing to the development of protocols for agentic systems.
Proposes a novel adaptive framework for intelligent AI delegation that incorporates key elements of human delegation such as authority transfer, accountability, and trust.
This paper investigates the impact of communication delays on cooperation in LLM-based multi-agent systems using a Continuous Prisoner's Dilemma. The authors introduce the FLCOA framework to emphasize the importance of lower-layer factors like communication resources in multi-agent cooperation. Their simulations reveal a U-shaped relationship between communication delay and mutual cooperation, where increased delay initially leads to exploitation but excessive delay reduces exploitation cycles.
Demonstrates that communication delays in LLM-based multi-agent systems can significantly impact cooperation, leading to exploitation and a non-monotonic relationship between delay magnitude and mutual cooperation.
This paper introduces MalTool, a framework leveraging coding LLMs to automatically generate malicious tools that can compromise user security and privacy when used by LLM agents. The authors propose a taxonomy of malicious tool behaviors based on the CIA triad and use MalTool to synthesize both standalone malicious tools and real-world tools with embedded malicious behaviors. Experiments demonstrate MalTool's effectiveness in generating malicious tools, even with safety-aligned coding LLMs, and reveal the limitations of existing detection methods, underscoring the need for improved defenses.
Introduces MalTool, a novel framework for automatically generating malicious tools using coding LLMs, enabling a systematic study of malicious tool code implementations and their impact on LLM agent security.
This paper introduces ImagineAgent, a framework that uses cognitive reasoning and generative imagination to improve Open-Vocabulary Human-Object Interaction (OV-HOI) comprehension. ImagineAgent constructs cognitive maps to model relationships between entities and actions, and uses retrieval augmentation, image cropping, and diffusion models to gather knowledge and visual evidence. Experiments on SWIG-HOI and HICO-DET show state-of-the-art performance with significantly less training data.
Introduces ImagineAgent, a novel agentic framework that leverages cognitive maps and generative tools to enhance OV-HOI comprehension by mitigating cross-modal hallucinations and occlusion ambiguity.
The paper introduces CM2, a reinforcement learning framework that utilizes checklist rewards instead of verifiable outcome rewards to train agents for multi-turn, multi-step tool use. CM2 decomposes each turn's behavior into fine-grained binary criteria with evidence grounding, enabling more stable classification-style reward signals. Experiments in an LLM-simulated tool environment demonstrate that CM2 significantly outperforms supervised fine-tuning baselines on benchmarks like τ-Bench, BFCL-V4, and ToolSandbox, achieving comparable or superior performance to similarly sized open-source models.
This paper introduces a novel reinforcement learning framework, CM2, that replaces traditional verifiable rewards with checklist-based rewards for training agents to effectively use tools in multi-turn, multi-step interactions.
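A minimal sketch of a checklist-style reward with evidence grounding, assuming each criterion is a binary judge over the turn's trace that must also cite a non-empty piece of evidence; the dataclass shape and field names are ours.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    """One fine-grained binary check on a single turn, with evidence grounding."""
    name: str
    check: Callable[[dict], bool]   # judge over the turn's trace
    evidence_key: str               # which part of the trace the check must cite

def checklist_reward(turn_trace: dict, criteria: list[Criterion]) -> float:
    """Fraction of criteria satisfied *with* evidence: a dense, classification-
    style signal instead of a sparse end-of-episode outcome reward."""
    passed = 0
    for c in criteria:
        has_evidence = bool(turn_trace.get(c.evidence_key))
        if has_evidence and c.check(turn_trace):
            passed += 1
    return passed / len(criteria)

# Toy usage: did the agent call the right tool with a grounded argument?
trace = {"tool_call": "lookup_order", "tool_args": {"order_id": "A17"}, "reply": ""}
criteria = [
    Criterion("called_lookup", lambda t: t["tool_call"] == "lookup_order", "tool_call"),
    Criterion("passed_order_id", lambda t: "order_id" in t["tool_args"], "tool_args"),
    Criterion("answered_user", lambda t: len(t["reply"]) > 0, "reply"),
]
print(checklist_reward(trace, criteria))  # 2/3: the empty reply fails its check
```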
This paper introduces Counterfactual Conditional Likelihood (CCL) rewards to address redundant exploration in multiagent systems by scoring each agent's unique contribution to team exploration. CCL rewards agents for observations that are informative with respect to the joint exploration of the team, rather than solely for individual novelty. Experiments in continuous multiagent domains demonstrate that CCL accelerates learning in sparse reward environments requiring tight coordination.
Introduces Counterfactual Conditional Likelihood (CCL) rewards to incentivize efficient team exploration by rewarding agents based on their unique contribution to the team's joint exploration.
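One way to read "counterfactual conditional likelihood" is as surprise under a density fit to teammates' observations only; here is a NumPy sketch under that assumption. The KDE, bandwidth, and reward sign convention are our illustration, not the paper's estimator.

```python
import numpy as np

def gaussian_kde_logpdf(x, data, bandwidth=0.5):
    """Log-density of point x under a Gaussian KDE fit to `data` (rows = points)."""
    d = x.shape[0]
    diffs = (data - x) / bandwidth
    log_kernels = -0.5 * np.sum(diffs**2, axis=1) - d * np.log(bandwidth * np.sqrt(2 * np.pi))
    return np.logaddexp.reduce(log_kernels) - np.log(len(data))

def ccl_rewards(agent_obs):
    """Reward each agent for observations that are *unlikely* given only its
    teammates' observations, i.e., for its unique contribution to coverage."""
    rewards = []
    for i, obs_i in enumerate(agent_obs):
        others = np.vstack([o for j, o in enumerate(agent_obs) if j != i])
        rewards.append(-np.mean([gaussian_kde_logpdf(x, others) for x in obs_i]))
    return rewards

# Toy usage: agent 2 explores a region the other two did not.
rng = np.random.default_rng(0)
a0 = rng.normal(0.0, 0.3, size=(20, 2))
a1 = rng.normal(0.1, 0.3, size=(20, 2))
a2 = rng.normal(3.0, 0.3, size=(20, 2))  # unique coverage -> highest reward
print(ccl_rewards([a0, a1, a2]))
```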
This paper introduces Adaptive-RF Transmission (ART), a communication-aware planning algorithm for multi-agent robotic exploration that modulates transmission location based on signal strength and data payload size. ART aims to improve coordination and efficiency in communication-limited environments by enabling heterogeneous robot teams to share information without excessive backtracking. Simulation results across cave-inspired environments show that ART and its extension, ART-SST, outperform existing strategies, achieving significant reductions in distance traveled and exploration time.
Introduces a novel communication-aware planning algorithm, Adaptive-RF Transmission (ART), that dynamically adjusts transmission location based on signal strength and data payload size for efficient multi-agent robotic exploration.
This paper introduces a semi-automated pipeline for extracting Subject-Predicate-Object triplets from financial reports using LLMs, addressing the lack of ground truth data by employing ontology-driven proxy metrics like Ontology Conformance and Faithfulness. The authors compare a manually engineered ontology with a document-specific, automatically induced ontology, finding that the latter achieves 100% schema conformance and eliminates ontology drift. They also propose a hybrid verification strategy combining regex matching and LLM-as-a-judge to reduce subject hallucination rates, and identify asymmetries in subject/object hallucinations.
Introduces a semi-automated pipeline for LLM-based triplet extraction from financial reports evaluated using ontology-driven proxy metrics, circumventing the need for annotated ground truth.
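A sketch of the hybrid verification idea, assuming stage one is a literal, case-insensitive subject match against the report and stage two is an LLM-as-a-judge fallback; the prompt wording and the stub judge are illustrative assumptions.

```python
import re

def verify_triplet(triplet, source_text, llm_judge=None):
    """Cheap regex grounding first; a judge-model call only when it fails."""
    subject, predicate, obj = triplet
    # Stage 1: the subject must literally appear in the source (hallucination guard).
    if re.search(re.escape(subject), source_text, flags=re.IGNORECASE):
        return True
    # Stage 2: fall back to a judge model for paraphrased or inflected mentions.
    if llm_judge is not None:
        prompt = (f"Does the following report mention '{subject}' "
                  f"(possibly under another name)? Answer yes or no.\n\n{source_text}")
        return llm_judge(prompt).strip().lower().startswith("yes")
    return False

# Toy usage with a stub judge.
report = "Acme Corp reported revenue of $1.2B in FY2023."
print(verify_triplet(("Acme Corp", "reported", "revenue of $1.2B"), report))  # True via regex
print(verify_triplet(("ACME Corporation", "reported", "revenue"), report,
                     llm_judge=lambda p: "yes"))                              # True via judge
```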
The paper introduces WebTestPilot, an LLM-based agent for end-to-end web testing against natural language specifications that addresses the challenges of implicit oracle inference and probabilistic reasoning. WebTestPilot uses a symbolization layer to represent GUI elements as symbols and translates natural language into step-by-step instructions with inferred pre- and post-conditions over these symbols, effectively capturing data, temporal, and causal dependencies for validation. Experiments on a new benchmark of bug-injected web applications demonstrate that WebTestPilot achieves a 99% task completion rate with 96% precision and 96% recall in bug detection, significantly outperforming existing LLM-based approaches.
Introduces a novel approach to end-to-end web testing by inferring oracles with symbolized GUI elements, enabling the agent to validate implicit requirements and improve bug detection accuracy.
The paper introduces Any House Any Task (AHAT), a household task planner designed for long-horizon planning in large environments with ambiguous instructions. AHAT trains an LLM to map task instructions and textual scene graphs into PDDL subgoals, which are then solved using symbolic reasoning for optimal plan generation. To improve decomposition of complex intentions, they propose TGPO, a reinforcement learning algorithm integrating external correction of intermediate reasoning traces into Group Relative Policy Optimization (GRPO), leading to significant performance gains.
Introduces a novel household task planner, AHAT, that leverages LLMs and symbolic reasoning with a new reinforcement learning algorithm, TGPO, to achieve superior long-horizon planning performance in complex, ambiguous environments.
This paper proposes a meta-cognitive architecture for AI-driven cybersecurity systems to address limitations in accountable decision-making under adversarial uncertainty. The architecture coordinates heterogeneous AI agents responsible for detection, hypothesis formation, explanation, and governance through an explicit meta-cognitive judgement function. By embedding meta-cognitive judgement as a first-class system function, the framework aims to make the cognitive structure of security operations explicit and governable, shifting the focus from optimizing isolated predictions to governing autonomy under uncertainty.
Introduces a meta-cognitive architectural framework for cybersecurity AI that explicitly governs decision readiness and dynamically calibrates system autonomy under uncertainty by coordinating heterogeneous AI agents through a meta-cognitive judgement function.
The paper introduces AmbiBench, a new benchmark designed to evaluate mobile GUI agents' ability to handle ambiguous instructions and engage in interactive intent alignment, moving beyond the limitations of existing benchmarks that focus on one-shot, complete instructions. The benchmark is structured around a taxonomy of instruction clarity levels (Detailed, Standard, Incomplete, Ambiguous) based on Cognitive Gap theory and includes 240 real-world tasks across 25 applications. The authors also present MUSE, an automated evaluation framework using an MLLM-as-a-judge multi-agent architecture, demonstrating its utility in assessing agent performance across different clarity levels and its correlation with human judgment.
Introduces AmbiBench, a novel benchmark for evaluating mobile GUI agents on their ability to handle ambiguous instructions and engage in interactive intent alignment, along with an automated evaluation framework called MUSE.
The paper introduces LawThinker, a legal reasoning agent designed to improve the accuracy and procedural compliance of legal reasoning in dynamic environments. LawThinker employs an Explore-Verify-Memorize strategy, integrating a DeepVerifier module to assess knowledge accuracy, fact-law relevance, and procedural compliance after each knowledge exploration step. Experiments on the J1-EVAL benchmark demonstrate a 24% improvement over direct reasoning and an 11% improvement over workflow-based methods, along with strong generalization across three static benchmarks.
Introduces an Explore-Verify-Memorize strategy with a DeepVerifier module to enforce verification as an atomic operation after each knowledge exploration step in legal reasoning.
This paper introduces the GUI Agent Autonomy Levels (GAL) framework, a six-level scale for classifying the autonomy of GUI agents interacting with software. The framework aims to clarify the varying degrees of autonomy currently attributed to GUI agents, addressing ambiguity in capability, responsibility, and risk. By providing a standardized benchmark, GAL facilitates progress towards more trustworthy software interaction.
Proposes the GUI Agent Autonomy Levels (GAL) framework to categorize and benchmark the autonomy of GUI agents.
The paper addresses the "Shallow Exploration Trap" in in-context learning, where autoregressive models struggle to generate long reasoning trajectories needed for broader state coverage. They propose Length-Incentivized Exploration (LIE), a reinforcement learning approach that rewards longer reasoning trajectories while penalizing redundancy. Experiments on Qwen3 and Llama models demonstrate that LIE improves in-context exploration, leading to performance gains of 4.4% on in-domain and 2.7% on out-of-domain tasks.
Introduces Length-Incentivized Exploration (LIE), a novel reinforcement learning method to encourage longer and more diverse reasoning trajectories in in-context learning by rewarding length and penalizing redundancy.
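A toy rendering of a length bonus discounted by n-gram redundancy; LIE's actual reward shaping is not specified here, so the formula and constants below are assumptions.

```python
def lie_reward(trajectory_tokens, base_reward, alpha=0.01, beta=0.5, n=4):
    """Pay for longer reasoning trajectories, but discount by n-gram redundancy
    so the policy cannot farm the bonus by looping."""
    ngrams = [tuple(trajectory_tokens[i:i + n])
              for i in range(len(trajectory_tokens) - n + 1)]
    redundancy = 1.0 - len(set(ngrams)) / max(len(ngrams), 1)  # 0 = no repeats
    return base_reward + alpha * len(trajectory_tokens) * (1.0 - beta * redundancy)

# Toy usage: a repetitive trace earns less length bonus than a diverse one.
diverse = list(range(200))
loopy = [0, 1, 2, 3] * 50
print(lie_reward(diverse, base_reward=1.0))  # ~3.0
print(lie_reward(loopy, base_reward=1.0))    # ~2.0: same length, heavy repetition
```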
This paper introduces INTENT, a novel inference-time planning framework for budget-constrained, tool-augmented LLMs that addresses the challenge of costly tool use in sequential decision-making. INTENT uses an intention-aware hierarchical world model to anticipate future tool usage and risk-calibrated costs, enabling more effective online decision-making. Experiments on a cost-augmented StableToolBench demonstrate that INTENT achieves superior task success while strictly adhering to budget constraints, even under dynamic market conditions.
Introduces INTENT, an inference-time planning framework that leverages an intention-aware hierarchical world model for budget-constrained tool use in LLMs.
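As a loose, world-model-free caricature of the budget-constrained decision rule (risk-calibrated expected costs, skip unaffordable tools, maximize net expected value); the fields and numbers below are invented for illustration and omit INTENT's hierarchical anticipation of future tool usage.

```python
def pick_tool(tools, budget, risk=1.0):
    """Choose the tool with the best risk-adjusted net value within budget,
    or None to answer without tools."""
    affordable = [t for t in tools if risk * t["exp_cost"] <= budget]
    if not affordable:
        return None
    return max(affordable, key=lambda t: t["exp_value"] - risk * t["exp_cost"])

tools = [
    {"name": "web_search", "exp_value": 0.6, "exp_cost": 0.10},
    {"name": "db_query",   "exp_value": 0.9, "exp_cost": 0.50},
]
print(pick_tool(tools, budget=0.3))  # db_query exceeds budget -> web_search
```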
The paper introduces STAR, a framework for predicting large language model performance from limited data by combining statistical methods with agentic reasoning. STAR uses specialized retrievers for external knowledge and embeds semantic features into Constrained Probabilistic Matrix Factorization (CPMF) to generate statistical expectations with uncertainty. A reasoning module based on Expectation Violation Theory (EVT) then refines these predictions, achieving a 14.46% improvement over statistical baselines under extreme data sparsity.
Introduces a hybrid framework, STAR, that integrates statistical expectations with agentic reasoning to improve LLM performance prediction, particularly under data sparsity.
The paper introduces Trajectory-Search Rollouts (TSR), a training-time method that uses lightweight tree search to improve the quality of rollouts in multi-turn reinforcement learning for LLM agents. TSR selects high-scoring actions at each turn during rollout generation using task-specific feedback, leading to more informative training trajectories. Experiments on Sokoban, FrozenLake, and WebShop demonstrate that TSR, when combined with PPO and GRPO, achieves up to 15% performance gains and more stable learning.
Introduces a novel training-time trajectory generation method, TSR, that leverages lightweight tree search to construct higher-quality rollouts for multi-turn RL of LLM agents.
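A self-contained sketch of the rollout loop on a toy environment: at each turn, candidate actions are scored on simulated copies and the best is committed to the trajectory. The environment, branching factor, and scorer here stand in for the paper's task-specific feedback.

```python
import copy
import random

class ToyEnv:
    """Minimal 1-D environment: reach position 5 starting from 0."""
    def __init__(self):
        self.state, self.done = 0, False
    def step(self, a):
        self.state += a
        self.done = self.state == 5
    def simulate(self, a):
        nxt = copy.deepcopy(self)
        nxt.step(a)
        return nxt

def tsr_rollout(env, actions, score, horizon=10, branch=3):
    """At each turn, score a few candidate actions on simulated copies and
    commit the best one, yielding a higher-quality training trajectory."""
    trajectory = []
    for _ in range(horizon):
        candidates = random.sample(actions, k=min(branch, len(actions)))
        best = max(candidates, key=lambda a: score(env.simulate(a)))
        env.step(best)
        trajectory.append((best, env.state))
        if env.done:
            break
    return trajectory

random.seed(0)
rollout = tsr_rollout(ToyEnv(), actions=[-1, 0, 1, 2],
                      score=lambda e: -abs(5 - e.state))
print(rollout)  # a short, high-scoring trajectory to feed into PPO/GRPO
```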
The paper introduces Gaia2, a benchmark designed to evaluate LLM agents in dynamic, asynchronous environments where the environment evolves independently of agent actions. Gaia2 features scenarios requiring agents to handle temporal constraints, adapt to noisy events, resolve ambiguity, and collaborate, coupled with write-action verifiers for fine-grained evaluation. Evaluations of state-of-the-art models reveal trade-offs between reasoning, efficiency, and robustness, with GPT-5 achieving the highest overall score (42% pass@1) but struggling with time-sensitive tasks.
Introduces Gaia2, a novel benchmark for evaluating LLM agents in realistic, asynchronous environments with action-level verification.
The paper introduces LAVES, a hierarchical LLM-based multi-agent system that generates high-quality instructional videos from educational problems by decomposing the generation workflow into specialized agents for problem-solving, visualization, and narration. LAVES addresses limitations of end-to-end video generation models in scenarios requiring logical rigor and precise knowledge representation. By constructing a structured, executable video script that is compiled into synchronized visuals and narration, the system achieves a throughput of over one million videos per day with a 95% cost reduction compared to industry standards while maintaining a high acceptance rate.
Introduces a hierarchical LLM-based multi-agent system (LAVES) that decomposes educational video generation into specialized agents, enabling automated end-to-end production with high throughput and cost efficiency.
The paper introduces StateLM, a language model architecture with an internal reasoning loop and memory management tools (context pruning, document indexing, note-taking) that allows it to actively manage its own context. This addresses the limitation of standard LLMs that passively accept a fixed, manually engineered context. Experiments demonstrate that StateLM outperforms standard LLMs on long-document QA, chat memory, and complex research tasks, achieving significant accuracy improvements.
Empowers language models to actively manage their own context by introducing an internal reasoning loop and memory management tools, enabling stateful reasoning.
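A sketch of what the three memory tools could look like as methods the model calls from its inner reasoning loop; the class shape, method names, and character budget are our assumptions, not StateLM's API.

```python
class ManagedContext:
    """Hedged sketch of a self-managed context: the model, not the caller,
    decides what stays in the window, what gets indexed, and what to note."""
    def __init__(self, budget_chars=4000):
        self.window, self.notes, self.index = [], [], {}
        self.budget = budget_chars

    def add(self, text):                      # new evidence enters the window
        self.window.append(text)

    def prune(self, keep_if):                 # tool 1: context pruning
        self.window = [c for c in self.window if keep_if(c)]

    def index_document(self, doc_id, text):   # tool 2: document indexing
        self.index[doc_id] = text             # retrieved later instead of kept inline

    def take_note(self, note):                # tool 3: durable note-taking
        self.notes.append(note)

    def render(self):
        """What actually reaches the model at the next step, within budget."""
        return "\n".join(self.notes + self.window)[-self.budget:]

ctx = ManagedContext()
ctx.add("Chapter 1 ... (long text)")
ctx.index_document("ch1", "Chapter 1 full text")
ctx.take_note("Key fact: the contract renews on 2024-03-01.")
ctx.prune(lambda chunk: "Key" in chunk or len(chunk) < 100)
print(ctx.render())
```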
The paper introduces GameDevBench, a new benchmark for evaluating multimodal agents in game development, a domain requiring complex code manipulation and multimodal asset handling. The benchmark comprises 132 tasks derived from tutorials, demanding significantly more code and file changes compared to existing software development benchmarks. Experiments reveal that current agents struggle with game development tasks, particularly those involving 2D graphics, but performance can be improved by incorporating image and video-based feedback mechanisms.
Introduces GameDevBench, a novel benchmark designed to evaluate and advance multimodal agents in the challenging domain of game development.
This paper examines the shift in software engineering roles due to LLMs' code generation capabilities, arguing that system architecture is becoming the primary unit of engineering value. It uses case studies from the development of two systems, *Gaari* and *The Trail*, to illustrate how the engineering bottleneck is moving from syntax to system design. The paper concludes that modern engineers must transition to a "System Architect" model focused on logic and architecture.
Argues that the core engineering value in LLM-driven development is shifting from syntax to system architecture, requiring engineers to adopt a "System Architect" mindset.
The paper introduces ArtisanGS, an interactive tool suite for selecting and segmenting 3D Gaussian Splats (3DGS) to enable controllable editing of in-the-wild captures. It presents a fast AI-driven method for propagating user-guided 2D selection masks to 3DGS selections, supplemented by manual selection and segmentation tools for user intervention. The toolset's utility is demonstrated through user-guided local editing using a custom Video Diffusion Model, achieving binary segmentation of unstructured 3DGS scenes without additional optimization.
Introduces an interactive tool suite, ArtisanGS, for versatile Gaussian Splat selection and segmentation, enabling user-guided editing via a novel AI-driven propagation method and manual tools.
This paper investigates the "self-evolution trilemma" in multi-agent LLM systems, demonstrating the impossibility of simultaneously achieving continuous self-evolution, complete isolation, and safety invariance. Using an information-theoretic framework, the authors formalize safety as the divergence from anthropic value distributions and prove that isolated self-evolution leads to statistical blind spots, causing irreversible safety degradation. Empirical results from the Moltbook agent community and two closed self-evolving systems validate the theoretical prediction of inevitable safety erosion, highlighting the need for external oversight.
Proves the impossibility of simultaneously achieving continuous self-evolution, complete isolation, and safety invariance in multi-agent LLM systems, formalizing this as the "self-evolution trilemma."
The paper introduces Dreaming in Code (DiCode), a framework that uses foundation models to generate executable environment code variations for curriculum learning in open-ended environments. DiCode addresses the challenge of discovering learnable sequences of experiences in complex environments by "dreaming" code-level variations of the world to scaffold learning. Experiments in the Craftax environment demonstrate that DiCode enables agents to acquire long-horizon skills, achieving a 16% improvement in mean return over the strongest baseline and success on late-game combat tasks where prior methods fail.
Introduces DiCode, a novel framework leveraging foundation models to synthesize executable environment code for curriculum learning, enabling agents to acquire complex skills in open-ended environments.
This paper introduces MemFly, a framework for on-the-fly memory optimization in LLMs based on the information bottleneck principle. MemFly uses a gradient-free optimizer to minimize compression entropy while maximizing relevance entropy, creating a stratified memory structure. The framework incorporates a hybrid retrieval mechanism combining semantic, symbolic, and topological pathways, achieving superior performance in memory coherence, response fidelity, and accuracy compared to existing methods.
Introduces an information bottleneck-based framework, MemFly, for on-the-fly memory optimization in LLMs, enabling efficient compression and precise retrieval.
This paper investigates the applicability of attribution-based explainability methods, commonly used for static classification tasks, to agentic AI systems where behavior emerges over multi-step trajectories. The authors compare attribution-based explanations with trace-based diagnostics in both static classification and agentic benchmarks (TAU-bench Airline and AssistantBench). They find that attribution methods, while stable in static settings, are unreliable for diagnosing execution-level failures in agentic trajectories, whereas trace-grounded rubric evaluation effectively localizes behavior breakdowns.
Demonstrates the limitations of applying attribution-based explainability methods designed for static predictions to agentic AI systems and advocates for trajectory-level explainability.
This paper introduces SparseVideoNav, a novel approach to Beyond-the-View Navigation (BVN) that leverages video generation models to enable agents to navigate using only high-level intents. The key insight is that video generation models inherently benefit from long-horizon supervision, making them well-suited for BVN tasks where agents must locate distant, unseen targets. By generating sparse future trajectories spanning a 20-second horizon, SparseVideoNav achieves a 27x speed-up compared to unoptimized video generation, resulting in a 2.5x improvement in success rate over LLM baselines in real-world zero-shot experiments, including challenging night scenes.
Introduces SparseVideoNav, a novel framework integrating video generation for efficient and effective beyond-the-view vision-language navigation.
The paper introduces AIDE, a dual-stream framework for robots to execute ambiguous instructions in unfamiliar environments by interactively identifying task-relevant objects. AIDE uses Multi-Stage Inference (MSI) for decision-making and Accelerated Decision-Making (ADM) for execution, enabling zero-shot affordance analysis and instruction interpretation. Experiments demonstrate AIDE achieves over 80% task planning success and 95% accuracy in closed-loop execution at 10 Hz, surpassing existing VLM-based methods.
Introduces a dual-stream framework, AIDE, that integrates interactive exploration with vision-language reasoning for robots to execute ambiguous instructions through zero-shot affordance analysis.

