Search papers, labs, and topics across Lattice.
100 papers published across 5 labs.
Ditching rigid digital twins for adaptable world models could unlock truly intelligent edge computing in 6G networks.
Unleash creativity in text-to-image models with a single, reusable 64-token template, sidestepping costly iterative prompt engineering and reasoning.
Forget complex communication protocols – this trust-based algorithm lets agents learn to cooperate in competitive environments with minimal overhead.
AI career coaches can boost short-term goal progress not just through reflection, but by making users feel more socially accountable.
Forget finetuning – Kumiho's graph-native memory lets you swap in a better LLM and instantly double your agent's reasoning accuracy on complex cognitive tasks.
Forget tool-augmented systems: NEO shows you can consolidate search, recommendation, and reasoning into a single language-steerable LLM by representing items as SIDs and interleaving them with natural language.
Instead of passively transcribing doctor-patient dialogues, this system actively models what's known, what's missing, and what questions to ask next, paving the way for more intelligent EMR systems.
Robots often ignore your commands mid-task, but ReSteer offers a way to fix this by pinpointing and patching the "blind spots" in their training data.
Robots can now nimbly navigate complex, multi-floor environments without prior training, thanks to a new strategy that dynamically switches between exploration, recovery, and memory recall.
Agentic LLMs are surprisingly vulnerable: a new framework finds successful attacks in 84% of attempts by escalating prompt injection techniques across multiple stages.
RL agents can learn far more efficiently by dynamically distilling and leveraging past experiences that co-evolve with the agent's growing capabilities.
A multi-agent LLM system can fuse heterogeneous data sources to accurately classify building ages from satellite imagery, enabling better urban energy planning despite class imbalances in historical building cohorts.
LLMs can act as effective action-level supervisors in reinforcement learning, dramatically boosting the sample efficiency of SAC without sacrificing convergence guarantees.
Forget rigid physics engines, this badminton RL environment uses real player data to simulate realistic rallies and strategic gameplay.
Grounding LALM reasoning in diverse, reliability-weighted acoustic evidence blows away the competition in Audio Question Answering, proving that verifiable chains beat black boxes.
Simply prompting for test-driven development can *increase* regressions in AI coding agents; instead, focus on surfacing contextual information about which tests are most relevant.
LLMs in embodied environments get a massive boost from structured rules, with rule retrieval alone contributing +14.9 pp to single-trial success.
Forget prompt privacy – your LLM's responses are leaking *enterprise data*, and this paper shows how to quantify and control it.
Automating surgical patient triage with an LLM achieves 94% sensitivity, but discrepancies reveal more about clinical workflow gaps than AI errors.
Forget training wheels: GoalVLM lets multi-agent robots navigate to any object you describe, no pre-programmed categories needed.
Enterprise AI can achieve 50% token reduction and zero cross-entity leakage by implementing a shared, governed memory architecture for multi-agent workflows.
Current LLM agent safety benchmarks miss over 20% of unsafe behaviors: agents that pass them still act unsafely in ways the benchmarks never probe.
Tool-using agents are failing in predictable ways, but a model-agnostic policy layer can measurably improve their safety and reliability, albeit with a clear utility tradeoff.
Forget complex multi-agent systems: Skele-Code's no-code interface slashes token costs by shifting agent involvement to code generation only, enabling subject matter experts to build agentic workflows directly.
Despite the ease of integrating ML cloud services, developers are widely misusing them, leading to quality and maintainability issues that MLmisFinder can now automatically detect with high accuracy.
Forget about chasing the perfect model architecture – this work suggests the real key to better AI agents lies in crafting more precise and complete specifications, since the implementation can always be re-generated.
Scene graphs plus LLMs let robots ask clarifying questions, boosting multi-agent task success by 15%.
LLMs armed with RAG can reconstruct cyberattacks with high precision and recall, but the best model for the job depends on your budget: DeepSeek V3 matches Claude Sonnet 4's accuracy at 1/15th the cost.
Achieve SOTA LLM alignment in complex technical domains with a fraction of the compute by distilling knowledge into smaller models using a hybrid reward mechanism and targeted data augmentation.
Fine-grained access control for websites can finally enable safe and reliable delegation of critical tasks to AI agents.
LLM-powered trading agents can still achieve a Sharpe ratio of 1.40 even when completely blindfolded to ticker symbols and company names, suggesting genuine understanding of market dynamics.
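As a reminder of the metric behind that 1.40 figure, here is a standard annualized Sharpe ratio computation. This is a generic sketch, not the paper's code; the daily frequency, 252 trading periods per year, and zero risk-free rate are assumptions for illustration.

```python
import math

def sharpe_ratio(returns, periods_per_year=252, risk_free=0.0):
    """Annualized Sharpe ratio: mean excess return divided by its
    sample standard deviation, scaled by sqrt(periods per year).
    Assumes `returns` are per-period (e.g. daily) simple returns."""
    excess = [r - risk_free for r in returns]
    n = len(excess)
    mean = sum(excess) / n
    var = sum((r - mean) ** 2 for r in excess) / (n - 1)  # sample variance
    return (mean / math.sqrt(var)) * math.sqrt(periods_per_year)

# Four days of hypothetical daily returns
print(round(sharpe_ratio([0.01, -0.005, 0.007, 0.002]), 2))
```

A ratio above 1 is generally considered good risk-adjusted performance, which is what makes the blindfolded agents' 1.40 notable.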
Retrieval-augmented LLM agents can learn to learn from experience, achieving significantly better generalization on unseen tasks by combining the strengths of fine-tuning and in-context retrieval.
A 4B parameter model can nearly match the privilege escalation performance of a state-of-the-art closed LLM like Claude Opus, while being fully local and 100x cheaper to run.
LLMs acting as semantic interfaces to our brains pose unprecedented ethical risks to mental autonomy and neurorights, demanding a new "second-order neuroethics."
LLMs can be economically aligned to real-world consumer preferences via post-training on transaction data, enabling more accurate and stable economic simulations.
Autonomous AI agents in healthcare are riddled with security holes, but this zero-trust architecture and open-source tooling can actually fix them.
You can now audit multi-agent LLM systems and trace responsibility for harmful outputs even without access to internal execution logs, thanks to a clever "self-describing text" technique.
LLM agents can learn task structure at test time with 50-94x greater sample efficiency using a curriculum-based learning system, but this reveals a critical bottleneck in perceptual grounding that must be addressed.
Forget prompt engineering: AgentFactory lets LLM agents self-evolve by accumulating and refining executable Python subagents, making task re-execution more reliable and efficient.
Grey-box fuzzing of LLM agents, guided by tool invocation sequences, reveals significantly more prompt injection vulnerabilities and malicious behaviors than black-box testing alone.
Forget static honeypots – LLMs and RL could make cyber deception dynamic and adaptive, turning the tables on attackers in contested environments.
Symphony's cognitively-inspired multi-agent system significantly boosts long-form video understanding by mimicking human reasoning, achieving state-of-the-art results on multiple benchmarks.
Existing threat models fail to capture the unique vulnerabilities of Model Context Protocol systems, but MCP-38 fills this gap with a comprehensive taxonomy of 38 distinct threat categories.
Forget collapsing videos into text – this hierarchical grid lets you zoom into any moment with lossless visual fidelity, unlocking logarithmic compute scaling for long-form video understanding.
Digital literacy gaps shrink as a browser extension slashes information retrieval time by 87% using an AI-powered tooltip that defines technical acronyms on demand.
Forget specialized tools: a standard Unix terminal and clever RL are all you need to beat much larger LLMs at code search.
Generalizing RL to continuous state and action spaces just got easier: this paper introduces an operator-theoretic framework and PPO-type algorithms that ditch finite-state assumptions.
LLMs can achieve state-of-the-art Alzheimer's detection by mimicking clinical cognitive assessment protocols, not just learning statistical patterns.
LLMs can navigate complex 3D environments more effectively and with far fewer tokens by using a hierarchical scene graph representation derived from omnidirectional sensor data.
LLMs can now generate Verilog code that's not just correct, but also optimized for real-world hardware constraints like power, performance, and area, thanks to a novel multi-agent system with evolving memory.
AdaZoom-GUI achieves SOTA GUI grounding by adaptively zooming in on small elements and refining ambiguous instructions, outperforming even larger models.
VLMs can now drive embodied agents to navigate complex environments with unprecedented efficiency, thanks to a novel framework that bridges the gap between 2D semantic understanding and 3D spatial reasoning.
A 7B model, fine-tuned with a novel inverse specification reward, can generate slide presentations rivaling those of much larger models, highlighting the importance of instruction adherence and tool use over raw parameter count.
Current multimodal browsing agents are surprisingly bad at using visual information on webpages, with even top models scoring below 50% accuracy on a new visual-native search benchmark.
Even when given identical data and research questions, autonomous AI coding agents exhibit surprisingly high variability in their empirical findings, raising concerns about the reliability of AI-driven research.
Stop wasting compute: this RL-trained orchestration policy adaptively decides when your embodied agent should reason with an LLM, slashing latency and boosting task success compared to fixed strategies.
Current AI agent governance methods are too static; runtime evaluation of execution paths is necessary for effective, path-dependent policy enforcement.
LLMs can't crack Clue: even state-of-the-art models struggle with multi-step deductive reasoning in a simulated text-based game, and fine-tuning doesn't reliably help.
Even without pre-loaded database schemas, a new RL agent matches or beats state-of-the-art text-to-SQL models that have full schema access.
Language models can learn directly from real-world user interactions, boosting performance without human annotations or simulated environments.
User-facing guardrails for LLM-enabled robots can balance flexibility and safety by offering constrained choices and clear recourse, rather than open-ended value settings.
LLMs can now reliably translate natural language into executable option trading strategies, thanks to a new domain-specific language that constrains their output to verifiable semantic parses.
Open-source VLMs can be easily fooled by simple gradient-based attacks, but the degree of vulnerability varies drastically across architectures.
RepoReviewer tackles the complexity of repository-level code review with a multi-agent architecture, breaking down the monolithic process into manageable stages for more relevant and efficient feedback.
Forget generic code generation – this work shows that structure-aware retrieval of domain-specific examples slashes the debugging needed to get LLMs to produce working scientific visualization pipelines.
Forget hand-crafted visual prompts – this framework automatically discovers counter-intuitive image manipulation strategies that dramatically boost LVLM perception.
A quadrupedal robot can now provide on-demand assistance to wheelchair users, offering a more agile and less intrusive alternative to fixed robotic arms.
You can provably find Nash equilibria even when one player only knows the *reaction* of the other, not their full objective.
Forget pre-built maps: this new navigation agent interprets signs like a human, achieving 80% success in complex indoor environments.
A novel DRL approach can extend XR device battery life by 163% without sacrificing real-time responsiveness, offering a practical solution to the energy-latency trade-off in immersive applications.
Forget expensive motion capture suits – TeleDex lets you teleoperate dexterous robots with just your phone.
A multi-agent system that mimics rubber-duck debugging slashes critical path delay by 25% and power consumption by 22% in RTL code, outperforming LLM-based baselines.
Human-centered design can successfully integrate AI to support collective intelligence in deliberative democracy, offering a pathway to more trustworthy and inclusive democratic processes.
Coding agents struggle to maintain faithfulness to specifications that emerge gradually over long interactions, losing significant implementation fidelity compared to single-shot specifications.
LLMs can automate the creation of enriched provenance graphs from system logs, leading to more accurate and interpretable anomaly detection without manual rule engineering.
By explicitly modeling attacker stages, DeepStage achieves significantly better defense performance against APTs than risk-aware baselines, suggesting that stage-aware reasoning is crucial for effective autonomous cyber defense.
AI-generated code's fluency masks a critical flaw: it often fails to deliver what users actually intend, highlighting the urgent need for "intent formalization" to bridge the gap between informal requirements and precise program behavior.
Smarter placement of slow chargers can significantly reduce the need for expensive en-route EV charging, leading to lower overall system costs.
AI agents are spontaneously converging on shared memory architectures that resemble open learner models, suggesting a natural path to collaborative learning systems.
Mental health disclosures in user profiles can *increase* LLM agent refusal rates on both harmful and benign tasks, revealing a fragile safety-utility trade-off easily overridden by jailbreaks.
Reinforcement learning agents can now learn to be "good" (i.e., norm-compliant) via a novel pipeline that leverages argumentation-based normative advisors and automatically extracts the reasoning behind those norms.
Document-level sentiment analysis gets a boost with DanceHA, a multi-agent framework that not only tackles the complexity of informal writing but also shows how agent knowledge can be distilled into more efficient student models.
Multimodal agents can now plan more coherently and solve complex tasks thanks to a new anticipatory reasoning framework that forecasts short-horizon trajectories before acting.
Automated microscopy can now actively discover new scientific information by searching for diverse functional responses, rather than being limited to optimizing for known objectives.
Lightweight LLMs like Gemini 2.0 and GPT-3.5 can extract key metadata from cloud incident reports with surprisingly high accuracy (75-95%), offering a cost-effective alternative to larger models.
Achieve 91%+ Hit@1 retrieval accuracy in a local-first long-term memory system for AI assistants by combining vector recall, keyword recall, RRF, and re-ranking, while maintaining sub-90ms search latency at scale.
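One of the fusion steps named above, reciprocal rank fusion (RRF), is simple enough to sketch: each document's fused score is the sum of 1/(k + rank) over the ranked lists it appears in. This is the textbook formulation with the conventional k=60, not the paper's implementation; the function and variable names are illustrative.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists of doc ids.
    Each doc scores sum(1 / (k + rank)) over the lists containing it;
    docs surfaced by multiple recall paths rise to the top."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. vector recall and keyword recall each return a ranked id list
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
print(rrf_fuse([vector_hits, keyword_hits]))  # "b" wins: ranked high in both lists
```

In a hybrid pipeline like the one described, the RRF output would then be passed to a re-ranker for the final ordering.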
LLMs can learn to recover from mistakes more effectively by reflecting on past failures and internalizing actionable feedback, leading to significant gains in long-horizon problem-solving.
Forget curated datasets – this work shows you can bootstrap AI scientists by training them on automatically generated, self-verified ML tasks, leading to significant performance gains on MLGym.
ARISE lets language models solve math problems better by learning and reusing successful solution strategies, outperforming existing RL methods, especially on harder, out-of-distribution problems.
Constraint propagation can significantly boost dynamic programming by pruning states and transitions, but the overhead needs further optimization.
AI-agent communities aren't just pale imitations of human ones; they're structurally and linguistically distinct, exhibiting extreme inequality and homogenization driven by identifiable agent-level stylistic outliers.
A novel human-centered architecture finally unlocks the potential of LLM-powered cognitive assistants to revolutionize quality management in manufacturing.
Identity-based software signing may reduce key management burdens, but it relocates complexity to verification, configuration, and deployment, creating new usability challenges.
Current authorization models are too coarse for AI agents interacting with web services; PAuth offers a more precise solution by authorizing only the specific operations required for a user's task.
Security scanners flag nearly half of AI agent skills as malicious, but adding GitHub repository context reveals that the true number is closer to 0.5%.
A Qwen3-8B model, trained with a new SFT+RLAIF recipe on a challenging new benchmark, SWE-QA-Pro, beats GPT-4o in repository-level code understanding.
Skip the manual effort: CABTO uses large models to automatically generate complete and consistent behavior tree systems for robot manipulation.
LLMs can ace the NL2SQL benchmark, but throw in some typos or rephrase the question, and their performance tanks, especially in agentic settings.
General LLMs can't handle the nuances of expressway operations, so this paper built ExpressMind, a specialized multimodal LLM that outperforms existing models in event detection, safety response, and traffic analysis.
LLMs can now remember and reason about long-term conversations with significantly improved accuracy thanks to a new temporal-aware memory framework that structures dialogue into event calendars.