Reasoning & Chain-of-Thought
Capabilities: Chain-of-thought prompting, mathematical reasoning, logical inference, and step-by-step problem solving in LLMs.
Recent Papers
This paper introduces SMAPPO, a scalable multi-agent reinforcement learning framework for decentralized multi-robot management in multi-machine tending scenarios. SMAPPO employs a novel observation encoder to achieve input-size invariance, enabling it to handle varying numbers of agents, machines, and storage areas without retraining. Experiments demonstrate that SMAPPO outperforms MAPPO in full retraining, curriculum learning, zero-shot generalization, and adaptability under low initial training, showing significant improvements in productivity, collision avoidance, and parts delivery.
Introduces a novel observation encoder for MAPPO that enables zero-shot generalization to variable numbers of agents and machines in multi-agent reinforcement learning.
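The input-size invariance at the heart of this design can be made concrete with a permutation-invariant set encoder: embed each entity separately with shared weights, then pool. The sketch below is a generic illustration under assumed feature sizes, not SMAPPO's actual encoder.

```python
# Minimal sketch of an input-size-invariant observation encoder: each
# entity (agent, machine, storage area) is embedded with a shared MLP,
# then mean-pooled, so any entity count is accepted without retraining.
# Dimensions and the pooling choice are illustrative assumptions.
import torch
import torch.nn as nn

class SetObservationEncoder(nn.Module):
    def __init__(self, entity_dim=8, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(entity_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))

    def forward(self, entities):               # (batch, n_entities, entity_dim)
        return self.phi(entities).mean(dim=1)  # (batch, hidden), any n_entities

enc = SetObservationEncoder()
print(enc(torch.randn(2, 3, 8)).shape)   # 3 entities
print(enc(torch.randn(2, 10, 8)).shape)  # 10 entities -- same weights
```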
The paper introduces the Visual Reasoning Benchmark (VRB), a new dataset of 701 visual reasoning questions sourced from primary school exams in Zambia and India, designed to evaluate multimodal large language models (MLLMs). The VRB focuses on minimal-text images to simulate realistic classroom visual reasoning problems, covering tasks like analogy, pattern completion, and spatial matching. Experiments using the VRB reveal that MLLMs exhibit a "jagged frontier" of capabilities, performing well on static tasks like counting but struggling with dynamic spatial operations like folding and rotation.
Introduces the Visual Reasoning Benchmark (VRB), a novel dataset of classroom-authentic visual reasoning problems, to evaluate the spatial reasoning capabilities of MLLMs.
The paper introduces KeplerAgent, an LLM-based agent designed for symbolic equation discovery that mimics the scientific reasoning process of inferring physical properties before guessing equations. KeplerAgent coordinates physics-based tools to extract intermediate structure from data and uses this information to configure symbolic regression engines like PySINDy and PySR. Experiments on physical equation benchmarks demonstrate that KeplerAgent achieves significantly higher symbolic accuracy and robustness to noisy data compared to existing LLM and traditional baselines.
Introduces KeplerAgent, an agentic framework that enhances symbolic equation discovery by explicitly modeling the scientific reasoning process of inferring physical properties and using them to constrain the search space of candidate equations.
The paper introduces WavBench, a new benchmark for end-to-end spoken dialogue models that evaluates reasoning, colloquialism, and paralinguistics, addressing limitations of existing text-centric benchmarks. WavBench comprises three subsets: Pro (reasoning), Basic (colloquialism), and Acoustic (paralinguistics), designed to assess complex problem-solving, natural language fluency, and nuanced understanding/generation of acoustic cues. Evaluation of five state-of-the-art models using WavBench reveals critical insights into model performance across these dimensions, highlighting areas for improvement in building more robust spoken dialogue agents.
Introduces WavBench, a novel benchmark dataset and evaluation toolkit designed to comprehensively assess reasoning, colloquialism, and paralinguistic capabilities in end-to-end spoken dialogue models.
The paper investigates the phenomenon of "benchmark illusion," where LLMs with similar benchmark accuracy exhibit significant disagreement on individual data points. Using MMLU-Pro and GPQA benchmarks, the authors quantify the disagreement rates between various LLMs, including top-performing frontier models. They demonstrate that this disagreement can lead to substantial variability in scientific research outcomes when LLMs are used for data annotation and inference, impacting the reproducibility of results.
Demonstrates that seemingly convergent benchmark accuracy among LLMs masks substantial disagreement on individual data points, leading to significant consequences for scientific reproducibility.
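The core measurement is simple enough to sketch: hold accuracy fixed and look at item-level agreement. The toy below uses synthetic predictions, not the paper's models or benchmarks.

```python
# Two synthetic "models" with matched accuracy still disagree on many
# individual items -- the benchmark-illusion effect in miniature.
import random
random.seed(0)

n = 1000
gold = [random.randint(0, 3) for _ in range(n)]

def noisy_preds(acc):
    # correct with probability `acc`, otherwise a random wrong option
    return [g if random.random() < acc
            else random.choice([c for c in range(4) if c != g])
            for g in gold]

a, b = noisy_preds(0.75), noisy_preds(0.75)
acc = lambda p: sum(x == g for x, g in zip(p, gold)) / n
disagree = sum(x != y for x, y in zip(a, b)) / n
print(f"acc A={acc(a):.3f}  acc B={acc(b):.3f}  disagreement={disagree:.3f}")
```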
The paper introduces PhyNiKCE, a neurosymbolic agentic framework that addresses the limitations of LLMs in autonomous CFD by decoupling neural planning from symbolic validation. PhyNiKCE uses a Symbolic Knowledge Engine to enforce physical constraints via a Deterministic RAG Engine, treating simulation setup as a Constraint Satisfaction Problem. Experiments using OpenFOAM and Gemini-2.5-Pro/Flash demonstrate a 96% improvement over baselines, a 59% reduction in self-correction loops, and a 17% decrease in LLM token consumption.
Introduces PhyNiKCE, a neurosymbolic framework that integrates neural planning with symbolic constraint enforcement to improve the reliability and efficiency of autonomous CFD agents.
The paper introduces Spatial Chain-of-Thought (SCoT), a framework that combines the spatial reasoning of Multimodal Large Language Models (MLLMs) with the generative capabilities of diffusion models for improved image generation. SCoT trains a diffusion model on interleaved text-coordinate instructions to enhance layout awareness and uses MLLMs as planners to generate detailed layout plans. Experiments show SCoT achieves state-of-the-art performance on image generation benchmarks and excels in complex reasoning and image editing tasks.
Introduces Spatial Chain-of-Thought (SCoT), a novel plug-and-play framework that bridges MLLM reasoning and diffusion model generation by training the diffusion model with interleaved text-coordinate instructions and using MLLMs for spatial planning.
The paper introduces PASCAL, a phase-aware scheduling algorithm designed to optimize the serving of reasoning-based LLMs by explicitly differentiating and prioritizing the reasoning phase to minimize Time-To-First-Token (TTFT). PASCAL employs a hierarchical scheduler with instance-level placement, intra-instance execution management, and dynamic migration at phase boundaries to balance load and reduce interference. Experiments using DeepSeek-R1-Distill-Qwen-32B show that PASCAL reduces tail TTFT by up to 72% while preserving answering phase SLO attainment, highlighting the benefits of phase-aware scheduling.
Introduces a phase-aware scheduling algorithm, PASCAL, that optimizes LLM serving by prioritizing the reasoning phase to reduce TTFT and employing controlled preemption and token pacing during the answering phase to maintain QoE.
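As a rough intuition for phase-aware prioritization, consider a toy priority queue in which reasoning-phase requests are served ahead of answering-phase ones; PASCAL's actual hierarchical scheduler (placement, execution management, phase-boundary migration) is far richer than this sketch.

```python
# Toy phase-aware queue: reasoning-phase requests jump ahead of
# answering-phase requests, cutting queueing delay before first token.
import heapq
import itertools

REASONING, ANSWERING = 0, 1   # lower value = higher priority
order = itertools.count()     # FIFO tie-break within a phase
queue = []

def submit(req_id, phase):
    heapq.heappush(queue, (phase, next(order), req_id))

submit("r1", ANSWERING)
submit("r2", REASONING)   # arrives later, but is served first
submit("r3", REASONING)

while queue:
    phase, _, req_id = heapq.heappop(queue)
    print(req_id, "reasoning" if phase == REASONING else "answering")
```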
This paper introduces Talk2DM, a plug-and-play module designed to enhance vehicle-road-cloud dynamic map (VRC-DM) systems with natural language querying and commonsense reasoning capabilities. To facilitate this, the authors created VRCsim, a VRC cooperative perception simulation framework, and VRC-QA, a question-answering dataset focused on spatial reasoning in mixed-traffic scenarios. Talk2DM leverages a novel chain-of-prompt (CoP) mechanism to integrate human-defined rules with LLM knowledge, achieving high accuracy and reasonable response times with models like Qwen3:8B, Gemma3:27B, and GPT-oss.
Introduces a chain-of-prompting method (CoP) that enables LLMs to effectively query and reason about dynamic maps by combining human-defined rules with the LLM's inherent commonsense knowledge.
The paper introduces MEME, a novel framework that models financial markets as an evolving ecosystem of investment narratives ("Modes of Thought") to improve portfolio construction. MEME uses a multi-agent extraction module to convert noisy data into Investment Arguments, then employs Gaussian Mixture Modeling to identify consensus within a semantic space and a temporal evaluation mechanism to track the lifecycle of these modes. Experiments on Chinese stock pools from 2023-2025 show MEME outperforms seven state-of-the-art baselines, demonstrating its ability to adapt to evolving market consensus.
Introduces a logic-oriented framework, MEME, that models financial markets as a dynamic ecosystem of evolving investment narratives to guide portfolio construction.
The paper addresses the problem of detecting training data contamination in Reinforcement Learning with Verifiable Rewards (RLVR) fine-tuned reasoning models, where standard likelihood-based detection methods are ineffective. They observe that RLVR training leads to a structural convergence in the model's generations for seen prompts, resulting in more rigid and similar outputs compared to unseen prompts. They introduce Min-$k$NN Distance, a black-box detector that leverages this convergence by measuring the average of the $k$ smallest nearest-neighbor edit distances between multiple completions of a given prompt.
Introduces Min-$k$NN Distance, a novel black-box detector, to identify RLVR training data by quantifying the structural convergence of reasoning trajectories induced by RLVR.
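The statistic itself is straightforward to compute. Here is a self-contained sketch with a textbook Levenshtein distance and toy completions; how many samples to draw and where to set the detection threshold are left to the detector's user.

```python
# Min-kNN Distance sketch: for each completion, find its nearest-neighbor
# edit distance among the other completions, then average the k smallest.
# Small values suggest the structural convergence seen on RLVR-trained prompts.
def edit_distance(a, b):
    # standard dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def min_knn_distance(completions, k=2):
    nn = [min(edit_distance(c, o) for j, o in enumerate(completions) if j != i)
          for i, c in enumerate(completions)]
    return sum(sorted(nn)[:k]) / k

seen   = ["step1 step2 answer=42"] * 3 + ["step1 step2 answer=42 done"]
unseen = ["try x then y", "guess z first", "compute via w", "eliminate options"]
print("seen-like:", min_knn_distance(seen), " unseen-like:", min_knn_distance(unseen))
```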
The paper introduces Execute-Summarize (ES), a framework that decouples task execution from workflow construction in LLMs, addressing the challenge of accurately translating LLM reasoning into structured workflows. ES first completes the task using available tools and then independently reconstructs a structured workflow from execution traces. Experiments on the newly introduced FlowBench demonstrate that ES outperforms existing methods, establishing a more reliable paradigm for grounding free-form LLM reasoning into structured workflows.
Introduces Execute-Summarize (ES), a novel framework that decouples task execution and workflow construction to improve the accuracy and robustness of structured workflow generation from LLM reasoning.
The paper introduces MuRGAt, a new benchmark for evaluating fact-level multimodal attribution in complex reasoning scenarios involving video, audio, and other modalities. MuRGAt requires models to generate answers with explicit reasoning and precise citations that specify modality and temporal segments. The authors also present an automatic evaluation framework that correlates with human judgments, revealing that current MLLMs often hallucinate citations even with correct reasoning, and that increasing reasoning depth can degrade attribution accuracy.
Introduces MuRGAt, a challenging benchmark and automatic evaluation framework for fact-level multimodal attribution that exposes limitations in current MLLMs' ability to ground reasoning in heterogeneous input sources.
The paper introduces SIGHT, a reinforcement learning framework designed to improve search-based reasoning in LLMs by mitigating redundancy and noise in search results. SIGHT uses Self-Evidence Support (SES) to distill search results into high-fidelity evidence and employs an Information Gain score to identify pivotal states for Dynamic Prompting Interventions like de-duplication and adaptive branching. By integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT achieves superior performance on single-hop and multi-hop QA benchmarks with fewer search steps compared to existing methods.
Introduces a novel reinforcement learning framework, SIGHT, that leverages self-evidence support and information-gain driven diverse branching to enhance search-based reasoning in LLMs.
This paper introduces dVoting, a novel test-time technique for Diffusion Large Language Models (dLLMs) that leverages their parallel decoding capabilities to enhance reasoning. dVoting iteratively refines token predictions by sampling multiple outputs, identifying inconsistent tokens, and regenerating them through a voting mechanism until convergence. Experiments on GSM8K, MATH500, ARC-C, and MMLU demonstrate consistent performance improvements, highlighting the potential of dVoting to boost dLLM reasoning without additional training.
Introduces dVoting, a parallelizable, training-free voting technique that leverages the unique capabilities of dLLMs to iteratively refine and improve reasoning performance by focusing on uncertain tokens.
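The loop structure is easy to convey with a toy: sample several drafts in parallel, freeze positions where drafts strongly agree, and resample the rest. The random sampler below is a stand-in for a dLLM's parallel decoder; the quorum size and noise level are arbitrary choices, not dVoting's.

```python
# Toy dVoting loop: freeze tokens with strong cross-draft agreement,
# regenerate the disputed positions, repeat until all tokens converge.
import random
random.seed(1)

TARGET = ["the", "answer", "is", "42"]

def sample_draft(fixed):
    # hypothetical sampler: keeps frozen tokens, is noisy elsewhere
    return [fixed[i] if fixed[i] is not None
            else (TARGET[i] if random.random() < 0.7 else "???")
            for i in range(len(TARGET))]

def dvoting(n_samples=5, max_rounds=10, quorum=4):
    fixed = [None] * len(TARGET)
    for _ in range(max_rounds):
        drafts = [sample_draft(fixed) for _ in range(n_samples)]
        for i, t in enumerate(fixed):
            if t is not None:
                continue
            votes = [d[i] for d in drafts]
            top = max(set(votes), key=votes.count)
            if votes.count(top) >= quorum:   # strong agreement -> freeze token
                fixed[i] = top
        if all(t is not None for t in fixed):
            break
    return fixed

print(dvoting())
```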
This paper introduces Differentiable Modal Logic (DML) implemented via Modal Logical Neural Networks (MLNNs) to enable multi-agent systems to learn relationships like trust networks and causal chains from behavioral data. DML addresses the limitations of traditional modal logic, which requires manual specification of relationship structures. The authors demonstrate a neurosymbolic debugging framework across epistemic, temporal, deontic, and doxastic modalities, showing how logical contradictions can be formulated as learnable optimization objectives in scenarios ranging from diplomacy games to LLM hallucination detection.
Introduces Differentiable Modal Logic (DML) and Modal Logical Neural Networks (MLNNs) to learn interpretable relationship structures in multi-agent systems directly from data, replacing manual specification.
This paper explores the use of Mamba-2 hybrid operators within Tiny Recursive Models (TRM) for abstract reasoning, motivated by Mamba-2's inherent iterative refinement properties. By replacing Transformer blocks in TRM with Mamba-2 hybrids while maintaining parameter parity, the authors demonstrate improved performance on the ARC-AGI-1 benchmark. Specifically, the Mamba-2 hybrid TRM achieves a +2.0% improvement in pass@2 and a +4.75% improvement in pass@100, suggesting enhanced candidate coverage.
Demonstrates that Mamba-2 hybrid operators can effectively replace Transformer blocks within Tiny Recursive Models, leading to improved performance on abstract reasoning tasks.
The paper investigates how reasoning behaviors in LLMs influence reasoning quality by analyzing behavioral patterns in model responses. They find that injecting specific reasoning behavior patterns can significantly improve reasoning outcomes. Based on this, they propose two parameter-free optimization methods, InjectCorrect (imitating patterns from past correct answers) and InjectRLOpt (using a learned value function to generate behavior injectants), to steer the reasoning process.
Introduces InjectRBP, a novel framework for steering LLM reasoning by structurally injecting observed behavioral patterns, without requiring parameter updates.
This paper investigates the impact of communication delays on cooperation in LLM-based multi-agent systems using a Continuous Prisoner's Dilemma. The authors introduce the FLCOA framework to emphasize the importance of lower-layer factors like communication resources in multi-agent cooperation. Their simulations reveal a U-shaped relationship between communication delay and mutual cooperation, where increased delay initially leads to exploitation but excessive delay reduces exploitation cycles.
Demonstrates that communication delays in LLM-based multi-agent systems can significantly impact cooperation, leading to exploitation and a non-monotonic relationship between delay magnitude and mutual cooperation.
The paper addresses the problem of excessive and unnecessary reflection in Large Reasoning Models (LRMs) that leads to increased token consumption and computational overhead without improving accuracy, especially in smaller models. To mitigate this, they propose Adaptive Reflection and Length Coordinated Penalty (ARLCP), a reinforcement learning framework that dynamically balances reasoning efficiency and solution accuracy by introducing reflection and length penalties. Experiments on mathematical reasoning benchmarks using DeepSeek-R1-Distill-Qwen-1.5B and 7B models demonstrate that ARLCP achieves a superior efficiency-accuracy trade-off, reducing response length by up to 53.1% while improving accuracy by up to 5.8%.
Introduces ARLCP, a novel reinforcement learning framework with adaptive reflection and length penalties, to train LRMs for efficient reasoning by curtailing unnecessary reflective steps while preserving essential reasoning.
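A hedged sketch of what a coordinated reflection/length penalty might look like: correct answers earn full reward, docked for excess tokens and for reflection markers. The coefficients, marker list, and linear shape are our assumptions for illustration, not ARLCP's published formulation.

```python
# Assumed reward shaping: dock correct answers for length overruns and
# for reflection markers, so unnecessary re-checking is discouraged.
import re

MARKERS = re.compile(r"\b(wait|hmm|let me (re-)?check|on second thought)\b", re.I)

def shaped_reward(response, correct, lam_len=2e-4, lam_ref=0.05, target_len=800):
    if not correct:
        return 0.0
    len_pen = lam_len * max(0, len(response.split()) - target_len)
    ref_pen = lam_ref * len(MARKERS.findall(response))
    return max(0.0, 1.0 - len_pen - ref_pen)

short = "x = 3, so the answer is 9."
longwinded = "x = 3. Wait, let me check. Hmm, " + "recap " * 900 + "answer is 9."
print(shaped_reward(short, True), shaped_reward(longwinded, True))
```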
The paper introduces PRIME, a new benchmark designed to evaluate verifiers for process-outcome alignment in mathematical and engineering problem-solving, addressing the limitations of outcome-centric verification methods in Reinforcement Learning with Verifiable Rewards (RLVR). PRIME consists of 2,530 high-difficulty STEM problems and is used to demonstrate that existing verifiers often fail to identify flaws in the derivation process. The authors show that RLVR training using verifiers selected based on PRIME significantly improves performance on challenging math problem sets, and that PRIME's accuracy strongly correlates with RLVR training effectiveness.
Introduces PRIME, a novel benchmark for evaluating the ability of verifiers to align the reasoning process with the final outcome in complex STEM problems.
The paper introduces UniT, a framework for multimodal chain-of-thought test-time scaling that allows a unified model to iteratively reason, verify, and refine its outputs. UniT employs agentic data synthesis to create training data, trains a unified model, and uses flexible test-time inference to encourage cognitive behaviors. Experiments demonstrate that models trained on short reasoning trajectories generalize to longer inference chains, sequential chain-of-thought reasoning is more scalable than parallel sampling, and training on generation/editing trajectories improves out-of-distribution visual reasoning.
Introduces UniT, a novel framework enabling multimodal chain-of-thought test-time scaling for unified models, facilitating iterative reasoning, verification, and refinement.
This paper investigates whether GPT-4o possesses a genuine Theory of Mind (ToM) by evaluating its ability to model the causal relationship between mental states and behavior. The authors developed a novel evaluation framework based on a cognitively-grounded definition of ToM, probing for coherence, domain-generality, and consistency in the model's understanding of mental state causality. The key finding is that while GPT-4o can approximate human judgments in simple ToM tasks, it fails on logically equivalent tasks and demonstrates low consistency between predicted actions and inferred mental states, suggesting a lack of a robust ToM.
Demonstrates that GPT-4o, despite apparent social proficiency, lacks a coherent, domain-general, and consistent Theory of Mind by revealing inconsistencies in its mental state inferences and action predictions.
The paper introduces audio-interleaved reasoning for Large Audio Language Models (LALMs) to overcome the information bottleneck of one-time audio encoding. They propose a two-stage training framework involving supervised fine-tuning for salient audio segment localization and reinforcement learning to encourage re-listening. The resulting LALM, Echo, demonstrates improved performance on audio comprehension benchmarks, showcasing the benefits of dynamic audio re-listening during reasoning.
Introduces and validates audio-interleaved reasoning, enabling LALMs to actively re-listen to audio during the reasoning process, thereby improving audio comprehension.
This paper extends the Quantified Boolean Bayesian Network (QBBN) to incorporate negation and backward reasoning, completing Prawitz's simple elimination rules within a probabilistic factor graph framework. It introduces a typed logical language with role-labeled predicates and modal quantifiers, along with a typed slot grammar that deterministically compiles sentences to logical form. The authors demonstrate that while LLMs can assist in disambiguation, grammars are essential for structured parsing, and the QBBN architecture leverages LLMs for annotation and verification in logical information retrieval.
Introduces a complete logical information retrieval system combining LLMs, typed slot grammars, and a QBBN inference engine to reconcile formal semantics with modern language models.
The paper introduces Sci-CoE, a two-stage co-evolution framework for scientific reasoning LLMs that transitions from sparse supervision to unsupervised learning. Sci-CoE uses a small labeled dataset to bootstrap a Verifier and then employs a geometric reward mechanism incorporating consensus, reliability, and diversity to drive self-iteration on unlabeled data. Experiments on scientific benchmarks demonstrate that Sci-CoE improves complex reasoning capabilities and evaluation robustness.
Introduces a geometric reward mechanism that jointly considers consensus, reliability, and diversity to drive the co-evolution of scientific reasoning LLMs in an unsupervised manner.
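If "geometric" here means a geometric mean of the three signals (our reading, not confirmed by the summary), the mechanism reduces to a one-liner whose useful property is that any zero component vetoes the reward:

```python
# Assumed sketch: geometric mean of consensus, reliability, and diversity.
# A zero in any component zeroes the whole reward, unlike an arithmetic mean.
def geometric_reward(consensus, reliability, diversity):
    return (consensus * reliability * diversity) ** (1 / 3)

print(geometric_reward(0.9, 0.8, 0.7))   # ~0.796
print(geometric_reward(0.9, 0.8, 0.0))   # 0.0 -- diversity collapse vetoes
```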
The paper introduces UniDFlow, a unified discrete flow-matching framework for multimodal tasks, separating understanding and generation through low-rank adapters. It addresses objective interference and representation entanglement, while also incorporating reference-based multimodal preference alignment for enhanced faithfulness and controllability. UniDFlow achieves state-of-the-art results on eight benchmarks and demonstrates strong zero-shot generalization across various tasks.
Introduces a unified discrete flow-matching framework, UniDFlow, that decouples multimodal understanding and generation using task-specific adapters and reference-based preference alignment.
The paper introduces LawThinker, a legal reasoning agent designed to improve the accuracy and procedural compliance of legal reasoning in dynamic environments. LawThinker employs an Explore-Verify-Memorize strategy, integrating a DeepVerifier module to assess knowledge accuracy, fact-law relevance, and procedural compliance after each knowledge exploration step. Experiments on the J1-EVAL benchmark demonstrate a 24% improvement over direct reasoning and an 11% improvement over workflow-based methods, along with strong generalization across three static benchmarks.
Introduces an Explore-Verify-Memorize strategy with a DeepVerifier module to enforce verification as an atomic operation after each knowledge exploration step in legal reasoning.
The paper introduces Thinking with Drafting (TwD), a novel approach to visual reasoning that uses a domain-specific language (DSL) as an intermediate representation to bridge the gap between optical perception and logical reasoning in multimodal LLMs. TwD forces the model to draft its reasoning process into executable code, enabling the generation of deterministic visual proofs for self-verification. The authors validate TwD on VisAlg, a new visual algebra benchmark, demonstrating that TwD provides a superior cognitive scaffold for complex reasoning tasks.
Introduces Thinking with Drafting (TwD), a novel framework that uses a minimalist DSL to represent and execute reasoning steps, enabling verifiable visual proofs.
The paper addresses the structural blindness of Multimodal Large Language Models (MLLMs) when applied to engineering schematics by introducing a Vector-to-Graph (V2G) pipeline. V2G converts CAD diagrams into property graphs that explicitly represent component connectivity and dependencies. Experiments on an electrical compliance check benchmark demonstrate that V2G significantly improves accuracy compared to MLLMs, which struggle due to their pixel-driven approach.
Introduces a Vector-to-Graph (V2G) pipeline to transform CAD diagrams into property graphs, enabling MLLMs to overcome structural blindness and perform reliable schematic auditing.
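The property-graph idea can be sketched in a few lines with networkx: components become attributed nodes, connectivity becomes edges, and a compliance rule becomes a graph query. The components and the rule below are invented for illustration, not taken from the paper's benchmark.

```python
# Sketch of the graph side of a vector-to-graph pipeline: once connectivity
# is explicit, a compliance check is a structural query, not pixel reading.
import networkx as nx

g = nx.Graph()
g.add_node("B1", kind="breaker", rating_a=16)
g.add_node("M1", kind="motor", load_a=20)
g.add_node("BUS", kind="busbar")
g.add_edge("BUS", "B1")
g.add_edge("B1", "M1")

# toy rule: every motor must sit behind a breaker rated at or above its load
for n, d in g.nodes(data=True):
    if d["kind"] == "motor":
        breakers = [m for m in g.neighbors(n) if g.nodes[m]["kind"] == "breaker"]
        ok = any(g.nodes[b]["rating_a"] >= d["load_a"] for b in breakers)
        print(n, "compliant" if ok else "violation: breaker undersized or missing")
```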
The paper addresses the "Shallow Exploration Trap" in in-context learning, where autoregressive models struggle to generate long reasoning trajectories needed for broader state coverage. They propose Length-Incentivized Exploration (LIE), a reinforcement learning approach that rewards longer reasoning trajectories while penalizing redundancy. Experiments on Qwen3 and Llama models demonstrate that LIE improves in-context exploration, leading to performance gains of 4.4% on in-domain and 2.7% on out-of-domain tasks.
Introduces Length-Incentivized Exploration (LIE), a novel reinforcement learning method to encourage longer and more diverse reasoning trajectories in in-context learning by rewarding length and penalizing redundancy.
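One plausible instantiation of such a reward (our assumption, not LIE's published form) is a length bonus minus an n-gram redundancy penalty:

```python
# Assumed sketch: reward trajectory length, penalize repeated trigrams.
def lie_reward(tokens, alpha=0.001, beta=0.5):
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    redundancy = 1 - len(set(trigrams)) / max(1, len(trigrams))
    return alpha * len(tokens) - beta * redundancy

diverse = [t for i in range(50) for t in f"explore state {i} via action {i % 3}".split()]
repeated = "try action a observe s".split() * 50
print(f"diverse: {lie_reward(diverse):.3f}  repeated: {lie_reward(repeated):.3f}")
```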
The paper introduces ThinkRouter, a confidence-aware routing mechanism that dynamically switches between latent and discrete reasoning spaces to improve reasoning efficiency and accuracy. It addresses the issue of noisy embeddings and overconfidence in incorrect latent reasoning trajectories by routing to the discrete token space when model confidence is low, and to the latent space otherwise. Experiments on STEM reasoning and coding benchmarks demonstrate that ThinkRouter significantly outperforms existing methods, achieving an average improvement of 19.70 points in Pass@1 and reducing generation length by up to 15.55%.
Introduces a confidence-aware routing mechanism, ThinkRouter, that adaptively selects between latent and discrete reasoning spaces based on model confidence to enhance reasoning performance.
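The routing rule itself reduces to a small branch on decoder confidence; the threshold and the max-probability confidence measure below are illustrative stand-ins for whatever calibration the full system uses.

```python
# Sketch of confidence-aware routing: confident steps stay in latent space
# (reuse the hidden state); uncertain steps commit to a discrete token.
import torch

def route(logits, hidden, embed, threshold=0.6):
    probs = torch.softmax(logits, dim=-1)
    conf, tok = probs.max(dim=-1)
    if conf.item() >= threshold:
        return hidden        # latent path: feed the hidden state back
    return embed(tok)        # discrete path: re-embed the sampled token

vocab, d = 100, 16
embed = torch.nn.Embedding(vocab, d)
hidden = torch.randn(d)
confident = torch.zeros(vocab); confident[7] = 10.0   # peaked distribution
uncertain = torch.zeros(vocab)                        # uniform distribution
print(route(confident, hidden, embed).shape, route(uncertain, hidden, embed).shape)
```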
This paper introduces INTENT, a novel inference-time planning framework for budget-constrained, tool-augmented LLMs that addresses the challenge of costly tool use in sequential decision-making. INTENT uses an intention-aware hierarchical world model to anticipate future tool usage and risk-calibrated costs, enabling more effective online decision-making. Experiments on a cost-augmented StableToolBench demonstrate that INTENT achieves superior task success while strictly adhering to budget constraints, even under dynamic market conditions.
Introduces INTENT, an inference-time planning framework that leverages an intention-aware hierarchical world model for budget-constrained tool use in LLMs.
The paper introduces STAR, a framework for predicting large language model performance from limited data by combining statistical methods with agentic reasoning. STAR uses specialized retrievers for external knowledge and embeds semantic features into Constrained Probabilistic Matrix Factorization (CPMF) to generate statistical expectations with uncertainty. A reasoning module based on Expectation Violation Theory (EVT) then refines these predictions, achieving a 14.46% improvement over statistical baselines under extreme data sparsity.
Introduces a hybrid framework, STAR, that integrates statistical expectations with agentic reasoning to improve LLM performance prediction, particularly under data sparsity.
The paper introduces PACE, a dual-level framework for compressing reasoning traces in Large Reasoning Models (LRMs) by addressing overthinking and excessive token usage. PACE employs prefix-protected optimization at the sequence level using decaying mixed rollouts to preserve valid reasoning paths while encouraging conciseness, and a difficulty-aware penalty at the group level to dynamically adjust length constraints based on query complexity. Experiments on DeepSeek-R1-Distill-Qwen models (1.5B/7B) demonstrate that PACE achieves up to 55.7% token reduction and up to 4.1% accuracy improvement on math benchmarks, generalizing to code, science, and general domains.
Introduces a dual-level compression framework, PACE, that combines prefix-protected optimization and difficulty-aware penalties to reduce token usage and improve accuracy in large reasoning models.
The paper introduces MetaphorStar, an end-to-end visual reinforcement learning framework designed to improve image metaphor understanding and reasoning. They address the limitations of current MLLMs in grasping nuanced cultural, emotional, and contextual implications in images by using reinforcement learning. The proposed framework, comprising the TFQ-Data dataset, TFQ-GRPO visual RL method, and TFQ-Bench benchmark, achieves an average performance improvement of 82.6% on image implication benchmarks and outperforms state-of-the-art models like Gemini-3.0-pro.
Introduces an end-to-end visual reinforcement learning framework, MetaphorStar, to significantly improve image metaphor understanding and reasoning capabilities in AI systems.
The paper introduces LoopFormer, a looped Transformer architecture designed for budget-conditioned reasoning by training on variable-length trajectories. A shortcut-consistency training scheme is proposed to align trajectories of different lengths, ensuring informative representations across varying loop iterations. LoopFormer demonstrates strong performance on language modeling and reasoning tasks under compute constraints, suggesting looped Transformers are well-suited for adaptive language modeling.
Introduces a shortcut-consistency training scheme for looped Transformers that aligns variable-length trajectories, enabling budget-conditioned reasoning.
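The looped-Transformer backbone is compact to sketch: one weight-tied block applied K times, with K acting as the compute budget. Dimensions and the single encoder layer are illustrative, not LoopFormer's configuration, and the shortcut-consistency objective is omitted.

```python
# Weight-tied looped block: the same parameters are iterated n_loops times,
# so one model serves many compute budgets.
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    def __init__(self, d_model=64, nhead=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, x, n_loops):
        for _ in range(n_loops):   # same weights every iteration
            x = self.block(x)
        return x

model = LoopedBlock()
x = torch.randn(2, 10, 64)
print(model(x, n_loops=2).shape, model(x, n_loops=8).shape)  # same shape, different budgets
```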
The paper introduces Latent Thoughts Tuning (LT-Tuning), a novel framework for latent space reasoning in LLMs that addresses feature collapse and instability issues present in existing latent reasoning paradigms. LT-Tuning employs a Context-Prediction-Fusion mechanism, combining contextual hidden states with predictive semantic guidance from the vocabulary embedding space to construct latent thoughts. Experiments demonstrate that LT-Tuning outperforms existing latent reasoning baselines by mitigating feature collapse and achieving improved reasoning accuracy, facilitated by a three-stage curriculum learning pipeline enabling dynamic switching between latent and explicit thinking.
Introduces a Context-Prediction-Fusion mechanism to construct more stable and informative latent thoughts by jointly leveraging contextual hidden states and predictive semantic guidance.
This paper identifies an implicit advantage symmetry in Group Relative Advantage Estimation (GRAE), the reward processing component of GRPO, that hinders exploration and difficulty adaptation in Reinforcement Learning with Verifiable Rewards (RLVR). The authors demonstrate that this symmetry leads to unchanged unsampled action logits and a bias towards medium-difficulty samples. They then propose Asymmetric GRAE (A-GRAE) to dynamically modulate exploration incentives and sample-difficulty focus.
Introduces Asymmetric GRAE (A-GRAE) to address the implicit advantage symmetry in GRPO, improving exploration and difficulty adaptation.
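For reference, group-relative advantage estimation standardizes rewards within a rollout group, which is what makes the advantages symmetric around zero. The asymmetric rescaling below, boosting positive advantages on hard groups, is only one way such a symmetry could be broken and is not necessarily A-GRAE's actual rule.

```python
# GRAE baseline plus one assumed asymmetric variant.
import numpy as np

def grae(rewards):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def asymmetric_grae(rewards, up_weight=1.5):
    adv = grae(rewards)
    pass_rate = np.mean(np.asarray(rewards) > 0)
    scale = up_weight * (1 - pass_rate)   # rarer successes get amplified more
    return np.where(adv > 0, adv * (1 + scale), adv)

hard_group = [0, 0, 0, 0, 0, 0, 0, 1]     # one success among eight rollouts
print(grae(hard_group).round(2))
print(asymmetric_grae(hard_group).round(2))
```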
The paper introduces Agentic Verifier, a novel execution-based agent designed to improve the accuracy of LLMs on competitive programming tasks by actively generating discriminative test inputs to expose behavioral discrepancies among candidate solutions. This is achieved through multi-turn interaction with code execution environments, iteratively refining input generation using targeted counterexamples rather than random sampling. The agent is trained using a pipeline combining large-scale data synthesis, rejection fine-tuning, and agentic reinforcement learning, resulting in significant accuracy improvements (up to +10-15% in Best@K) across five competitive programming benchmarks compared to existing execution-based re-ranking methods.
Introduces an agentic verifier that actively generates discriminative test inputs to expose errors in candidate code solutions, significantly improving performance on competitive programming tasks.
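The discrepancy-hunting core can be illustrated with two candidate solutions and random probing; the real agent refines inputs from targeted counterexamples rather than sampling blindly, and trains the generator with RL.

```python
# Toy discriminative-input search: find an input on which candidate
# solutions disagree, exposing the buggy one without a reference oracle.
import random
random.seed(3)

def cand_a(xs):   # correct max-subarray sum (Kadane's algorithm)
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def cand_b(xs):   # buggy: assumes the answer is never negative
    best = cur = 0
    for x in xs:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best

def find_discriminative_input(cands, trials=200):
    for _ in range(trials):
        xs = [random.randint(-5, 5) for _ in range(random.randint(1, 6))]
        if len({c(xs) for c in cands}) > 1:   # behavioral discrepancy exposed
            return xs
    return None

print(find_discriminative_input([cand_a, cand_b]))  # e.g. an all-negative array
```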
This chapter proposes a human-centered privacy (HCP) framework for AI, addressing privacy risks across the AI development lifecycle from data collection to deployment. It integrates technical solutions like federated learning and differential privacy with user perspectives, ethical considerations, and regulatory landscapes. The framework provides design guidelines and case studies, advocating for a multidisciplinary approach to embed privacy into human-centered AI (HCAI).
Introduces a human-centered privacy (HCP) framework that holistically integrates technical, ethical, and human factors perspectives to address privacy risks in human-centered AI systems.
This paper introduces VeruSyn, a data synthesis pipeline for generating a large-scale dataset of Verus-verified Rust programs to improve code-proof generation using LLMs. VeruSyn employs self-synthesis, tutorial-based synthesis, and agent trajectory synthesis to create a dataset of 6.9 million Rust programs with formal specifications and proofs. Fine-tuning a Qwen2.5-Coder-32B-Instruct model on this dataset achieves a better cost-proof tradeoff than state-of-the-art commercial models and outperforms existing research models.
Introduces VeruSyn, a novel data synthesis pipeline that generates a large-scale dataset of Verus-verified Rust programs, significantly improving the performance of LLMs in code-proof generation.
This paper introduces a novel path planning algorithm for substation robots that integrates deep reinforcement learning (DRL) with ant colony optimization (ACO) to address challenges posed by complex substation environments. The method uses a pheromone-guided exploration strategy to reduce ineffective exploration, a sample screening mechanism based on ACO path experience to improve Q-network training, and dynamic decision weight adjustment for transitioning from heuristic guidance to autonomous learning. Experimental results demonstrate that the proposed algorithm achieves higher sample efficiency, shorter path lengths, and better dynamic obstacle avoidance compared to PPO, DDQN, and A*, with field validation confirming improved task completion rates.
Integrates ant colony optimization with deep reinforcement learning to create a synergistic path planning framework that enhances exploration, sample efficiency, and adaptability for substation robots.
This paper identifies three key dimensions of safety for foundation model (FM)-enabled robots: action, decision, and human-centered safety, arguing that existing methods are insufficient for open-ended real-world scenarios. To address this, they propose a modular safety guardrail architecture with monitoring and intervention layers to ensure comprehensive safety across the autonomy stack. The paper further suggests cross-layer co-design strategies, such as representation alignment and conservatism allocation, to improve the speed and effectiveness of safety enforcement.
Proposes a modular safety guardrail architecture, composed of monitoring and intervention layers, to address the multifaceted safety challenges of deploying foundation model-enabled robots in real-world environments.
This paper introduces a human-simulation-based framework that enables LLM-driven AI systems to autonomously formulate questions and define tasks by reasoning about internal states, environmental observations, and interactions with other agents. By treating question formation as a distinct decision process, the framework integrates internal, environment-aware, and inter-agent-aware prompting scopes to broaden cognitive coverage. Experimental results in a multi-agent simulation demonstrate that environment-aware and inter-agent-aware prompting significantly reduce "no-eat" events, indicating improved adaptability and decision quality.
Introduces a novel framework for autonomous question formation in LLM-driven AI systems, enabling them to proactively identify and address relevant problems in dynamic environments.
This study examines the impact of different AI-driven nudging strategies within a digital health platform on Indigenous youth compliance with mental health assessments. A natural experiment was created by system disruptions that altered the types of nudges delivered (system-triggered, non-personalized, personalized), allowing the researchers to measure the effect on assessment completion rates. The key finding is that personalized nudges, specifically "Best Picture" messages, significantly improved compliance, highlighting the importance of two-way communication in digital health interventions for this population.
Demonstrates the critical role of personalized, scientist-triggered nudges in maintaining engagement and compliance within a digital health platform designed for Indigenous youth mental health.
The paper introduces ReasonEdit, a novel model editing framework for vision-language models (VLMs) specifically designed to address reasoning-heavy tasks. ReasonEdit incorporates human reasoning by storing explanations in a codebook and retrieving relevant facts during inference using a topology-balanced multimodal embedding method. Experiments across four VLMs and multiple rationale-based VQA datasets demonstrate that ReasonEdit achieves state-of-the-art editing performance and improves edit generalization by leveraging human reasoning.
Introduces ReasonEdit, a VLM editor that incorporates human reasoning during the editing process through a codebook and topology-balanced multimodal embedding retrieval mechanism.
The paper introduces AdNanny, a unified reasoning-centric LLM fine-tuned from a 671B DeepSeek-R1 checkpoint for various offline advertising tasks. They construct reasoning-augmented corpora with structured supervision and natural language explanations, and then use multi-task supervised fine-tuning with adaptive reweighting followed by reinforcement learning to align with online advertising objectives. Deployed in Bing Ads, AdNanny reduces manual labeling effort and improves accuracy, demonstrating a scalable and cost-effective solution by consolidating task-specific models.
The paper demonstrates that a single, reasoning-centric LLM, AdNanny, can effectively replace multiple task-specific models for offline advertising tasks, leading to improved accuracy and reduced manual effort.
This paper introduces LLM-Geo, a framework integrating the open-source DeepSeek-Coder model (specifically the 1.3B parameter version) into a GIS platform called DS-GeoAI, to address limitations of commercial LLM-based GIS solutions. The framework aims to reduce costs and increase accessibility by eliminating API dependencies and enabling local deployment. The DS-GeoAI platform achieves 90% accuracy in generating Python code for spatial analysis tasks after automated debugging, demonstrating comparable performance to commercial solutions with significantly lower operational costs.
Demonstrates the feasibility of using a lightweight, open-source LLM like DeepSeek-Coder for complex spatial analysis tasks within a GIS framework, achieving high accuracy and significant cost reduction compared to API-based commercial solutions.
The paper introduces HERMES, a risk-aware end-to-end autonomous driving framework that integrates vision-language models with explicit long-tail risk cues for improved trajectory planning in complex, mixed-traffic scenarios. HERMES leverages a foundation-model-assisted annotation pipeline to generate structured Long-Tail Scene Context and Long-Tail Planning Context, which are then fused with multi-view perception and historical motion cues in a Tri-Modal Driving Module. Experimental results on a real-world long-tail dataset demonstrate that HERMES outperforms existing end-to-end and VLM-driven approaches, particularly in handling long-tail scenarios.
Introduces a holistic risk-aware end-to-end multimodal driving framework (HERMES) that explicitly incorporates long-tail risk cues into trajectory planning using a foundation-model-assisted annotation pipeline and a Tri-Modal Driving Module.

