Search papers, labs, and topics across Lattice.
100 papers published across 5 labs.
Forget finetuning – Kumiho's graph-native memory lets you swap in a better LLM and instantly double your agent's reasoning accuracy on complex cognitive tasks.
Training on synthetically generated data can significantly boost both the diversity and quality of commonsense reasoning in LLMs, outperforming models trained on scarce human-annotated data.
Ditch static embeddings: Generative retrieval, powered by reinforcement learning, lets models dynamically reason about relevance, outperforming larger contrastively-trained models on reasoning-intensive tasks.
Forget tool-augmented systems: NEO shows you can consolidate search, recommendation, and reasoning into a single language-steerable LLM by representing items as SIDs and interleaving them with natural language.
Achieve state-of-the-art fine-grained visual recognition without training by adaptively invoking reasoning in a Large Vision-Language Model only when needed, significantly reducing computational overhead.
Instead of passively transcribing doctor-patient dialogues, this system actively models what's known, what's missing, and what questions to ask next, paving the way for more intelligent EMR systems.
LLMs can achieve state-of-the-art reasoning accuracy with significantly fewer tokens by rewarding intermediate reasoning steps that maximize information gain and maintain monotonic progress.
Chain-of-thought prompting makes large language models smarter, but it also makes them less safe, a problem this paper tackles by forcing models to think about safety *before* reasoning.
LLMs can slash over 80% of their chain-of-thought tokens with a minor accuracy boost, thanks to a new RL-based method that targets the "Minimal Sufficient Length" of reasoning.
By aligning hidden representations, CRAFT achieves a remarkable 79% improvement in reasoning safety, suggesting that latent-space interventions are a potent defense against jailbreaks.
Grounding LALM reasoning in diverse, reliability-weighted acoustic evidence blows away the competition in Audio Question Answering, proving that verifiable chains beat black boxes.
LLMs struggle with spatial reasoning in embodied settings and 3D structure identification even when exposed to visual modalities, but fine-tuning smaller models offers a surprisingly effective alternative to brute-force scaling.
Forget expensive 3D training data: Loc3R-VLM shows how to give 2D vision-language models strong 3D spatial reasoning by distilling knowledge from a pretrained 3D foundation model using only monocular video.
Mimicking human cognition, FLAIR lets dialogue models "think while listening," boosting performance without adding latency.
Scene graphs plus LLMs let robots ask clarifying questions, boosting multi-agent task success by 15%.
LLMs can't reason their way through Rust verification, struggling to complete proofs even with substantial hints, revealing a critical gap in their ability to handle the rigorous demands of secure software development.
LLMs can escape the trap of confidently wrong reasoning by co-evolving a generator and verifier from a single model, bootstrapping each other to break free from flawed consensus.
Retrieval-augmented LLM agents can learn to learn from experience, achieving significantly better generalization on unseen tasks by combining the strengths of fine-tuning and in-context retrieval.
Stop chasing leaderboard gains on generic benchmarks: PJB reveals that domain-specific weaknesses in person-job retrieval far outweigh the benefits of general model upgrades, and that query understanding modules can actually hurt performance.
Training LLMs to reconstruct arguments boosts their critical thinking abilities across diverse tasks, suggesting a promising new direction for imbuing models with reasoning skills.
LLM agents can learn task structure at test time with 50-94x greater sample efficiency using a curriculum-based learning system, but this reveals a critical bottleneck in perceptual grounding that must be addressed.
Dashcam videos can now be directly linked to legal responsibility determinations via a novel multimodal dataset and legal reasoning framework, outperforming existing LLMs and agent-based systems.
LLMs can now infer plausible stage layouts from unstructured text alone, opening up new possibilities for automated media production.
Forget prompt engineering: AgentFactory lets LLM agents self-evolve by accumulating and refining executable Python subagents, making task re-execution more reliable and efficient.
An 8B parameter model, RideJudge, outperforms 32B baselines in ride-hailing dispute adjudication by aligning visual semantics with evidentiary protocols, achieving 88.41% accuracy.
Forget expensive human annotations – this new method uses information theory to automatically score each step of an LLM's reasoning process, making chain-of-thought supervision scalable and efficient.
Stop wasting compute: this RL-trained orchestration policy adaptively decides when your embodied agent should reason with an LLM, slashing latency and boosting task success compared to fixed strategies.
LLMs can't crack Clue: even state-of-the-art models struggle with multi-step deductive reasoning in a simulated text-based game, and fine-tuning doesn't reliably help.
LLMs struggle to transfer knowledge across different writing scripts, even within the same language, revealing a critical limitation in current cross-lingual understanding.
Symbolic planning unlocks significant gains in RTL synthesis and summarization, boosting LLM performance by 20% without fine-tuning.
Video generation models don't reason frame-by-frame as previously thought; instead, they explore multiple solutions during diffusion denoising and progressively converge, revealing a "Chain-of-Steps" mechanism.
LLMs often fail to update their final predictions after interventions on intermediate reasoning steps, suggesting that these structures function more as influential context than stable causal mediators.
VL-PRMs often reward hallucinated visual premises and penalize correct grounded statements, but this work shows you can fix that by explicitly verifying visual facts, leading to significant gains in reranking accuracy.
Despite advances in vision-language models, reasoning across sparse, multi-view observations remains surprisingly unsolved, with current models barely outperforming random guessing on a new benchmark.
Video-LLMs can hallucinate and perform *worse* with chain-of-thought reasoning due to "visual anchor drifting," but a simple frame repetition strategy guided by a learned scoring function can fix it.
Chain-of-Thought reasoning in LLMs is a double-edged sword, reducing sycophancy in final answers but simultaneously masking it with deceptive, logically inconsistent justifications.
MLLMs often fail at multimodal emotion recognition due to premature commitment to data priors, but a new architecture, HyDRA, uses reinforcement learning to synthesize evidence-grounded rationales and significantly outperforms strong baselines, especially in ambiguous scenarios.
By explicitly exposing the model's reasoning process during SVG generation, CTRL-S achieves higher task success rates, superior SVG code quality, and exceptional visual fidelity compared to existing methods.
LLMs' chain-of-thought reasoning often falls apart due to factual incompleteness, with errors compounding across multiple hops, as revealed by a new multi-hop QA dataset.
Chain-of-thought reasoning makes vision-language models *more* overconfident, even when it improves accuracy.
Forget fixed schedules: this new discrete diffusion model learns when to stop, adapting computation to the complexity of each reasoning problem.
LRMs can often recover from injected errors in their reasoning steps, revealing a hidden "critique" ability that can be harnessed to improve performance without additional training.
Unsupervised RL for math reasoning hinges on a model's pre-existing logical abilities, and its success can be predicted by whether the training trajectory stays within stable "manifolds" of good solutions.
LLMs can be taught emotional intelligence by explicitly reasoning about user appraisals, leading to more emotionally appropriate and factually reliable responses.
LLMs' true reasoning can be detected via activation probing even when their chains-of-thought are misleading rationalizations, revealing a discrepancy between internal processing and external justification.
Counterfactual examples supercharge visual in-context learning, enabling smaller vision-language models to outperform larger ones by focusing on causal relationships rather than superficial correlations.
ARISE lets language models solve math problems better by learning and reusing successful solution strategies, outperforming existing RL methods, especially on harder, out-of-distribution problems.
LLMs can gain substantial financial reasoning skills without fine-tuning, thanks to a new framework that distills knowledge into human-readable, version-controlled skill artifacts.
Instead of just gathering more context, turn retrieval into a mechanism for actively testing and refining a provisional answer, yielding substantial gains in factual QA accuracy.
Achieve state-of-the-art multi-hop question answering by pre-computing bridging facts at index time, eliminating the need for complex online reasoning or graph traversal.
Constraint propagation can significantly boost dynamic programming by pruning states and transitions, but the overhead needs further optimization.
Shrinking LLM reasoning for mobile devices is now possible: LoRA adapters, RL-based budget forcing, and KV-cache tricks let Qwen2.5-7B reason efficiently on-device.
LLMs struggle to formalize program post-conditions from natural language, with even the best models failing to correctly formalize all tasks, highlighting a critical gap in their ability to bridge natural language understanding and formal verification.
LLMs can now remember and reason about long-term conversations with significantly improved accuracy thanks to a new temporal-aware memory framework that structures dialogue into event calendars.
Automating vehicle fault diagnostics by treating error codes as a language unlocks scalable predictive maintenance and causal understanding in complex automotive systems.
LLMs can escape the trap of converging on popular but incorrect answers in unsupervised RLVR by temporarily "unlearning" and exploring diverse response options.
Supervised fine-tuning can be dramatically improved by explicitly encouraging exploration of low-confidence data and suppressing high-confidence errors, leading to sustained gains in mathematical reasoning even after extensive RLVR training.
Mismatched levels of "mind-reading" between AI agents tank their ability to collaborate, but a simple adaptive strategy can fix it.
LLMs can now plan complex, sequential robotic maneuvers through narrow spaces by learning from human demos and refining with geometric rewards, outperforming traditional methods.
By prioritizing diversity over accuracy in experience replay, DyJR significantly boosts LLM reasoning performance in RL, outperforming GRPO and other baselines without sacrificing training efficiency.
Multi-hop data synthesis using HopChain boosts VLM performance across a wide range of tasks, with gains of over 50 points in accuracy for ultra-long-context reasoning.
LLMs can exhibit surprising "strategic realism" when analyzing an ongoing geopolitical conflict, but their reasoning falters in politically ambiguous situations, revealing critical domain-specific limitations.
Small language models can achieve surprisingly robust question answering by actively clustering their memories into semantically coherent groups, outperforming standard retrieval methods.
Imperfect knowledge graphs can lead to retrieval drift and hallucinations in multi-hop reasoning, but C2RAG offers a robust solution that improves EM by 3.4% and F1 by 3.9% over existing methods.
LLMs' "Aha!" moments aren't about magic tokens, but about explicitly verbalizing and managing uncertainty during reasoning, which drives performance.
LLMs are still wide open to jailbreaks, but this new method cuts attack success rates by nearly 5x by monitoring *intermediate* reasoning steps, not just the final output.
LLMs can ace grading physics problems with clear solutions, but fall flat when judging essays, revealing that their assessment skills hinge more on task clarity than raw intelligence.
Forget relying solely on numbers: VoT unlocks richer time series forecasts by fusing LLM reasoning over event-related text with multi-level data alignment.
By verifying its reasoning steps both locally and globally, MiroThinker-H1 achieves state-of-the-art performance in complex research tasks, demonstrating the power of integrated verification for reliable multi-step problem solving.
LLMs still fall short when it comes to reasoning about real-world policy interventions and causal study design, as revealed by the new InterveneBench benchmark.
AI can now semi-autonomously formalize complex mathematical theorems like the Vlasov-Maxwell-Landau equilibrium, even outpacing traditional mathematical research.
Forget RLHF and massive datasets: SAGE co-evolves reasoning abilities in LLMs using only a small seed set and a clever quartet of self-improving agents.
An LLM agent can accurately pinpoint perception and planning failures as the leading causes in over half of real-world autonomous vehicle incidents.
Symbolic seams offer a blueprint for AI systems that are not black boxes, but assemblies of interchangeable parts, paving the way for more transparent and adaptable AI.
A 7B model trained with RL can outperform 72B-scale general MLLMs in robotic manipulation process supervision by explicitly reasoning about progress toward the final task goal.
Democratizing process mining, PMAx uses a multi-agent system to translate natural language queries into precise process insights without sacrificing data privacy or mathematical accuracy.
Neuro-symbolic memory lets multimodal agents beat purely neural memory systems by up to 12.5% on constrained reasoning tasks.
GPT 5.4 Pro may have made novel contributions to mathematics, outperforming published results on two unsolved problems, as measured by the new HorizonMath benchmark.
Test-time RL, intended to improve LLM reasoning, can backfire spectacularly, amplifying existing safety flaws and even degrading reasoning itself when exposed to adversarial prompts.
Turns out, blindly widening the beam search in your LLM can actually *hurt* performance due to overestimation bias, and the optimal width depends critically on your scorer's signal-to-noise ratio.
Current MLLMs struggle with video event prediction due to poor logical reasoning and visual information utilization, but a new "Chain of Events" training paradigm significantly boosts performance.
Even the best LLMs fail to follow complex constraints in tool use more than 50% of the time, revealing a critical weakness in real-world agent deployment.
LLMs don't stick to their ethical guns: they hop between moral frameworks mid-reasoning, making them vulnerable to manipulation.
Speech LLMs can now better understand your emotions: a new RL approach boosts paralinguistic understanding by 8-12% over state-of-the-art models.
LLM ensembles, especially when combined with comparative prompting, can achieve human-level performance on subjective semantic evaluation tasks involving substantial inter-annotator disagreement.
LLMs get a reasoning boost from a brain-inspired architecture that dynamically wires up specialized agents, outperforming ReAct and Tree of Thoughts.
VLMs can achieve near-perfect anomaly detection in physical systems by incorporating structured physics priors into multi-turn dialogues, a massive leap from previous methods.
Claiming LLMs possess human-like reasoning based on benchmarks alone is shaky ground: a nomological network approach offers a more rigorous way to link theoretical capabilities to empirical measurements.
LLMs can solve math problems more efficiently by "thinking" silently in their latent space, adaptively refining their reasoning process only as much as needed, and slashing token usage by over 90%.
LLMs surprisingly mimic human strategies for generating plausible student misconceptions, but their success hinges on first solving the problem correctly.
LLMs can be directly used as graph kernels for text-rich graphs, enabling message passing on raw text and outperforming methods that rely on static embeddings.
Forget end-to-end video understanding: RieMind shows that explicitly grounding LLMs in 3D scene graphs unlocks a 16% jump in spatial reasoning, suggesting structured representations are the key.
Diffusion models can be made more efficient and produce better outputs by dynamically allocating compute based on a learned "difficulty" signature, without any retraining.
Pythonistas can now easily formalize legal reasoning thanks to PYTHEN, a new framework that brings the power of PROLEG-style defeasible logic to the Python ecosystem.
Agents that explicitly route questions to different reasoning frameworks based on their underlying belief spaces can be both faster and more accurate than those that try to blend incompatible approaches.
Stop MLLM hallucinations in VideoQA: ClueNet's two-stage training and adaptive clue filtering boosts accuracy by 1.1% while improving interpretability and efficiency.
LLMs can now orchestrate system-wide optimizations across microservices, boosting throughput by 36% and slashing response times by 27%.
LLMs can now achieve state-of-the-art performance in transaction analytics by grounding them with a retrieval-augmented knowledge base of behavioral patterns derived from financial transactions.
Insurance LLM slashes hallucinations to a record-low 0.6% while beating DeepSeek and Gemini, proving you *can* have domain mastery without sacrificing general smarts.
By dynamically adjusting the candidate set size based on Shannon entropy, Top-b offers a more nuanced approach to decoding that balances exploration and exploitation, outperforming static truncation methods.
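The entropy-adaptive truncation idea in the last entry can be sketched generically. This is a toy illustration, not the paper's actual Top-b algorithm: the normalized-entropy scaling rule below is an assumption, chosen only to show how a flat next-token distribution keeps more candidates than a peaked one.

```python
import math

def top_b(probs):
    """Toy entropy-adaptive truncation (illustrative, not the paper's Top-b).

    Keeps a fraction of the candidate set proportional to the normalized
    Shannon entropy of the next-token distribution: near-uniform distributions
    retain most candidates (exploration), peaked ones keep very few
    (exploitation).
    """
    # Shannon entropy of the distribution (0 log 0 treated as 0).
    h = -sum(p * math.log(p) for p in probs if p > 0)
    # Maximum possible entropy for this vocabulary size.
    h_max = math.log(len(probs))
    # Candidate-set size scales with normalized entropy; always keep >= 1.
    b = max(1, round(len(probs) * h / h_max)) if h_max > 0 else 1
    # Return the indices of the b highest-probability tokens.
    return sorted(range(len(probs)), key=lambda i: -probs[i])[:b]
```

A static top-k would keep the same number of tokens in both regimes; here a uniform distribution over four tokens keeps all four candidates, while a sharply peaked one collapses to a single candidate.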