100 papers published across 12 labs.
LLMs can reason more efficiently by triaging queries and applying deep thought only when truly needed, thanks to a new coarse-to-fine inference framework.
Stop wasting RL on easy problems: a difficulty-aware curriculum for SFT and RL unlocks better reasoning in LLMs.
Forget rigid pipelines and static prompts: Nurture-First Development lets domain experts grow AI agents through conversation, turning tacit knowledge into reusable assets.
Reasoning rerankers don't magically fix fairness issues in search, preserving the biases of their input rankings despite boosting relevance.
Autonomous driving's next leap hinges on reasoning, not just perception, yet current LLM-based approaches are too slow for real-time control.
Forget brittle KG traversals: MDER-DR's entity-centric summaries and decomposed queries boost multi-hop QA accuracy by up to 66% over standard RAG.
Achieve up to 12x greater sample efficiency in reasoning tasks by relaxing strict imitation constraints in on-policy distillation, enabling smaller models to match the performance of much larger ones.
Can a dedicated research program keep a smaller, local LLM competitive against global giants in the rapidly evolving AI landscape?
A 7B model, guided by verifiable execution rewards, can now rival the code reasoning of models more than four times its size.
Unlock massive multilingual reasoning data: the Multilingual Reasoning Gym enables parallel data generation across 14 languages, opening doors for training and evaluating multilingual reasoning models at scale.
LLM agents can now learn from their mistakes and successes in complex tasks, improving performance by up to 28.5% by extracting and applying structured learnings from past execution trajectories.
By forecasting compact world dynamics before taking action, DynVLA leapfrogs traditional CoT methods to achieve more informed and physically grounded autonomous driving decisions.
Uncover the hidden causal chains inside your LLM with Causal Concept Graphs, which outperform existing methods for reasoning by explicitly modeling concept dependencies.
LLMs can be made better software engineers by pre-training them to reconstruct the messy, iterative development process that led to the final, clean code in repositories.
Clinicians using HeartAgent, a cardiology-specific agent system, improved diagnostic accuracy by 26.9% and explanatory quality by 22.7% compared to unaided experts.
Multilingual math reasoning just got a serious upgrade: mAceReason-Math offers a meticulously translated and cleaned dataset of challenging problems across 14 languages, purpose-built for RLVR training.
Clinical AI can achieve clinician-level diagnostic accuracy and continuous improvement via a self-evolving framework that actively learns from clinical experience.
By fusing language model reasoning with diffusion-based trajectory generation, KnowDiffuser leapfrogs existing autonomous driving planners on the nuPlan benchmark.
Forget scaling reasoning: this work shows that scaling visual perception using code-grounded data is the real key to unlocking MLLMs' STEM abilities.
By grounding LLMs in a hybrid knowledge base and using a Chain of Verification approach, PharmGraph-Auditor turns unreliable LLM generators into transparent reasoning engines for prescription auditing.
Explicitly teaching LVLMs to reason step-by-step with reinforcement learning unlocks state-of-the-art performance on multimodal object-entity relation extraction.
LLMs can now autonomously retrieve relevant memories from a database using specialized tools, significantly improving performance on long-term conversational question answering.
LVLMs can be jailbroken by "Reasoning-Oriented Programming," which chains together harmless visual inputs to trigger harmful reasoning, much like return-oriented programming in traditional security exploits.
An AI agent can triage remote patient monitoring data with higher sensitivity than individual clinicians, suggesting a path to scalable and cost-effective patient monitoring.
Reasoning unlocks factual knowledge in LLMs, but beware: hallucinated reasoning steps can poison the well.
LLMs can now emulate debuggers, stepping through code and setting breakpoints, opening the door to more interactive and controllable neural program execution.
Stop training LLMs on lucky guesses: this new RL method uses the model's own in-context learning ability to identify and upweight high-quality reasoning traces, leading to better performance.
LLMs that ace standard coding benchmarks spectacularly fail at esoteric languages, revealing a reliance on memorization rather than true reasoning.
By communicating in a shared latent space, Latent-DARM lets you combine the global planning of diffusion models with the fluency of autoregressive models, boosting reasoning accuracy by up to 14% while slashing token usage.
LLM agents can now achieve a +41pp boost in first-try success and 100% accuracy in 2-way logistics compositions by using PRECEPT's novel combination of retrieval, memory, and prompt evolution.
LLMs can evolve surprisingly effective, interpretable Python planners that rival state-of-the-art classical planners, at a fraction of the computational cost.
LLMs often choose moral consistency over basic common sense, especially when the contradiction is committed by the main character in a narrative.
LLMs struggle to generate diverse and specific connections between concepts, even with high token budgets and "thinking" prompts, revealing a gap in creative associative reasoning.
Forget brittle multi-hop reasoning: TaSR-RAG's taxonomy-guided triple matching boosts RAG performance by 14% without costly graph construction.
Chain-of-Agents can reason more accurately over long contexts by processing information chunks in an order determined by Chow-Liu dependency trees, rather than relying on default or semantic similarity.
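The Chow-Liu construction behind that ordering is standard enough to sketch. A minimal Python illustration, assuming a precomputed matrix of pairwise mutual-information estimates between chunks (the paper's estimator and agent interface are not given in this summary): build a maximum spanning tree over the chunk-dependency graph, then read chunks off in tree order.

```python
# Hedged sketch: order context chunks by estimated dependency strength
# via the Chow-Liu construction (maximum spanning tree over pairwise
# mutual information), rather than by document position.

def chow_liu_order(mi, root=0):
    """mi: symmetric matrix of pairwise mutual-information estimates.
    Returns a chunk ordering from a maximum spanning tree (Prim's
    algorithm) traversed breadth-first from `root`."""
    n = len(mi)
    in_tree = {root}
    edges = []  # (parent, child) pairs of the spanning tree
    while len(in_tree) < n:
        # attach the strongest dependency crossing the cut
        p, c = max(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: mi[e[0]][e[1]],
        )
        in_tree.add(c)
        edges.append((p, c))
    children = {}
    for p, c in edges:
        children.setdefault(p, []).append(c)
    # breadth-first traversal of the tree gives the processing order
    order, queue = [], [root]
    while queue:
        node = queue.pop(0)
        order.append(node)
        queue.extend(children.get(node, []))
    return order

# toy pairwise-dependency matrix for four chunks
mi = [[0.0, 0.9, 0.1, 0.2],
      [0.9, 0.0, 0.7, 0.1],
      [0.1, 0.7, 0.0, 0.3],
      [0.2, 0.1, 0.3, 0.0]]
print(chow_liu_order(mi))  # [0, 1, 2, 3]
```

Prim's algorithm suffices at this scale; the point is only that traversal order follows estimated dependencies instead of the default or semantic-similarity orderings.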
LLM-powered recommendation agents can now autonomously investigate and bridge information gaps, leading to better recommendations, thanks to a new tool-augmented reasoning framework.
Even a single error from a conditional independence oracle can prevent unique identification of a Bayesian network structure, no matter how tightly graph parameters such as treewidth are bounded.
A 4B parameter model can now beat much larger models at social reasoning, thanks to a new RL framework that aligns model reasoning trajectories with human cognition.
Retrieval-augmented agents get a serious reasoning boost by explicitly evaluating their own retrieval quality at each step, leading to state-of-the-art performance on multi-hop question answering.
LLMs that dominate in strategic reasoning often choke in real-time zero-sum games, revealing a critical strategy-execution gap that current benchmarks miss.
Stop letting sparse rewards bottleneck your VLN agent: SACA disentangles failed trajectories into valid prefixes and divergence points for dense supervision, unlocking SOTA performance.
LLMs' attention patterns subtly shift with emotional tone, and explicitly accounting for these shifts during training improves reading comprehension even on neutral datasets.
Pathology MLLMs can now better incorporate diagnostic standards during reasoning, thanks to a new memory architecture inspired by how human pathologists process information.
Text-only foundation models can perform surprisingly well on complex 3D spatial reasoning tasks, rivaling multimodal models, when equipped with a structured spatial representation derived from 3D reconstruction.
By cleverly "self-rephrasing" LLM outputs, this work coaxes reasoning LLMs to handle audio inputs without sacrificing their chain-of-thought abilities.
Achieve up to 11x navigation performance gains in functional buildings by explicitly encoding and exploiting a priori spatial knowledge.
LLMs can now tackle complex table QA with 20%+ accuracy gains, thanks to a multi-agent framework that decomposes queries and orchestrates reasoning between specialized database and knowledge graph agents.
A new process reward model acts as a universal geospatial verifier, scaling the performance of both specialized and general-purpose VLMs in remote sensing.
LLMs can get a 27.8% boost in mathematical reasoning by fusing a hardware-efficient optimal control layer directly into their architecture, enabling planning before prediction.
AutoAgent dynamically evolves agent cognition and memory to achieve superior performance in complex, dynamic environments, without requiring external retraining.
LLMs can be steered away from hallucination and towards more robust reasoning by using contrastive learning to capture the shared structure of successful reasoning paths, penalizing hallucinated steps even when the final answer is correct.
LLM reasoning research is inadvertently paving a dangerous path towards AI situational awareness and strategic deception, demanding a re-evaluation of current safety measures.
Achieve more efficient reasoning in Transformers without increasing test-time cost by using training-only techniques that guide attention and dynamically adjust sharpness.
Human-AI interaction isn't just augmentation, it's a new cognitive entity with its own emergent "vibe," demanding we rethink epistemology and education.
By explicitly optimizing for both reasoning structure and chemical consistency, Logos offers a pathway to reliable and interpretable AI systems for molecular science, outperforming larger models with a fraction of the parameters.
Unlock multimodal interleaved generation in existing vision-language models without large interleaved datasets using a novel reinforcement learning approach with hybrid rewards.
LLMs get *more* honest when they have time to reason, defying human tendencies and revealing surprising insights about their internal representational geometry.
LLMs trained with reinforcement learning from verifiable rewards (RLVR) become overconfident in incorrect answers, but a simple fix—decoupling reasoning and calibration objectives—can restore proper calibration without sacrificing accuracy.
Mixture-of-Experts models might be hiding more of their reasoning than we thought, thanks to a newly quantified "opaque serial depth" metric.
Looping helps transformers think harder on math problems, while memory lets them remember more commonsense facts, and combining both beats simply scaling up layers.
LLMs may secretly be better at information retrieval than embedding similarity suggests, but current datasets are too "short-sighted" to prove it.
LLMs can switch between reasoning and factual answering on the fly, without retraining, simply by conditioning on specific token prefixes.
LLM agents can learn to continuously adapt and improve in complex environments by reflecting on past experiences and explicitly storing/retrieving reusable lessons, leading to substantial performance gains.
Even the most advanced LLMs stumble when asked to reason over a large, heterogeneous document corpus, achieving only 34% accuracy on the new OfficeQA Pro benchmark despite direct access to the relevant documents.
LLMs can reliably guide industrial maintenance decisions when constrained by deterministic evidence construction and rule-based verification, even with incomplete and heterogeneous data.
LALMs can now better capture the nuances of human emotion, moving beyond single-label predictions with a new ambiguity-aware training framework that aligns model outputs with the full spectrum of human perception.
Despite the buzz around Tiny Recursive Models, directly adapting their refinement mechanism into autoregressive architectures yields no reliable performance boost, suggesting the original TRM's success may stem from other factors.
SuperInvesting, a specialized AI system, significantly outperforms general-purpose LLMs like GPT and Gemini on a new financial intelligence benchmark, suggesting domain-specific architectures are crucial for reliable investment research.
LLM-powered agents can now better recover from errors by classifying failure types and retrieving relevant historical contexts.
Multimodal LLMs struggle with math because they *see* poorly, but a multi-agent system focused on visual evidence can dramatically improve their perception and reasoning.
Current multimodal math models struggle with visual interpretation, symbol alignment, and consistent reasoning, highlighting the need for a unified "Perception-Alignment-Reasoning" framework.
LLMs can now reliably extract complex, n-ary drug combinations from biomedical text, surpassing previous methods that were limited to binary interactions.
MLLMs can gain a surprising boost in 3D spatial reasoning simply by encoding 3D geometric attributes of objects as textual references indexed by unique IDs.
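The mechanism in that line lends itself to a tiny sketch: serialize each object's 3D attributes as a textual record keyed by a unique ID, so the model can refer to objects by ID while reasoning about space. Field names and the upstream detection pipeline here are illustrative assumptions, not the paper's format.

```python
# Hedged sketch: textual 3D references indexed by unique object IDs.
# The attribute schema below is invented for illustration.
objects = [
    {"id": "obj_1", "category": "chair", "center": (1.2, 0.0, 3.4),
     "size": (0.5, 0.9, 0.5)},
    {"id": "obj_2", "category": "table", "center": (1.0, 0.0, 2.1),
     "size": (1.4, 0.7, 0.8)},
]

def to_spatial_prompt(objects):
    lines = ["Objects in the scene (positions in meters, x/y/z):"]
    for o in objects:
        cx, cy, cz = o["center"]
        w, h, d = o["size"]
        lines.append(
            f"[{o['id']}] {o['category']}: center=({cx:.1f}, {cy:.1f}, "
            f"{cz:.1f}), size=({w:.1f} x {h:.1f} x {d:.1f})"
        )
    return "\n".join(lines)

print(to_spatial_prompt(objects))
```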
Swapping variables in mathematical formulas during graph contrastive learning surprisingly improves retrieval accuracy by preserving crucial algebraic relationships.
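As a hedged illustration of that augmentation (the paper's graph encoder and exact swapping scheme are not specified in this one-liner): a consistent, single-pass variable permutation rewrites a formula while preserving its algebraic structure, yielding a structure-preserving positive pair for contrastive training.

```python
import random
import re

# Hedged sketch: variable-swap augmentation for formulas. A permutation
# applied consistently in one pass keeps the algebraic relationships
# intact, so the original and swapped formulas should embed nearby.
# The single-letter variable set is an illustrative assumption.
VARS = set("abcxyz")

def swap_variables(formula, rng=random):
    present = sorted(set(re.findall(r"[a-z]", formula)) & VARS)
    shuffled = present[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(present, shuffled))
    # single-pass substitution so swaps don't cascade
    return re.sub(r"[a-z]", lambda m: mapping.get(m.group(), m.group()), formula)

print(swap_variables("a*x + b*x + c"))  # e.g. "x*c + a*c + b"
```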
Stop wasting compute: CODA dynamically adjusts reasoning depth based on problem difficulty, slashing token costs by 60% on easy tasks while boosting performance on hard ones.
Even the best open-weight LLMs still fail on nearly two-thirds of questions requiring reasoning over scientific tables, highlighting a persistent "execution bottleneck" in translating strategy to action.
Forget token counting: this work introduces a semantic prior based on surprisal to compress LLM reasoning traces, achieving better accuracy and fluency than heuristic length penalties.
A rigorous proof establishes the correctness of CoPPar Tree, guaranteeing consistency in parallel computations.
Forget bigger models: clever prompt engineering with explicit decision rules crushes fine-tuning and embeddings for word sense disambiguation.
Current LLM benchmarks hide critical reasoning failures in long, multimodal documents, which BRIDGE exposes through step-level evaluation.
Hallucinations in multimodal reasoning models are linked to high-entropy transition words, and can be reduced by decoding with probability-weighted continuous embeddings rather than discrete tokens during these uncertain states.
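A minimal sketch of that decoding idea, assuming a single-step interface and an invented entropy threshold (how the original work detects uncertain states is not given in this summary): at a high-entropy step, feed back the probability-weighted average of token embeddings instead of committing to one discrete token.

```python
import torch

# Hedged sketch: continuous-embedding decoding at uncertain steps.
# Tensor shapes and the threshold are illustrative assumptions.
def next_input_embedding(logits, embedding_table, entropy_threshold=3.0):
    probs = torch.softmax(logits, dim=-1)                 # (vocab,)
    entropy = -(probs * torch.log(probs + 1e-9)).sum()
    if entropy > entropy_threshold:
        # uncertain step: expected embedding under the token distribution
        return probs @ embedding_table                    # (dim,)
    # confident step: ordinary discrete decoding
    return embedding_table[probs.argmax()]

logits = torch.randn(32_000)
table = torch.randn(32_000, 4096)
print(next_input_embedding(logits, table).shape)  # torch.Size([4096])
```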
LLMs still fail to follow complex instructions that entangle content, formatting, control flow, and real-world constraints, despite progress on simpler benchmarks.
Skip the expensive supervised fine-tuning: this RL-only method teaches LLMs to use tools by showing them how in-context, then gradually removing the crutches until they're tool-using pros, zero-shot.
LLMs can now safely navigate the complexities of acupuncture clinical decision support, thanks to a neuro-symbolic framework that slashes safety violations from 8.5% to zero.
Human-AI collaboration using LLMs and symbolic solvers just cracked a notoriously hard problem in combinatorial design theory, finding a tight lower bound on Latin square imbalance.
LLMs can slash inference costs by 80% without sacrificing accuracy, simply by learning to recognize when their own reasoning is shaky and needs a second opinion.
LLMs can significantly boost micro-expression recognition by reasoning about subtle facial muscle movements when guided by structured visual and relational prompts.
LLMs can achieve state-of-the-art results on complex reasoning tasks with far fewer parameters by iteratively excavating and reasoning over external knowledge.
Achieve up to 52.5% compression in LLM chain-of-thought reasoning *while improving* accuracy by dynamically calibrating CoT length.
By decomposing RAG along the document axis with specialized agents, SPD-RAG achieves state-of-the-art performance on multi-document QA while slashing API costs by over 60%.
Particle filtering reveals a fundamental limit to inference-time sampling methods for LLMs, suggesting that simply increasing the number of samples has diminishing returns.
Forget fuzzy language: CoCo uses executable code as Chain-of-Thought to generate images with unprecedented control and precision, blowing away existing methods on complex scenes.
Continuous reasoning in latent space crushes explicit reasoning for multilingual tasks, especially when training data is scarce.
Strategic data curation using a dual-consensus approach beats brute-force training on large noisy datasets for process reward modeling in biological reasoning.
LLMs still struggle with complex legal reasoning, as evidenced by their difficulty in solving Islamic inheritance cases, even with a new dataset designed to support step-by-step reasoning.
LLMs can orchestrate human input to UAVs, dramatically improving mission success rates while minimizing human interaction.
LLMs can generate better recommendations if they pause to verify their reasoning steps, rather than reasoning in one long chain.
Robots get smarter at in-context learning by "thinking" visually about future trajectories, leading to better generalization and success rates in manipulation tasks.
VLMs can't count blocks because they lack a view-consistent spatial interface, but decomposing scenes into orthographic projections fixes it.
Chain-of-Thought prompting doesn't always improve LLMs' ability to solve discrete optimization problems, and surprisingly, "disordered" datasets can sometimes boost performance on simpler tasks.