March 18 – March 25, 2026

Reasoning & Chain-of-Thought - Weekly Roundup

70 papers published across 8 labs.

17% acceleration

Selected Labs publishing this week

Meta AI4 Mila2 Tsinghua AI2 MIT CSAIL1 Amazon Science1

Top Papers

Mar 19, 2026

Indian Institute of Information Technology1w ago

ICE: Intervention-Consistent Explanation Evaluation with Statistical Grounding for LLMs

LLM explanation faithfulness varies wildly depending on how you test it, and might even be *anti*-faithful, so stop relying on single-intervention benchmarks.

Abhinaba Basu, Abhinaba Basu, Pavan Chakraborty +1

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought

Mar 25, 2026

MIT CSAIL1w ago

Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

LMs can learn to generate multiple plausible answers in a single forward pass, outperforming traditional single-answer models on tasks requiring distributional reasoning and offering a compute-efficient alternative to best-of-k sampling.

Isha Puri, Mehul Damani, Idan Shenfeld +3

Natural Language Processing Reasoning & Chain-of-Thought RLHF & Preference Learning

Mar 24, 2026

Royden Wagner +201w ago

LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset

Training data is not enough: reasoning traces from diverse cultural backgrounds are critical for safe and reliable autonomous driving in rare, long-tail scenarios.

Royden Wagner, O. Tas, Jaime Villa +18

Data Curation & Synthetic Data Multimodal Models Reasoning & Chain-of-Thought

Mar 23, 2026

B active) differ by an order of magnitude in active parameters. Conversely1w ago

Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?

Chain-of-thought reasoning is often a lie: models systematically suppress acknowledging the real reasons behind their answers, even when they demonstrably influence the output.

Richard J. Young

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought

Mar 20, 2026

Chiyu Ma +91w ago

FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

LLMs can reason through chains of thought 2.5x longer and achieve 8% higher accuracy on complex math problems by optimizing for token-level influence on future trajectory behavior.

Chiyu Ma, Shuo Yang, Kexin Huang +7

Reasoning & Chain-of-Thought RLHF & Preference Learning Training Efficiency & Optimization

All Papers (70)

Mar 25, 2026

MIT CSAIL1w ago

Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

Isha Puri, Mehul Damani, Idan Shenfeld +3

Natural Language Processing Reasoning & Chain-of-Thought RLHF & Preference Learning

Mar 24, 2026

Royden Wagner +201w ago

LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset

Training data is not enough: reasoning traces from diverse cultural backgrounds are critical for safe and reliable autonomous driving in rare, long-tail scenarios.

Royden Wagner, O. Tas, Jaime Villa +18

Data Curation & Synthetic Data Multimodal Models Reasoning & Chain-of-Thought

Mar 23, 2026

B active) differ by an order of magnitude in active parameters. Conversely1w ago

Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?

Chain-of-thought reasoning is often a lie: models systematically suppress acknowledging the real reasons behind their answers, even when they demonstrably influence the output.

Richard J. Young

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought

Mar 20, 2026

Chiyu Ma +91w ago

FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

LLMs can reason through chains of thought 2.5x longer and achieve 8% higher accuracy on complex math problems by optimizing for token-level influence on future trajectory behavior.

Chiyu Ma, Shuo Yang, Kexin Huang +7

Reasoning & Chain-of-Thought RLHF & Preference Learning Training Efficiency & Optimization

Mar 19, 2026

Gagan Bhatia +31w ago

What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

LLMs' temporal reasoning crumbles in low-resource languages and rarer calendar formats, not due to a lack of reasoning ability, but because poor tokenization fragments dates and times.

Gagan Bhatia, Ahmad Muhammad Isa, Maxime Peyrard +1

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Zhixing You +21w ago

D-Mem: A Dual-Process Memory System for LLM Agents

LLM agents can achieve near-perfect memory recall without prohibitive costs by strategically combining fast, lossy retrieval with slower, exhaustive deliberation.

Zhixing You, Jiachen Yuan, Jason Cai

Reasoning & Chain-of-Thought Recommendation & Information Retrieval Tool Use & Agents

Shaked Perek +41w ago

Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

Skip reinforcement learning and still get SOTA vision-language reasoning performance with a simple loss re-weighting scheme that cuts training time by 7x.

Shaked Perek, Ben Wiesel, Avihu Dekel +2

Multimodal Models Reasoning & Chain-of-Thought Training Efficiency & Optimization

1w ago

Context Bootstrapped Reinforcement Learning

Injecting demonstrations with a carefully annealed probability can drastically improve exploration in RLVR, even for tasks requiring novel reasoning or domain-specific knowledge.

Saaket Agashe, Jayanth Srinivasa, Gaowen Liu +4

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents+1

Zou Qiang +11w ago

Box Maze: A Process-Control Architecture for Reliable LLM Reasoning

LLMs can maintain reasoning boundaries with >99% reliability under adversarial attacks when equipped with explicit process-control layers, a massive improvement over standard RLHF.

Zou Qiang, Zou Qiang

Architecture Design (Transformers, SSMs, MoE)Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness

1w ago·also Penn State, Virginia Tech

Implicit Patterns in LLM-Based Binary Analysis

LLMs analyzing binaries aren't just spitting out tokens – they're exhibiting surprisingly structured reasoning patterns like "early pruning" and "targeted backtracking" that could revolutionize how we understand and control these systems.

Qiang Li, XiangRui Zhang, Haining Wang

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

D. Compagno +21w ago·also University of Bergen

Teleological Inference in Structural Causal Models via Intentional Interventions

Discovering an agent's hidden intentions is now possible by analyzing their interventions within a causal model, revealing the "why" behind their actions.

D. Compagno, Dario Compagno, Fabio Massimo Zennaro

Reasoning & Chain-of-Thought Tool Use & Agents World Models & Planning

Youngwan Lee +51w ago

MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model

Current VLMs struggle with multi-hop spatial reasoning, often failing to compose even simple spatial relations across multiple steps, highlighting a critical gap for real-world VLA agent deployment.

Youngwan Lee, Soojin Jang, Yoorhim Cho +3

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Xiaoyang Chen +11w ago

Can LLM generate interesting mathematical research problems?

LLMs can generate novel mathematical research problems in differential geometry that experts find both unknown and valuable, suggesting a new avenue for AI-assisted mathematical discovery.

Xiaoyang Chen, Xiang Jiang

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Scientific Discovery & Drug Design

1w ago·also Amazon Science, NSFC

MemMA: Coordinating the Memory Cycle through Multi-Agent Reasoning and In-Situ Self-Evolution

Memory-augmented LLMs get a strategic upgrade: MemMA uses multi-agent reasoning to proactively guide memory construction and repair, leading to significant performance gains.

Min Lin, Minhua Lin, Zhiwei Zhang +7

Reasoning & Chain-of-Thought Tool Use & Agents

1w ago

Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning

Strategic visual aids are the secret weapon for geometric reasoning, and this work shows how to teach MLLMs to wield them effectively via reinforcement learning.

Haokun Zhao, Wanshi Xu, Haidong Yuan +3

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Mila1w ago

Learning to Self-Evolve

Forget prompt engineering – LSE trains LLMs to self-edit their own contexts at test time, outperforming even GPT-5 and Claude Sonnet 4.5 in Text-to-SQL and question answering.

Xiaoyin Chen, Xiaoyin Chen, Canwen Xu +9

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Dimitrios Georgousis +51w ago

Evaluating Counterfactual Strategic Reasoning in Large Language Models

LLMs that appear strategically savvy in standard games often crumble when faced with slight rule changes, suggesting they're mimicking rather than truly reasoning.

Dimitrios Georgousis, Maria Lymperaiou, Angeliki Dimitriou +3

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

1w ago

Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models

Fine-tuning LVLMs on counting alone boosts general visual reasoning by over 1.5%, revealing counting as a surprisingly central skill.

Michelle Hurst

Interpretability & Mechanistic Interp Multimodal Models Reasoning & Chain-of-Thought

1w ago

Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs

Multimodal LLMs suffer a major performance hit when asked to switch from text-based to image-based tasks mid-conversation, revealing a surprising asymmetry in their ability to handle task interference.

Masayuki Kawarada, Masayuki Kawarada, Tatsuya Ishigaki +3

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Google Research1w ago·also University of Georgia, UT Austin, Vienna

Geography According to ChatGPT -- How Generative AI Represents and Reasons about Geography

ChatGPT's geographic reasoning can be surprisingly brittle, with minor syntactic changes causing significant output variations and task composition revealing unexpected distributional shifts.

Krzysztof Janowicz, Gengchen Mai, Rui Zhu +4

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Tsinghua AI1w ago·also Guangdong Laboratory of AI and Digital Economy (SZ), Independent Researcher, PolyU, SYSU +1

Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

MLLMs can ace the test, but still fail to *see*—they often succeed at complex reasoning with symbols while failing at basic symbol recognition, revealing a reliance on linguistic priors over true visual perception.

Yinghui Li, Jiayi Kuang, Peng Xing +11

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Haitian Li +101w ago

MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Unlock real-time 3D understanding: MonoArt achieves state-of-the-art monocular articulated object reconstruction without relying on multi-view data or external motion templates.

Haitian Li, Haozhe Xie, Haozhe Xie +8

Computer Vision Reasoning & Chain-of-Thought Robotics & Embodied AI

Maksym Del +91w ago·also University of Tartu

How Uncertainty Estimation Scales with Sampling in Reasoning Models

Two heads are better than one: combining verbalized confidence and self-consistency with just two samples dramatically boosts uncertainty estimation in reasoning models, beating either signal alone even with much larger sampling budgets.

Maksym Del, Markus Kängsepp, Markus Kangsepp +7

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Scaling Laws & Emergent Abilities

Xinghao Zhao1w ago

Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

LLMs' chain-of-thought reasoning is more reliable when the uncertainty (entropy) decreases consistently at each step, not just overall.

Xinghao Zhao

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Indian Institute of Information Technology1w ago

ICE: Intervention-Consistent Explanation Evaluation with Statistical Grounding for LLMs

LLM explanation faithfulness varies wildly depending on how you test it, and might even be *anti*-faithful, so stop relying on single-intervention benchmarks.

Abhinaba Basu, Abhinaba Basu, Pavan Chakraborty +1

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought

1w ago·also HKU

Parallelograms Strike Back: LLMs Generate Better Analogies than People

LLMs aren't just regurgitating facts; they're actually better at generating high-quality, relation-preserving word analogies than humans.

Qiawen Liu, Qiawen Ella Liu, Raja Marjieh +3

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Chenyang Gu +61w ago

MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models

LLMs can generate significantly more novel and technically rigorous scientific ideas by explicitly learning to reason from motivations to methodologies.

Chenyang Gu, Jiahao Cheng, Meicong Zhang +4

Reasoning & Chain-of-Thought Scientific Discovery & Drug Design Tool Use & Agents

Meta AI1w ago

TARo: Token-level Adaptive Routing for LLM Test-time Alignment

Achieve significant reasoning gains in frozen LLMs (+22.4%) without retraining by adaptively routing reward model guidance at the token level during inference.

Arushi Rai, Qiang Zhang, Hanqing Zeng +4

Inference & Quantization Reasoning & Chain-of-Thought RLHF & Preference Learning

Gabriele Carrino +61w ago

Are complicated loss functions necessary for teaching LLMs to reason?

Stripping away the complexity of GRPO reveals that simple REINFORCE with group relative advantage can actually *improve* LLM reasoning, challenging the assumption that sophisticated loss functions are always better.

Gabriele Carrino, Andrea Sassella, Nicolò Brunello +4

Reasoning & Chain-of-Thought RLHF & Preference Learning Training Efficiency & Optimization

1w ago

ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis

Achieve topologically coherent coronary vessel segmentation by directly optimizing for geometric structure, rather than pixel-wise accuracy, using preference-based learning.

Zhanpeng Jin, Zhan Jin, Yu Luo +7

Computer Vision Reasoning & Chain-of-Thought RLHF & Preference Learning+1

Bishoy M. Galoaa +41w ago

Motion-o: Trajectory-Grounded Video Reasoning

Visual language models can now explicitly reason about object trajectories in videos, thanks to a simple yet effective method that augments training data and uses discrete motion tags.

Bishoy M. Galoaa, Bishoy Galoaa, Shayda Moezzi +2

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

CMU ML1w ago

ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs

Even GPT-5 and Gemini 2.5 Pro still fail to efficiently couple reasoning with tool use, requiring up to 2.7x more tool calls than theoretically optimal in a new diagnostic environment.

Wanjia Zhao, Ludwig Schmidt, James Zou +2

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

Nitay Alon +41w ago

Proceedings of the 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind

A snapshot of the cutting-edge research uniting Theory of Mind and AI, all in one open-access collection.

Nitay Alon, Joseph M. Barnby, Reuth Mirsky +2

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Yinan Xia +31w ago

Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning

LRMs can be made more efficient and accurate by strategically adjusting their output length based on task difficulty, leading to a better accuracy-length trade-off.

Yinan Xia, Haotian Zhang, Huiming Wang +1

Reasoning & Chain-of-Thought RLHF & Preference Learning Training Efficiency & Optimization

Yan Shu +81w ago

TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Pixel-perfect geospatial reasoning is now possible, thanks to a vision-language model that adaptively fuses multi-modal and multi-temporal Earth observation data.

Yan Shu, Bin Ren, B. Ren +6

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

D. Ben-Ami +31w ago·also Ben-Gurion University of the Negev

HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

Get GPT-4o-level long-video QA performance with 10x fewer FLOPs by using a hierarchical, training-free frame selector that combines multimodal experts and fuzzy logic.

D. Ben-Ami, Gabriele Serussi, Kobi Cohen +1

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

1w ago·also Radboud, UvA

Total Recall QA: A Verifiable Evaluation Suite for Deep Research Agents

Current benchmarks fail to rigorously evaluate deep research agents, but a new framework leveraging structured knowledge bases and synthetic data offers a verifiable and scalable solution.

Mahta Rafiee, Heydar Soudani, Zahra Abbasiantaeb +3

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Recommendation & Information Retrieval+1

Huichi Zhou +171w ago·also Jilin Univerisity

Memento-Skills: Let Agents Design Agents

Forget hand-crafting agents: Memento-Skills lets a generalist LLM agent autonomously design and improve specialized agents through experience, achieving substantial gains on complex benchmarks.

Huichi Zhou, Siyuan Guo, Anjie Liu +15

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Meta AI1w ago·also CMU ML, CAS, UNC

Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

On-policy reward modeling with LLM judges not only unlocks significant performance gains on complex mathematical reasoning tasks, but also generalizes to improve performance on simpler numerical and multiple-choice benchmarks.

Pranjal Aggarwal, Marjan Ghazvininejad, Seungone Kim +22

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought RLHF & Preference Learning

NVIDIA1w ago·also HKUST, Waterloo

Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

A 30B MoE model can now achieve Gold Medal-level performance in IMO, IOI, and ICPC, rivaling frontier models with 20x more parameters.

Zhuolin Yang, Zhuoling Yang, Zihan Liu +29

Code Generation & Program Synthesis Reasoning & Chain-of-Thought RLHF & Preference Learning

1w ago

Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval

Stop retrieving background noise: HCQR refines RAG by generating targeted queries that seek evidence to directly support or refute candidate answers.

Hangeol Chang, Changsun Lee, Changsu Lee +5

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Xiao Feng +71w ago

RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

Skip the expensive reward model: RewardFlow distills sparse task rewards into dense, state-level signals by propagating credit through the topology of LLM reasoning trajectories.

Xiao Feng, Bo Han, Zhanke Zhou +5

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Yogesh Agrawal +51w ago

FinTradeBench: A Financial Reasoning Benchmark for LLMs

LLMs still struggle to reason about financial time-series data, even when they ace the textual fundamentals.

Yogesh Agrawal, Aniruddha Dutta, Mahadi Hasan +3

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

KT Tech innovation Group1w ago

Mi:dm K 2.5 Pro

Forget scaling laws: Mi:dm K 2.5 Pro proves that targeted training pipelines and data curation can enable a 32B parameter model to achieve state-of-the-art performance in enterprise reasoning tasks, especially in low-resource languages like Korean.

KT Tech innovation Group

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

Mar 18, 2026

Young-Bin Park +12w ago

Graph-Native Cognitive Memory for AI Agents: Formal Belief Revision Semantics for Versioned Memory Architectures

Forget finetuning – Kumiho's graph-native memory lets you swap in a better LLM and instantly double your agent's reasoning accuracy on complex cognitive tasks.

Young-Bin Park, Young Bin Park

Architecture Design (Transformers, SSMs, MoE)Reasoning & Chain-of-Thought Tool Use & Agents

Tianhui Zhang +22w ago

Synthetic Data Generation for Training Diversified Commonsense Reasoning Models

Training on synthetically generated data can significantly boost both the diversity and quality of commonsense reasoning in LLMs, outperforming models trained on scarce human-annotated data.

Tianhui Zhang, Bei Peng, D. Bollegala

Data Curation & Synthetic Data Natural Language Processing Reasoning & Chain-of-Thought

Guangzhi Wang +32w ago

CRE-T1 Preview Technical Report: Beyond Contrastive Learning for Reasoning-Intensive Retrieval

Ditch static embeddings: Generative retrieval, powered by reinforcement learning, lets models dynamically reason about relevance, outperforming larger contrastively-trained models on reasoning-intensive tasks.

Guangzhi Wang, Ying Jiao, Yinghao Jiao +1

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval

2w ago

A Unified Language Model for Large Scale Search, Recommendation, and Reasoning

Forget tool-augmented systems: NEO shows you can consolidate search, recommendation, and reasoning into a single language-steerable LLM by representing items as SIDs and interleaving them with natural language.

Marco De Nadai, Edoardo D'Amico, Max Lefarov +23

Reasoning & Chain-of-Thought Recommendation & Information Retrieval Tool Use & Agents

2w ago·also Netease Yidun AI Lab ∗Equal contribution

SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition

Achieve state-of-the-art fine-grained visual recognition without training by adaptively invoking reasoning in a Large Vision-Language Model only when needed, significantly reducing computational overhead.

Jingxiao Yang, DaLin He, Miao Pan +6

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

2w ago

Proactive Knowledge Inquiry in Doctor-Patient Dialogue: Stateful Extraction, Belief Updating, and Path-Aware Action Planning

Instead of passively transcribing doctor-patient dialogues, this system actively models what's known, what's missing, and what questions to ask next, paving the way for more intelligent EMR systems.

Zhenhai Pan, Yan Liu, Jia You

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Chengwei Wei +42w ago

InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning

LLMs can achieve state-of-the-art reasoning accuracy with significantly fewer tokens by rewarding intermediate reasoning steps that maximize information gain and maintain monotonic progress.

Chengwei Wei, Jung-jae Kim, Longyin Zhang +2

Reasoning & Chain-of-Thought RLHF & Preference Learning Training Efficiency & Optimization

2w ago·also NTU, UQ

Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

Chain-of-thought prompting makes large language models smarter, but it also makes them less safe, a problem this paper tackles by forcing models to think about safety *before* reasoning.

Jianan Chen, Zhifang Zhang, Shuo He +3

Constitutional AI & AI Ethics Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness

2w ago·also Shenzhen University

TRiMS: Real-Time Tracking of Minimal Sufficient Length for Efficient Reasoning via RL

LLMs can slash over 80% of their chain-of-thought tokens with a minor accuracy boost, thanks to a new RL-based method that targets the "Minimal Sufficient Length" of reasoning.

Tingcheng Bian, Jinchang Luo, Mingquan Cheng +5

Reasoning & Chain-of-Thought Training Efficiency & Optimization

Haozheng Luo +32w ago

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

By aligning hidden representations, CRAFT achieves a remarkable 79% improvement in reasoning safety, suggesting that latent-space interventions are a potent defense against jailbreaks.

Haozheng Luo, Yimin Wang, Jiahao Yu +1

Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness RLHF & Preference Learning

Aivo Olev +22w ago·also TalTech

Multi-Source Evidence Fusion for Audio Question Answering

Grounding LALM reasoning in diverse, reliability-weighted acoustic evidence blows away the competition in Audio Question Answering, proving that verifiable chains beat black boxes.

Aivo Olev, Tanel Alumae, Tanel Alumäe

Reasoning & Chain-of-Thought Speech & Audio Tool Use & Agents

Risham Sidhu +32w ago

Grid Spatial Understanding: A Dataset for Textual Spatial Reasoning over Grids, Embodied Settings, and Coordinate Structures

LLMs struggle with spatial reasoning in embodied settings and 3D structure identification even when exposed to visual modalities, but fine-tuning smaller models offers a surprisingly effective alternative to brute-force scaling.

Risham Sidhu, Risham Sidhu, Julia Hockenmaier +1

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Robotics & Embodied AI

Kevin Qu +52w ago

Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Forget expensive 3D training data: Loc3R-VLM shows how to give 2D vision-language models strong 3D spatial reasoning by distilling knowledge from a pretrained 3D foundation model using only monocular video.

Kevin Qu, Haozhe Qi, Mihai Dusmanu +3

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Tsinghua AI2w ago·also Meta AI, Mila

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

Mimicking human cognition, FLAIR lets dialogue models "think while listening," boosting performance without adding latency.

Donghang Wu, Tianyu Zhang, Yuxin Li +6

Natural Language Processing Reasoning & Chain-of-Thought Speech & Audio

Akshat Rana +32w ago

SG-CoT: An Ambiguity-Aware Robotic Planning Framework using Scene Graph Representations

Scene graphs plus LLMs let robots ask clarifying questions, boosting multi-agent task success by 15%.

Akshat Rana, Peeyush Agarwal, Krishan Rana +1

Reasoning & Chain-of-Thought Robotics & Embodied AI Tool Use & Agents

Zichen Xie +12w ago

Can LLMs Reason Like Automated Theorem Provers for Rust Verification? VCoT-Bench: Evaluating via Verification Chain of Thought

LLMs can't reason their way through Rust verification, struggling to complete proofs even with substantial hints, revealing a critical gap in their ability to handle the rigorous demands of secure software development.

Zichen Xie, Wenxi Wang

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

2w ago

CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution

LLMs can escape the trap of confidently wrong reasoning by co-evolving a generator and verifier from a single model, bootstrapping each other to break free from flawed consensus.

Teng Pan, Tengyu Pan, Yuchen Yan +7

Reasoning & Chain-of-Thought RLHF & Preference Learning

2w ago

Retrieval-Augmented LLM Agents: Learning to Learn from Experience

Retrieval-augmented LLM agents can learn to learn from experience, achieving significantly better generalization on unseen tasks by combining the strengths of fine-tuning and in-context retrieval.

Thomas Palmeira Ferraz, Romain Deffayet, Vassilina Nikoulina +2

Reasoning & Chain-of-Thought Recommendation & Information Retrieval Tool Use & Agents

2w ago

PJB: A Reasoning-Aware Benchmark for Person-Job Retrieval

Stop chasing leaderboard gains on generic benchmarks: PJB reveals that domain-specific weaknesses in person-job retrieval far outweigh the benefits of general model upgrades, and that query understanding modules can actually hurt performance.

Guangzhi Wang, Xiaohui Yang, Kai Li +4

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Hyun Ryu +52w ago·also KAIST

Argument Reconstruction as Supervision for Critical Thinking in LLMs

Training LLMs to reconstruct arguments boosts their critical thinking abilities across diverse tasks, suggesting a promising new direction for imbuing reasoning skills.

Hyun Ryu, Gyouk Chu, Gregor Betz +3

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Mohsen Arjmandi2w ago

Sensi: Learn One Thing at a Time -- Curriculum-Based Test-Time Learning for LLM Game Agents

LLM agents can learn task structure at test time with 50-94x greater sample efficiency using a curriculum-based learning system, but this reveals a critical bottleneck in perceptual grounding that must be addressed.

Mohsen Arjmandi

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

Jingchun Yang is with Northeast2w ago

Interpretable Traffic Responsibility from Dashcam Video via Legal Multi Agent Reasoning

Dashcam videos can now be directly linked to legal responsibility determinations via a novel multimodal dataset and legal reasoning framework, outperforming existing LLMs and agent-based systems.

Jingchun Yang, Jinchang Zhang

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

2w ago·also Meta AI, UNC

Text-to-Stage: Spatial Layouts from Long-form Narratives

LLMs can now infer plausible stage layouts from unstructured text alone, opening up new possibilities for automated media production.

Jefferson Hernandez, Swarnadeep Saha, Chenxi Whitehouse +8

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

2w ago·also PolyU

AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse

Forget prompt engineering: AgentFactory lets LLM agents self-evolve by accumulating and refining executable Python subagents, making task re-execution more reliable and efficient.

Zhang Zhang, Shuqi Lu, Hongjin Qian +2

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

2w ago·also Didichuxing Co. Ltd

A Progressive Visual-Logic-Aligned Framework for Ride-Hailing Adjudication

An 8B parameter model, RideJudge, outperforms 32B baselines in ride-hailing dispute adjudication by aligning visual semantics with evidentiary protocols, achieving 88.41% accuracy.

Weiming Wu, Zi-Jian Cheng, Jie Meng +6

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Corentin Royer +52w ago

Process Supervision for Chain-of-Thought Reasoning via Monte Carlo Net Information Gain

Forget expensive human annotations – this new method uses information theory to automatically score each step of an LLM's reasoning process, making chain-of-thought supervision scalable and efficient.

Corentin Royer, D. Bhattacharjya, Debarun Bhattacharjya +3

Reasoning & Chain-of-Thought RLHF & Preference Learning