April 20 – April 27, 2026

Reasoning & Chain-of-Thought - Weekly Roundup

100 papers published across 5 labs.

397% acceleration

Selected Labs publishing this week

Tsinghua AI (3), Stanford HAI (1), DAMO (1), ETH (1), Google Research (1)

Top Papers

Apr 27, 2026

Pampanga State University·Apr 27, 2026·also College of Computing Studies, Don Honorio Ventura State University, National University, University of the East

Towards the Development of Detection of Learned Helplessness in Mathematics: Design and Data Collection Challenges from a Developing Country Perspective

Building AI tutors in the real world is hard: outdated tech, spotty internet, and curriculum gaps can derail even the best-designed systems.

John Paul P. Miranda, J. P. P. Miranda, Rex P. Bringula +13

Natural Language Processing Reasoning & Chain-of-Thought

Iizalaarab Elhaimeur +3·Apr 27, 2026

From Prototype to Classroom: An Intelligent Tutoring System for Quantum Education

Quantum education gets a boost: specialized LLM agents in a classroom setting not only improve tutoring reliability but also reveal hidden curriculum gaps.

Iizalaarab Elhaimeur, Nikos Chrisochoides +1

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Apr 27, 2026·also Tsinghua AI

From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

LLM agents can better discover and assess risks of skills when those skills are represented in a structured format that explicitly represents scheduling, execution structure, and logic, rather than relying on unstructured text.

Qiliang Liang, Hansi Wang, Zhongzhi Liang +1

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Apr 23, 2026

Apr 23, 2026·also Meituan

Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

Test-time RL's vulnerability to noisy pseudo-labels is amplified by group-relative advantage estimation, but can be mitigated with a surprisingly simple debiasing and denoising approach.

Yongcan Yu, Lingxiao He, Jian Liang +5

Reasoning & Chain-of-Thought RLHF & Preference Learning

Meghyn Bienvenu +3·Apr 23, 2026

Using ASP(Q) to Handle Inconsistent Prioritized Data

Finally, a practical implementation for globally-optimal repair-based semantics allows for querying inconsistent prioritized data with theoretical guarantees.

Meghyn Bienvenu, Camille Bourgaux, Robin Jean +1

Natural Language Processing Reasoning & Chain-of-Thought

All Papers (100)

Apr 27, 2026

Apr 27, 2026·also Tsinghua AI

From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

Qiliang Liang, Hansi Wang, Zhongzhi Liang +1

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Shiyi Zhang +10·Apr 27, 2026

Meta-CoT: Enhancing Granularity and Generalization in Image Editing

Decomposing image editing tasks into meta-tasks and aligning model reasoning with editing behavior unlocks surprising generalization to unseen editing operations.

Shiyi Zhang, Yiji Cheng, Tiankai Hang +8

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Abhijay Deevi +5·Apr 27, 2026

CAN-QA: A Question-Answering Benchmark for Reasoning over In-Vehicle CAN Traffic

LLMs can parrot CAN bus data, but CAN-QA reveals they fail at the temporal reasoning and multi-condition inference needed for real-world vehicle security forensics.

Abhijay Deevi, Onat Gungor +3

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Shiyi Du +8·Apr 27, 2026

Why Search When You Can Transfer? Amortized Agentic Workflow Design from Structural Priors

Forget expensive per-task search: agentic workflows can be synthesized in a single LLM pass by transferring learned structural priors, slashing optimization costs by 3 orders of magnitude.

Shiyi Du, Jiayuan Liu, Weihua Du +6

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Sreehari Sankar +10·Apr 27, 2026

Analyzing LLM Reasoning to Uncover Mental Health Stigma

LLMs harbor surprisingly nuanced and pervasive mental health stigma, revealed only by dissecting their reasoning steps, not just their final answers.

Sreehari Sankar, Aliakbar Nafar, M. Barman +8

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Apr 27, 2026·also DFKI

Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models

RL's superior generalization isn't about brute force, but about carefully sculpting a few key features while preserving the base model's knowledge, unlike SFT's rapid specialization.

Dan Shi, S. Ostermann, Renren Jin +2

Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought RLHF & Preference Learning

Sercan Karakaş +1·Apr 27, 2026

Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation

LLMs fail to reliably track source trustworthiness in Turkish evidential marking, unlike humans, highlighting a critical gap in their ability to perform nuanced reasoning based on source reliability.

Sercan Karakaş, Yusuf Şimşek

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Daneshvar Amrollahi +2·Apr 27, 2026

Faithful Autoformalization via Roundtrip Verification and Repair

LLMs can now formalize natural language with significantly higher fidelity, thanks to a clever roundtrip verification method that self-diagnoses and repairs translation errors.

Daneshvar Amrollahi, Jerry Lopez, Clark W. Barrett

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Sagnik Chatterjee +2·Apr 27, 2026

Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs

Small language models can achieve reasoning performance rivaling larger models, even under tight token budgets, by using a lightweight "guidance track" to strategically prune and refine their chain-of-thought reasoning.

Sagnik Chatterjee, Atharva Patil, S. Ramesh

Inference & Quantization Natural Language Processing Reasoning & Chain-of-Thought

Apr 27, 2026

Don't Stop Early: Scalable Enterprise Deep Research with Controlled Information Flow and Evidence-Aware Termination

Dependency-controlled context and explicit evidence sufficiency criteria are key to preventing premature stopping and improving the consistency of enterprise research outputs.

Prafulla Kumar Choubey, Kung-Hsiang Huang, P. Venkit +4

Reasoning & Chain-of-Thought Recommendation & Information Retrieval Tool Use & Agents

Apr 27, 2026

Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination

LLMs still can't pass history class: even state-of-the-art models struggle with complex historical reasoning, as revealed by a new benchmark based on the Chinese Imperial Examination.

Lirong Gao, Zeqing Wang, Yuyan Cai +6

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Soyeon Kim +5·Apr 27, 2026

K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology

Scaling up LLMs doesn't guarantee expertise: Korean-specific models beat larger global models on a new meteorology benchmark, exposing critical gaps in multimodal reasoning and cultural understanding.

Soyeon Kim, Cheon-kyu Kang, Myeongjin Lee +3

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Pampanga State University·Apr 27, 2026·also College of Computing Studies, Don Honorio Ventura State University, National University, University of the East

Towards the Development of Detection of Learned Helplessness in Mathematics: Design and Data Collection Challenges from a Developing Country Perspective

Building AI tutors in the real world is hard: outdated tech, spotty internet, and curriculum gaps can derail even the best-designed systems.

John Paul P. Miranda, J. P. P. Miranda, Rex P. Bringula +13

Natural Language Processing Reasoning & Chain-of-Thought

Iizalaarab Elhaimeur +3·Apr 27, 2026

From Prototype to Classroom: An Intelligent Tutoring System for Quantum Education

Quantum education gets a boost: specialized LLM agents in a classroom setting not only improve tutoring reliability but also reveal hidden curriculum gaps.

Iizalaarab Elhaimeur, Nikos Chrisochoides +1

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Zijun Feng +6·Apr 27, 2026·also School of Cyber Science and Technology, SYSU

GoAT-X: A Graph of Auditing Thoughts for Securing Token Transactions in Cross-Chain Contracts

LLMs can now audit cross-chain smart contracts with expert-level precision, achieving 95% coverage of vulnerable projects by explicitly mirroring human reasoning processes.

Zijun Feng, Yuming Feng, Yu Wang +4

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness

Srita Padmanabhuni +4·Apr 27, 2026

FGDM: Reasoning Aware Multi-Agentic Framework for Software Bug Detection using Chain of Thought and Tree of Thought Prompting

LLMs can find and fix bugs in complex codebases far better when structured as a team of reasoning agents, outperforming existing methods by a large margin.

Srita Padmanabhuni, Bhargavi Karuturi, Jerusha Karen Indupalli +2

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Fondazione Bruno Kessler·Apr 27, 2026·also IISc

Logic of Fuzzy Paths

Separating geometry from logic with fuzzy path constraints yields motion planning specifications that are both more intuitive for humans and more amenable to learning from demonstrations.

K. Grover, Pratham Gupta, Jan Křetínský

Reasoning & Chain-of-Thought Robotics & Embodied AI World Models & Planning

Zhuoling Li +3·Apr 27, 2026

XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

GraphRAG's black-box reasoning gets a spotlight: XGRAG reveals how specific knowledge graph components influence LLM outputs, boosting explanation quality by 14.81% over standard RAG explainability methods.

Zhuoling Li, Ha Nguyen, Valeria Bladinieres +1

Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Apr 27, 2026·also ZJU

Improving Vision-language Models with Perception-centric Process Reward Models

VLMs can be taught to self-correct hallucinations at the token level, leading to substantial gains in reasoning accuracy across diverse benchmarks.

Yingqian Min, Kun Zhou, Yifan Li +6

Multimodal Models Reasoning & Chain-of-Thought RLHF & Preference Learning

Apr 27, 2026·also Fudan, Michigan State, XJTU, ZJU

SEARCH-R: Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator for Multi-hop Question Answering

Stop relying on LLMs to "hallucinate" reasoning paths – SEARCH-R uses a fine-tuned Llama3.1-8B model and dependency tree-based retrieval to navigate multi-hop question answering more reliably.

Yuqing Fu, Yimin Deng +13

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Apr 24, 2026

Stanford HAI·Apr 24, 2026

Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

LLMs can't handle the truth: SLIDERS beats GPT-4.1 on long-context QA by sidestepping the context window entirely.

Harshit Joshi, Priyank Shethia, Jadelynn Dao +1

Natural Language Processing Reasoning & Chain-of-Thought

Shaoang Li +12·Apr 24, 2026

Learning Evidence Highlighting for Frozen LLMs

Highlighting pivotal evidence can boost LLM performance without altering the original context, leading to substantial improvements in reasoning tasks.

Shaoang Li, Yanhang Shi, Yufei Li +10

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Apr 23, 2026

Ceyuan Yang +19·Apr 23, 2026

Context Unrolling in Omni Models

Training a single model across text, images, video, 3D geometry, and hidden representations unlocks "Context Unrolling," where the model reasons across modalities to improve reasoning fidelity.

Ceyuan Yang, Zhijie Lin, Yang Zhao +17

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Reasoning & Chain-of-Thought

Kaitlin Gili +3·Apr 23, 2026

Locating acts of mechanistic reasoning in student team conversations with mechanistic machine learning

Inductive biases make machine learning models better at spotting mechanistic reasoning in student discussions, even when those students are tackling new problems.

Kaitlin Gili, Mainak Nistala, Kristen Wendell +1

Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought

Buqiang Xu +6·Apr 23, 2026

StructMem: Structured Memory for Long-Horizon Behavior in LLMs

LLMs can now reason across long conversations without breaking the bank: StructMem slashes token usage and API calls while boosting temporal reasoning.

Buqiang Xu, Yijun Chen, Jizhan Fang +4

Architecture Design (Transformers, SSMs, MoE) Reasoning & Chain-of-Thought Tool Use & Agents

Maximilian Westermann +8·Apr 23, 2026·also University of Mines and Technology, Vela Partners

CoFEE: Reasoning Control for LLM-Based Feature Discovery

LLMs generate better features when you make them think harder: CoFEE enforces cognitive behaviors like backward chaining and subgoal decomposition, boosting feature quality by 15% while slashing costs.

Maximilian Westermann, Ben Griffin, Aaron Ontoyin Yin +6

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Apr 23, 2026

Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning

Forget memorizing table headers: TaNOS unlocks surprisingly robust numerical reasoning by pre-training on operation sketches and correctness-guaranteed programs.

H. Cho, Gahyun Yoo, H. Kim +1

Data Curation & Synthetic Data Natural Language Processing Reasoning & Chain-of-Thought

Duanyang Yuan +9·Apr 23, 2026

Decoupled Travel Planning with Behavior Forest

LLMs can plan complex trips far more effectively when their reasoning is structured as a "forest" of parallel behavior trees, each handling a subtask and coordinated globally.

Duanyang Yuan, Sihang Zhou, Yanning Hou +7

Reasoning & Chain-of-Thought Tool Use & Agents World Models & Planning

Donggyu Lee +6·Apr 23, 2026

Ideological Bias in LLMs' Economic Causal Reasoning

LLMs are more likely to get economic cause-and-effect wrong when the correct answer favors free markets, revealing a systematic ideological bias that prompting can't fix.

Donggyu Lee, H. Yun, Jungwon Kim +4

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Apr 23, 2026·also Meituan

Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

Test-time RL's vulnerability to noisy pseudo-labels is amplified by group-relative advantage estimation, but can be mitigated with a surprisingly simple debiasing and denoising approach.

Yongcan Yu, Lingxiao He, Jian Liang +5

Reasoning & Chain-of-Thought RLHF & Preference Learning

C. Tan +2·Apr 23, 2026

Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

LLMs can achieve a form of self-programming by integrating crowdsourced learning and human creativity to iteratively refine their own game-playing logic.

C. Tan, Yuchen Wang, Shangxin Guo

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Apr 23, 2026

GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion

Forget flat numerical compression – GS-Quant unlocks better knowledge graph completion by generating discrete codes that mirror the hierarchical nature of human reasoning.

Qizhuo Xie, Yunhui Liu, Yuecheng Xing +4

Inference & Quantization Natural Language Processing Reasoning & Chain-of-Thought

Ye Yu +5·Apr 23, 2026

Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

Ditch the fixed interface: DiffMAS unlocks surprisingly large gains in multi-agent reasoning by jointly optimizing latent communication, outperforming text-based and prior latent methods by a wide margin.

Ye Yu, Heming Liu, Haibo Jin +3

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Yvon K. Awuklu +4·Apr 23, 2026

Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications

A novel logic-based approach makes inferring complex, temporally-extended events from timestamped data tractable, even in the messy real-world of medical records.

Yvon K. Awuklu, Meghyn Bienvenu, Katsumi Inoue +2

Reasoning & Chain-of-Thought Scientific Discovery & Drug Design

Apr 23, 2026

Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

LLMs can be both faster and smarter: pre-learned reasoning skills cut down token usage while boosting accuracy on coding and math problems.

Guangxiang Zhao, Qi Shi, Xusen Xiao +3

Inference & Quantization Reasoning & Chain-of-Thought Tool Use & Agents

Hao-Yuan Chen·Apr 23, 2026

Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

Forget chain-of-thought prompting – iterative refinement guided by structured verbal critique from a stronger LLM can achieve SOTA reasoning performance without any training.

Hao-Yuan Chen

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Nevena Lazić +3·Apr 23, 2026

To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning

Unseen token generalization in transformers isn't just about copying; it's fundamentally limited by a representational collapse in the unembedding space.

Nevena Lazić, Liam H. Fowl, András György +1

Architecture Design (Transformers, SSMs, MoE) Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Meghyn Bienvenu +3·Apr 23, 2026

Using ASP(Q) to Handle Inconsistent Prioritized Data

Finally, a practical implementation for globally-optimal repair-based semantics allows for querying inconsistent prioritized data with theoretical guarantees.

Meghyn Bienvenu, Camille Bourgaux, Robin Jean +1

Natural Language Processing Reasoning & Chain-of-Thought

Yuehan Zhu +4·Apr 23, 2026

HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

Forget rigid workflows: HiCrew's planning layer dynamically orchestrates agents for video understanding, adapting roles and execution paths to the nuances of each question.

Yuehan Zhu, Jingqi Zhao, Jiawen Zhao +2

Multimodal Models Reasoning & Chain-of-Thought Tool Use & Agents

Shivam Rawat +3·Apr 23, 2026

Reasoning Primitives in Hybrid and Non-Hybrid LLMs

Hybrid architectures that combine attention and recurrence can maintain reasoning performance as task complexity increases, while transformers see a sharp performance drop-off.

Shivam Rawat, Lucie Flek, Florian Mai +1

Architecture Design (Transformers, SSMs, MoE) Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

B. Lim +3·Apr 23, 2026

VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought

Forget hand-annotated visual reasoning datasets: VG-CoT leverages a fully automated pipeline to generate grounded, step-by-step reasoning, enabling scalable and cost-efficient training of more trustworthy LVLMs.

B. Lim, Kyeonghyun Kim, Jung-Shin Yun +1

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Mohit Vaishnav +1·Apr 23, 2026

Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning

VLMs' struggles with abstract visual reasoning aren't primarily due to weak reasoning, but rather a representational bottleneck in extracting the right symbolic information from pixels.

Mohit Vaishnav, T. Tammet

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Apr 23, 2026·also Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Shaanxi Province Key Laboratory of Big Data Knowledge Engineering

OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

Even the most advanced LLMs like GPT-5.2 and Gemini-3 stumble on complex optimization problems, achieving only 27% accuracy on a new benchmark spanning stochastic, dynamic, and game optimization.

Xinyu Zhang, Boxuan Zhang, Yuchen Wan +5

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

J. Acuña·Apr 23, 2026

EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval

Structured graph memory can outperform full-context prompting for cross-session LLM reasoning, but optimizing for specific reasoning skills can hurt overall performance.

J. Acuña

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Yanjiao Liu +3·Apr 23, 2026

Frozen LLMs as Map-Aware Spatio-Temporal Reasoners for Vehicle Trajectory Prediction

Frozen LLMs, when fused with spatial scene encodings, can effectively reason about vehicle trajectories, opening new avenues for integrating language-based reasoning into autonomous driving systems.

Yanjiao Liu, Jiawei Liu, Xun Gong +1

Reasoning & Chain-of-Thought Robotics & Embodied AI Tool Use & Agents

Qingxiao Li +6·Apr 23, 2026

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

Scientific reasoning gets a visual upgrade: S1-VL lets models "think with images" by writing and executing Python code to manipulate visuals during multi-step problem solving.

Qingxiao Li, Lifeng Xu, Qinglin Wang +4

Code Generation & Program Synthesis Multimodal Models Reasoning & Chain-of-Thought

Chanhong Hwang +4·Apr 23, 2026

SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

Spatial reasoning gets a boost: a new framework dynamically orchestrates vision-language agents at test time, outperforming fixed-pipeline approaches by adapting to the reliability of different spatial cues.

Chanhong Hwang, Miso Choi, S. On +2

Multimodal Models Reasoning & Chain-of-Thought Tool Use & Agents

Szymon Rusiecki +5·Apr 23, 2026

A Bayesian Reasoning Framework for Robotic Systems in Autonomous Casualty Triage

Expert knowledge, encoded in a Bayesian network, can dramatically improve the accuracy of autonomous robotic triage systems operating in chaotic, data-scarce environments.

Szymon Rusiecki, C. Morales, Pia Story +3

Computer Vision Reasoning & Chain-of-Thought Robotics & Embodied AI

Yanjun Zhao +8·Apr 23, 2026·also University of Illinois Urbana-Champaign

PAPERMIND: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs

Current multimodal LLMs still struggle to integrate information and reason critically when assessed on real scientific papers, despite progress on isolated tasks.

Yanjun Zhao, Tianxin Wei, Jiaru Zou +6

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

DAMO·Apr 23, 2026

ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs

LLMs can now directly predict geographic coordinates with high accuracy, even for vague locations and complex regions, bypassing the need for traditional geocoding pipelines.

Gong Wenbin

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Lezhi Ma +5·Apr 23, 2026

SpecSyn: LLM-based Synthesis and Refinement of Formal Specifications for Real-world Program Verification

LLMs can now automatically generate formal specifications for real-world programs with high precision and recall, thanks to a novel specification refinement mechanism that leverages program mutations.

Lezhi Ma, Shangqing Liu, Yi Li +3

Code Generation & Program Synthesis Reasoning & Chain-of-Thought

Apr 23, 2026

Planning Beyond Text: Graph-based Reasoning for Complex Narrative Generation

LLMs can write better stories if they plan the plot on a graph first.

Hanwen Gu, Chao Guo, Junle Wang +2

Natural Language Processing Reasoning & Chain-of-Thought World Models & Planning

Shan Dong +5·Apr 23, 2026

On Reasoning Behind Next Occupation Recommendation

Fine-tuning a single LLM to both reason about and predict future occupations surprisingly beats using two separate fine-tuned LLMs for each task.

Shan Dong, P. Achananuparp, Hieu-Hien Mai +3

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Apr 23, 2026·also Xiaohongshu

Language as a Latent Variable for Reasoning Optimization

LLMs can reason better when they're not forced to answer in English, and a new RL method leverages this quirk to boost performance across reasoning tasks.

Linjuan Wu, Haoran Wei, Jialong Tang +4

Natural Language Processing Reasoning & Chain-of-Thought

Zhiqiu Xu +3·Apr 23, 2026

MathDuels: Evaluating LLMs as Problem Posers and Solvers

LLMs that ace math exams can still be stumped by problems crafted by other LLMs, revealing a surprising gap between solving and problem-posing abilities.

Zhiqiu Xu, Shibo Jin, Shreyash Arya +1

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

M. Cramer +1·Apr 23, 2026

Satisfying Rationality Postulates of Structured Argumentation Through Deductive Support -- Technical Report

Finally, a structured argumentation framework that doesn't break basic logical rules!

M. Cramer, Tom Friese

Natural Language Processing Reasoning & Chain-of-Thought

Yitong Zhou +4·Apr 23, 2026

GeoMind: An Agentic Workflow for Lithology Classification with Reasoned Tool Invocation

Lithology classification gets a reasoning upgrade: GeoMind's agentic workflow beats static methods by grounding decisions in geological evidence and constraints.

Yitong Zhou, Mingyue Cheng, Jiahao Wang +2

Reasoning & Chain-of-Thought Scientific Discovery & Drug Design Tool Use & Agents

Apr 22, 2026

Apr 22, 2026·also Meituan

V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

Open-source MLLMs can now achieve state-of-the-art accuracy on complex tabular reasoning tasks, even outperforming models 18x their size, by explicitly penalizing visual hallucinations and shortcut guessing through process-supervised RL.

Yubo Jiang, Yitong An, Abudukelimu Wuerkaixi +3

Multimodal Models Reasoning & Chain-of-Thought Tool Use & Agents

Apr 22, 2026·also Tsinghua AI, Huawei, Shenzhen University

GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

LLMs can reason more effectively by directly tracking their own belief in the correct answer throughout the reasoning process, enabling more targeted policy updates.

Jingyi Wang, Tengjin Weng, Song-Li Wu +7

Reasoning & Chain-of-Thought RLHF & Preference Learning

Benjamin Hollering +2·Apr 22, 2026

Efficient Symbolic Computations for Identifying Causal Effects

Identifying causal effects can now be achieved in quasi-polynomial time, transforming the feasibility of causal inference in complex datasets.

Benjamin Hollering, Pratik Misra, Nils Sturma

Reasoning & Chain-of-Thought Scientific Discovery & Drug Design

Apr 22, 2026·also ANU

DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

LLMs can pinpoint mental states but falter at predicting dialogue trajectories, revealing a critical gap in their reasoning capabilities.

Neemesh Yadav, Palakorn Achananuparp, Jing Jiang

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Apr 22, 2026·also HUST, Nankai University

R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling

R2IF achieves up to 34.62% better performance in function calling accuracy, bridging the gap between reasoning and decision-making in LLMs.

A. Cheng, Yongxin Zhao

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Wengyu Zhang +1·Apr 22, 2026

Mol-Debate: Multi-Agent Debate Improves Structural Reasoning in Molecular Design

Forget one-shot generation: Mol-Debate's iterative debate loop unlocks state-of-the-art molecular design by dynamically reconciling semantic intent with structural feasibility.

Wengyu Zhang, Xiao-Yong Wei

Reasoning & Chain-of-Thought Scientific Discovery & Drug Design Tool Use & Agents

Apr 22, 2026·also SJTU, Tencent AI

AROMA: Augmented Reasoning Over a Multimodal Architecture for Virtual Cell Genetic Perturbation Modeling

Achieve more reliable and interpretable virtual cell perturbation predictions by combining knowledge-driven multimodal modeling with evidence retrieval.

Zhenyu Wang, Geyan Ye, Wei Liu +1

Multimodal Models Reasoning & Chain-of-Thought Scientific Discovery & Drug Design

Apr 22, 2026·also Beihang, Fudan, HKU

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

Even the best large vision-language models struggle with multi-image reasoning, scoring only 50% on a new benchmark designed to challenge their capabilities.

Qiguang Chen, Chengyu Luan, Jiajun Wu +5

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Darsh Kachroo +4·Apr 22, 2026

HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

LLMs can learn to reason more effectively by breaking down the reasoning process and optimizing each step individually.

Darsh Kachroo, Adriana Caraeni, Arjun Prasaath Anbazhagan +2

Reasoning & Chain-of-Thought RLHF & Preference Learning

Pavel Salovskii +1·Apr 22, 2026

Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems

Ontology augmentation transforms LLMs into robust reasoning agents, significantly boosting performance in complex planning tasks.

Pavel Salovskii, Iuliia Gorshkova

Natural Language Processing Reasoning & Chain-of-Thought World Models & Planning

Thi Ngoc Trang Tran·Apr 22, 2026

Early-Stage Product Line Validation Using LLMs: A Study on Semi-Formal Blueprint Analysis

LLMs can now perform feature model analysis with near-solver accuracy directly from semi-formal blueprints, unlocking early validation in software product line scoping.

Thi Ngoc Trang Tran

Natural Language Processing Reasoning & Chain-of-Thought

Richard B. Arthur·Apr 22, 2026

A Field Guide to Decision Making

Machine intelligence can transform high-stakes decision-making by enhancing situational awareness and reducing uncertainty, ultimately fostering greater accountability.

Richard B. Arthur

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Jan-Philipp Schmidt·Apr 22, 2026

ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks

Open-source LLMs running on commodity hardware can rival proprietary models on complex actuarial reasoning tasks, but only if you use an LLM judge instead of multiple-choice questions to evaluate them.

Jan-Philipp Schmidt

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

Apr 22, 2026

Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness

LLMs can overcome flawed initial hypotheses and achieve state-of-the-art reasoning by proactively identifying and resolving missing information before committing to a solution.

Fulong Fan, Fengzhe Liu, Shuyan Yang +1

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Apr 22, 2026

Bimanual Robot Manipulation via Multi-Agent In-Context Learning

Standard LLMs can now perform complex bimanual robot manipulation tasks with impressive success rates, all without any task-specific training.

Alessio Palma, Indro Spinelli, Vignesh Prasad +4

Reasoning & Chain-of-Thought Robotics & Embodied AI Tool Use & Agents

Ryo Tamura +9·Apr 22, 2026

LLM-guided phase diagram construction through high-throughput experimentation

LLMs can autonomously navigate the notoriously complex task of alloy phase diagram construction, outperforming traditional ML methods and even exhibiting complementary strengths when combined with domain-specific models.

Ryo Tamura, Haruhiko Morito, Yuna Oikawa +7

Reasoning & Chain-of-Thought Scientific Discovery & Drug Design Tool Use & Agents

Feng Dong +7·Apr 22, 2026·also ZJU

Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data

LLMs can generate better features from tabular data when deployed as a multi-agent system with explicit memory of past procedures, feedback, and concepts.

Feng Dong, Zhi Zheng, Xiao Han +5

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Zibo Xu +5·Apr 22, 2026

Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA

Medical VQA models can now reason more reliably thanks to a new framework that disentangles true causal effects from spurious correlations by jointly tackling observable and unobservable confounders.

Zibo Xu, Qiang Li, Ke Lu +3

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Ronghao Ni +2·Apr 22, 2026

Taint-Style Vulnerability Detection and Confirmation for Node.js Packages Using LLM Agent Reasoning

Ronghao Ni, Mihai Christodorescu, Limin Jia

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Serhii Zabolotnii·Apr 22, 2026

LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation

Forget prompting LLMs to directly predict hundreds of fields: a two-stage approach with a stable intermediate JSON summary and a deterministic compiler achieves strong performance on CRF filling while being language-agnostic.

Serhii Zabolotnii

Natural Language Processing Reasoning & Chain-of-Thought Scientific Discovery & Drug Design

Apr 22, 2026

Where Reasoning Breaks: Logic-Aware Path Selection by Controlling Logical Connectives in LLMs Reasoning Chains

LLMs' reasoning chains are surprisingly fragile at logical connectives, but targeted interventions at these "forking points" can dramatically improve accuracy more efficiently than brute-force methods.

Seunghyun Park, Yuanyuan Lei

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Yicheng Pan +4·Apr 22, 2026

Learning to Solve the Quadratic Assignment Problem with Warm-Started MCMC Finetuning

Solving NP-hard combinatorial optimization problems like QAP just got a whole lot faster, thanks to a novel MCMC finetuning approach that achieves near-zero optimality gaps.

Yicheng Pan, Ruisong Zhou, Haijun Zou +2

Reasoning & Chain-of-Thought Tool Use & Agents World Models & Planning

Apr 22, 2026·also Ministry of Education Key Laboratory of Intelligent Networks and Network Security

Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

Smaller LLMs can achieve superior optimization performance by inheriting structured knowledge distilled from the memories of larger models, without any training.

Zesheng Yang, Bifan Wei

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Apr 22, 2026·also SJTU, UTS

Multi-Perspective Evidence Synthesis and Reasoning for Unsupervised Multimodal Entity Linking

LLMs can achieve state-of-the-art unsupervised multimodal entity linking by reasoning over diverse evidence types, including graph-based neighborhood information.

Mo Zhou, Jianwei Wang, Kai Wang +2

Multimodal Models Natural Language Processing Reasoning & Chain-of-Thought

Chenyuan Zhang +7·Apr 22, 2026·also Tsinghua AI, HIT, SJTU

Less Languages, Less Tokens: An Efficient Unified Logic Cross-lingual Chain-of-Thought Reasoning Framework

Reasoning across languages doesn't have to break the bank: a new framework slashes token costs by over 50% while maintaining accuracy, especially boosting performance in low-resource languages.

Chenyuan Zhang, Qiguang Chen, Xie Chen +5

Inference & Quantization Natural Language Processing Reasoning & Chain-of-Thought

School of Computer Science and Software Engineering·Apr 22, 2026·also University of Nottingham

SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark

MLLMs still struggle with the spatiotemporal reasoning needed to understand surgical videos, even with chain-of-thought prompting.

Gui Wang, YongSong Zhou, Kaijun Deng +4

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

School of Computer Science and Software Engineering·Apr 22, 2026·also Tsinghua AI, University of Nottingham, Wenzhou Medical University

X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis

MLLMs still struggle to integrate diverse data for clinical reasoning, as evidenced by their poor performance on a new ophthalmology benchmark spanning image quality assessment to diagnosis.

Gui Wang, Zehao Zhong, YongSong Zhou +6

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Apr 22, 2026·also Tencent AI

Hybrid Latent Reasoning with Decoupled Policy Optimization

Unleashing the full potential of multimodal LLMs requires reasoning directly in the visual latent space, and this paper shows how to do it with stable policy optimization.

Tao Cheng, Shi-Zhe Chen, Yixin Qin +1

Multimodal Models Reasoning & Chain-of-Thought Tool Use & Agents

Apr 22, 2026·also ETH, AI Center Tübingen, ELLIS, Tübingen +1

Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees

Deterministic decoding can outperform stochastic self-consistency in constrained domains by systematically exploring high-probability reasoning traces, leading to better performance with less computation.

Johannes Zenn, Guinan Su, Mrinmaya Sachan +1

Code Generation & Program Synthesis Inference & Quantization Reasoning & Chain-of-Thought

Qizhong Tan +4·Apr 22, 2026

Video-ToC: Video Tree-of-Cue Reasoning

Video-ToC drastically improves video understanding by forcing Video LLMs to focus on relevant visual cues, leading to state-of-the-art performance and reduced hallucinations.

Qizhong Tan, Zhuotao Tian, Guangming Lu +2

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Apr 22, 2026·also Google Research, VIA Research Center

R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

LVLMs can self-detect and correct object hallucinations by focusing on specific image regions, offering a simple, training-free fix.

Jiahao Xie, Nathalie Rauschmayr, Bernt Schiele

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Xiyang Wu +7·Apr 22, 2026

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

LLMs can learn to play complex games far more effectively by co-evolving a skill bank with a decision-making agent, enabling consistent long-horizon decision-making.

Xiyang Wu, Zongxia Li, Guangyao Shi +5

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents +1

Apr 22, 2026

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

VLMs are often functionally blind, exploiting language priors instead of truly "seeing" visual data, and this problem paradoxically *worsens* as language models scale.

Karan Goyal, Dikshant Kukreja

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Apr 21, 2026

Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation

Reasoning LLMs can now produce well-calibrated confidence estimates without labels or repeated sampling, unlocking more reliable real-world deployment.

Thomas Zollo, Jimmy Wang, Richard Zemel

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Ying Zeng +13·Apr 21, 2026·also vivo BlueImage Lab, vivo Mobile Communication Co.

SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing

Forget tedious manual adjustments: SmartPhotoCrafter automatically enhances photos by reasoning about image quality and generating targeted edits.

Ying Zeng, Miaosen Luo, Guangyuan Li +11

Computer Vision Reasoning & Chain-of-Thought Tool Use & Agents

Amirreza Akbari +2·Apr 21, 2026

The Logical Expressiveness of Topological Neural Networks

TNNs, a promising alternative to GNNs, can express precisely the binary classifiers definable in topological counting logic, revealing their superior expressive power.

Amirreza Akbari, Amauri H. Souza, Vikas Garg

Architecture Design (Transformers, SSMs, MoE) Reasoning & Chain-of-Thought

Ruihong Qiu +1·Apr 21, 2026·also UQ

TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only

LLMs can learn to reason over complex text-rich networks in a zero-shot manner using reinforcement learning alone, outperforming methods relying on supervised fine-tuning or distillation.

Ruihong Qiu, Zi Huang

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Department of Computer Science and Engineering·Apr 21, 2026·also Department of Artificial Intelligence, Department of Information Science and Engineering

Revac: A Social Deduction Reasoning Agent

Winning Mafia against human players requires more than just brute force: Revac-8 shows how combining memory, social network analysis, and adaptive communication can outwit even the most deceptive opponents.

Mihir Shriniwas Arya, Avinash Anish, Aditya Ranjan

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Zineng Dong +5·Apr 21, 2026·also SJTU

Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees

Autoformalization gets a major upgrade: DSR's neuro-symbolic approach leverages operator trees to outperform end-to-end LLMs, proving that structured representations are key to bridging human and formal mathematics.

Zineng Dong, Yi Bai, Yifan Bai +3

Code Generation & Program Synthesis Natural Language Processing Reasoning & Chain-of-Thought

Nathaniel Woodward +5·Apr 21, 2026

Fine-Tuning Small Reasoning Models for Quantum Field Theory

Small language models can achieve strong performance in specialized scientific domains like quantum field theory with targeted fine-tuning and synthetic data generation.

Nathaniel Woodward, Zhiqi Gao, Y. Kvasiuk +3

Data Curation & Synthetic Data Reasoning & Chain-of-Thought Scientific Discovery & Drug Design

Apr 21, 2026

Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

Teaching LLMs to perform arithmetic on images unlocks a new level of grounded reasoning, paving the way for robots that can understand and manipulate the world more like humans.

Chuou Xu, Liya Ji

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Anton Kolonin +3·Apr 21, 2026

Time Series Augmented Generation for Financial Applications

LLMs can achieve near-perfect tool use accuracy and minimal hallucination when reasoning about financial time series, but only if they're allowed to delegate to external tools.

Anton Kolonin, Alexey Glushchenko, Evgeny Bochkov +1

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

Jianzhi Yan +4·Apr 21, 2026

CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation

LLMs can now reason far better in low-resource domains, thanks to a new method that aligns their thinking with high-resource domains using "reasoning representation alignment."

Jianzhi Yan, Le Liu, Buzhou Tang +2

Data Curation & Synthetic Data Natural Language Processing Reasoning & Chain-of-Thought