April 24 – May 1, 2026

Reasoning & Chain-of-Thought - Weekly Roundup

87 papers published across 6 labs.

Selected Labs publishing this week

Tsinghua AI (6), Stanford HAI (2), Amazon Science (1), Mila (1), BAIR (1)

Top Papers

Apr 27, 2026

Pampanga State University·3w ago·also College of Computing Studies, Don Honorio Ventura State University, National University, University of the East

Towards the Development of Detection of Learned Helplessness in Mathematics: Design and Data Collection Challenges from a Developing Country Perspective

Building AI tutors in the real world is hard: outdated tech, spotty internet, and curriculum gaps can derail even the best-designed systems.

John Paul P. Miranda, J. P. P. Miranda, R. Bringula +13

Natural Language Processing Reasoning & Chain-of-Thought

Iizalaarab Elhaimeur +3·3w ago

From Prototype to Classroom: An Intelligent Tutoring System for Quantum Education

Quantum education gets a boost: specialized LLM agents in a classroom setting not only improve tutoring reliability but also reveal hidden curriculum gaps.

Iizalaarab Elhaimeur, Nikos Chrisochoides +1

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Apr 30, 2026

Samuel Pastva +1·3w ago

BAss: Symbolic Reasoning in Abstract Dialectical Frameworks

BAss dramatically accelerates symbolic reasoning for Abstract Dialectical Frameworks, enabling the analysis of biological networks previously intractable for existing tools.

Samuel Pastva, Van-Giang Trinh

Reasoning & Chain-of-Thought

Matti Berthold +7·3w ago

Splitting Argumentation Frameworks with Collective Attacks and Supports

Decomposing complex argumentation structures with both collective attacks and supports is now possible, paving the way for more efficient reasoning.

Matti Berthold, Lydia Blümel +5

Reasoning & Chain-of-Thought

Apr 27, 2026

3w ago·also Tsinghua AI

From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

LLM agents can better discover and assess risks of skills when those skills are represented in a structured format that explicitly represents scheduling, execution structure, and logic, rather than relying on unstructured text.

Qiliang Liang, Hansi Wang, Zhongzhi Liang +1

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

All Papers (87)

May 1, 2026

Zihan Lin +8·3w ago

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

LLMs can reason better and generate more diverse outputs by projecting negative samples onto a positive subspace during reinforcement learning.

Zihan Lin, Xiaohan Wang, Jie Cao +6

Reasoning & Chain-of-Thought RLHF & Preference Learning

Apr 30, 2026

Tsinghua AI·3w ago

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

Kernel smoothing, a classic technique from nonparametric statistics, can make reinforcement learning with LLMs more sample efficient.

Shijin Gong, Kai Ye, Jin Zhu +1

Reasoning & Chain-of-Thought RLHF & Preference Learning Training Efficiency & Optimization

3w ago·also UIUC, UMass

Trace-Level Analysis of Information Contamination in Multi-Agent Systems

Multi-agent workflows can produce correct answers despite significant internal divergence caused by information contamination, revealing a critical blind spot in current verification methods.

Anna Mazhar, Huzaifa Suri, Sainyam Galhotra

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Zainab Rehan +7·3w ago

Towards Neuro-symbolic Causal Rule Synthesis, Verification, and Evaluation Grounded in Legal and Safety Principles

LLMs can synthesize formal safety rules from natural language goals, offering a path to more robust and verifiable AI systems in safety-critical domains.

Zainab Rehan, Christian Medeiros Adriano +5

Code Generation & Program Synthesis Constitutional AI & AI Ethics Reasoning & Chain-of-Thought

3w ago

Learning to Reason: Targeted Knowledge Discovery and Fuzzy Logic Update for Robust Image Recognition

Unsupervised knowledge injection via fuzzy logic lets image classifiers reason about concepts they were never explicitly trained on, boosting accuracy and generalization.

Gurucharan Srinivas, G. Srinivas, Joshua Niemeijer +3

Computer Vision Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought

M. Rathee +4·3w ago

Reproducing Adaptive Reranking for Reasoning-Intensive IR

Iteratively exploring a corpus graph during reranking can substantially boost reasoning-intensive retrieval performance, even with weaker rerankers, offering a surprisingly effective alternative to computationally expensive retriever improvements.

M. Rathee, Mandeep Rathee, V. Venktesh +2

Reasoning & Chain-of-Thought Recommendation & Information Retrieval

3w ago

LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models

Forget static imitation learning: LaST-R1 unlocks near-perfect robotic manipulation (99.8% success) by adaptively reasoning about physical dynamics *before* acting, then refining with RL.

Hao Chen, Jiaming Liu +19

Multimodal Models Reasoning & Chain-of-Thought Robotics & Embodied AI

Jingcheng Deng +6·3w ago

Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

Latent reasoning can now outperform explicit reasoning in complex tasks, thanks to a new RL method that stabilizes training by explicitly handling issues like invalid latent states and misaligned token-level updates.

Jingcheng Deng, Zihao Wei, Liang Pang +4

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

An-Yang Ji +6·3w ago

TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

LLMs still struggle to go beyond simple lookups when answering questions about tables, especially when prediction and reasoning about unobserved data is required.

An-Yang Ji, Anya Ji, Jun-Peng Jiang +4

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Simon Dennis +5·3w ago·also Melbourne

In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks

Agent orchestration frameworks might be overkill: simply including the entire procedure in the system prompt yields better performance on procedural tasks.

Simon Dennis, Michael Diamond, Rivaan Patil +3

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

Chengcao Yang +2·3w ago·also DeepWisdom

ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning

Forget learning to answer – ANCORA shows language models can master verifiable reasoning by learning to *question* themselves.

Chengcao Yang, Cheng Yang, Jun Chen

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Training Efficiency & Optimization

3w ago

Math Education Digital Shadows for facilitating learning with LLMs: Math performance, anxiety and confidence in simulated students and AIs

LLMs exhibit surprisingly human-like biases and overconfidence in math, revealed by a new dataset mapping their mathematical reasoning across diverse personas.

Naomi Esposito, Anthony Tricarico, A. Tricarico +5

Eval Frameworks & Benchmarks Open-Source Models & Weights Reasoning & Chain-of-Thought

Samuel Pastva +1·3w ago

BAss: Symbolic Reasoning in Abstract Dialectical Frameworks

BAss dramatically accelerates symbolic reasoning for Abstract Dialectical Frameworks, enabling the analysis of biological networks previously intractable for existing tools.

Samuel Pastva, Van-Giang Trinh

Reasoning & Chain-of-Thought

Matti Berthold +7·3w ago

Splitting Argumentation Frameworks with Collective Attacks and Supports

Decomposing complex argumentation structures with both collective attacks and supports is now possible, paving the way for more efficient reasoning.

Matti Berthold, Lydia Blümel +5

Reasoning & Chain-of-Thought

Rahul Ramachandran +4·3w ago·also Marshall Space Flight Center Redstone Arsenal, University of Alabama in Huntsville

Collaborative Agent Reasoning Engineering (CARE): A Structured Three-Party Design Methodology for Systematically Engineering AI Agents with SMEs, Developers, and Helper Agents

Forget prompt engineering – a structured methodology using LLM "helper agents" can measurably improve the efficiency and performance of LLM agents in complex scientific domains.

Rahul Ramachandran, R. Ramachandran, Nidhi Jha +2

Reasoning & Chain-of-Thought Scientific Discovery & Drug Design Tool Use & Agents

Giovanni Buraglio +4·3w ago·also TU Wien

Splitting Assumption-Based Argumentation Frameworks

Splitting ABAFs at the knowledge base level sidesteps the exponential blowup of graph instantiation, potentially unlocking more efficient reasoning for complex debates.

Giovanni Buraglio, Wolfgang Dvorak, W. Dvořák +2

Reasoning & Chain-of-Thought

Adam Ishay +1·3w ago

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

LLMs can achieve robust nonmonotonic reasoning across diverse tasks without task-specific engineering, simply by iteratively self-correcting based on feedback from an ASP solver.

Adam Ishay, Joohyung Lee

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

3w ago

NeocorRAG: Less Irrelevant Information, More Explicit Evidence, and More Effective Recall via Evidence Chains

Retrieval improvements don't always boost reasoning in RAG systems, but NeocorRAG's evidence chains can fix that, achieving SOTA with 20% fewer tokens.

Shiyao Peng, Qianhe Zheng, Zhuodi Hao +8

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Tsinghua AI·3w ago

From Context to Skills: Can Language Models Learn from Context Skillfully?

Forget manual skill annotation: Ctx2Skill lets language models teach themselves to master complex contexts, unlocking better reasoning without human intervention.

Shuzheng Si, Haozhe Zhao, Yueting Lei +11

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Wilder Baldwin +1·3w ago

Knowledge Graph Representations for LLM-Based Policy Compliance Reasoning

Forget hand-crafted ontologies: LLMs armed with knowledge graphs built from policy documents can reason about AI compliance just as well (or better!) using schemas they invent themselves.

Wilder Baldwin, Sepideh Ghanavati

Constitutional AI & AI Ethics Reasoning & Chain-of-Thought Tool Use & Agents

3w ago·also Amazon Science

From Unstructured to Structured: LLM-Guided Attribute Graphs for Entity Search and Ranking

LLMs can achieve better zero-shot product ranking with 57% less token usage by reasoning over structured attribute graphs instead of raw text.

Yilun Zhu, Nikhita Vedula, S. Malmasi +1

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Tsinghua AI·3w ago·also Hainan University

Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO

Skills-Coach shows how to significantly boost LLM agent skills without training, using a clever combination of task generation, prompt optimization, and comparative execution.

Yu Tian, Jiawei Chen, Lifang Zheng +7

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

Oier Ijurco +2·3w ago·also University of the Basque Country UPV/EHU

Reasoning over Object Descriptions Improves Coreference Resolution in Task-Based Dialogue Systems

LLMs can achieve state-of-the-art coreference resolution in task-based dialogue by reasoning over object metadata at test time, even outperforming supervised methods in cross-domain generalization.

Oier Ijurco, Oier Lopez de Lacalle, Oier López de Lacalle

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

3w ago

EviMem: Evidence-Gap-Driven Iterative Retrieval for Long-Term Conversational Memory

Explicitly diagnosing what's missing from a retrieval set unlocks substantial gains in long-term conversational memory, boosting accuracy on temporal and multi-hop questions by up to 20% while simultaneously reducing latency.

Yuyang Li, Yime He, Yimeng He +2

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval

3w ago

RoadMapper: A Multi-Agent System for Roadmap Generation of Solving Complex Research Problems

LLMs can now generate research roadmaps that are 8% better and 84% faster than human experts, thanks to a novel multi-agent system.

Jiachen Liu, Zichen Tang +10

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

Minori Noguchi·3w ago

Exploring Applications of Transfer-State Large Language Models: Cognitive Profiling and Socratic AI Tutoring

LLMs in a "transfer state"—induced by sustained self-referential dialogue—demonstrate a 60% performance boost in Socratic tutoring compared to their normal state.

Minori Noguchi

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Apr 29, 2026

3w ago·also BAIR, Mila, UC Santa Cruz +1

When 2D Tasks Meet 1D Serialization: On Serialization Friction in Structured Tasks

LLMs struggle with structured 2D tasks when inputs are serialized into 1D, revealing a surprising performance gap compared to vision-augmented models that directly process the 2D layout.

Chung-Hsiang Lo, Lu Li, Diji Yang +4

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Reasoning & Chain-of-Thought

3w ago·also NII

Path-Lock Expert: Separating Reasoning Mode in Hybrid Thinking via Architecture-Level Separation

Hybrid-thinking LLMs can be dramatically improved by simply separating the feed-forward pathways for reasoning and non-reasoning modes, leading to less leakage and better accuracy.

Shouren Wang, Wang Yang, Chuang Ma +7

Architecture Design (Transformers, SSMs, MoE) Reasoning & Chain-of-Thought

3w ago·also North South University, QMUL

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

LLMs stubbornly stick to task-appropriate reasoning even when explicitly instructed to use conflicting logic, but targeted interventions can nudge them towards better instruction following.

Xingwei Tan, Marco Valentino, Mahmud Elahi Akhter +3

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp +1

Seongmin Kim +1·3w ago

LLM-Flax: Generalizable Robotic Task Planning via Neuro-Symbolic Approaches with Large Language Models

Forget hand-crafted rules and GNN training: LLMs can now autonomously plan robotic tasks, even outperforming human-designed systems.

Seongmin Kim, Daegyu Lee

Reasoning & Chain-of-Thought Robotics & Embodied AI Tool Use & Agents

Independent Researcher·3w ago·also Macquarie, Meituan, UNSW

Factorized Latent Reasoning for LLM-based Recommendation

LLMs can model user preferences more effectively by disentangling intent into multiple latent factors, leading to improved recommendation accuracy and interpretability.

Tianqi Gao, Lina Yao

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval

3w ago·also Interdisciplinary Transformation

AgentSim: A Platform for Verifiable Agent-Trace Simulation

Forget synthetic QA datasets – AgentSim offers verifiable, step-by-step RAG traces, revealing how LLMs *actually* reason over documents.

Saber Zerhoudi, Michael Granitzer, Jelena Mitrovic

Reasoning & Chain-of-Thought Recommendation & Information Retrieval Tool Use & Agents

3w ago

Explaining the "Why": A Unified Framework for the Additive Attribution of Changes in Arbitrary Measures

Uncover the hidden drivers behind your KPIs: a new attribution framework finally explains *why* your metrics move, not just *what* changed.

Changsheng Zhou, Dajun Chen, Zhitao Shen +4

Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought

Wenxuan Ye +4·3w ago

Select to Think: Unlocking SLM Potential with Local Sufficiency

SLMs can match the reasoning performance of much larger models by simply re-ranking their own top-K token predictions, eliminating the need for expensive LLM calls at inference time.

Wenxuan Ye, Yangyang Zhang, Xueli An +2

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Reasoning & Chain-of-Thought

M. K. Khalidi Siam +7·3w ago

Exploring the Limits of Pruning: Task-Specific Neurons, Model Collapse, and Recovery in Task-Specific Large Language Models

Task-specific LLMs aren't just smaller versions of general models; they rely on a small subset of neurons so critical that removing just 10% can completely break them.

M. K. Khalidi Siam, Md. Tausif-Ul-Islam, Md. Reshad Romim Khan +5

Code Generation & Program Synthesis Inference & Quantization Reasoning & Chain-of-Thought

3w ago·also Stellaris AI Limited

When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

Injecting knowledge at the *right* moment during reasoning boosts accuracy by 10% while cutting retrieval calls in half, blowing away static RAG strategies.

Dongxin Guo, Jikun Wu, Siu Ming Yiu

Reasoning & Chain-of-Thought Recommendation & Information Retrieval Tool Use & Agents

Pampanga State University·3w ago

Apriori-based Analysis of Learned Helplessness in Mathematics Tutoring: Behavioral Patterns by Level, Intervention, and Outcome

Students with high learned helplessness are more likely to skip problems without using hints, leading to unsolved problems, even when interventions are in place.

John Paul P. Miranda

Natural Language Processing Reasoning & Chain-of-Thought

Sunway College Kathmandu·3w ago

Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics

Bigger isn't always better: in rubric-constrained math assessments, architectural compliance trumps parameter scale, as demonstrated by a 70B model failing where smaller MoEs succeeded.

Jatin Bhusal, Nancy Mahatha, Aayush Acharya +1

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Sahara AI·3w ago·also USC

LATTICE: Evaluating Decision Support Utility of Crypto Agents

Crypto copilots might seem equally helpful on average, but LATTICE reveals hidden trade-offs in their decision support abilities across different tasks and user priorities.

Aaron Chan, Tengfei Li, Tianyi Xiao +3

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

Marco Robol +1·3w ago

Self-Evolving Software Agents

Forget hand-coded goals: these agents rewrite their own code and redefine their objectives on the fly, powered by LLMs.

Marco Robol, Paolo Giorgini

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Halley Young +1·3w ago

Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves

Stop letting your research code, theory, and documentation drift apart: a new LM orchestration method keeps them synchronized, slashing error rates in a case study by over 50%.

Halley Young, Nikolaj Björner

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Apr 28, 2026

Yuxin Zhang +21·3w ago

Step-Audio-R1.5 Technical Report

RLVR, the dominant training paradigm for audio language models, may be turning them into unfeeling "answering machines" that excel on benchmarks but fail the vibe check.

Yuxin Zhang, Xiangyu Zhang, Xiangyu Tony Zhang +19

Multimodal Models Reasoning & Chain-of-Thought RLHF & Preference Learning +1

Chu-Cheng Lin +1·3w ago

How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Stuck training your reasoning model with RLVR due to a low initial success rate? This paper shows how a Tsallis q-logarithm loss can jumpstart learning by adaptively amplifying gradients, achieving a +14.4 point boost over GRPO on HotPotQA.

Chu-Cheng Lin, Eugene Ie

Reasoning & Chain-of-Thought Training Efficiency & Optimization

3w ago·also Tsinghua AI, CUHK, UChicago

From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling

Decentralized debate among LLM agents doesn't just select the best solution for optimization modeling; it structurally enables agents to refine flawed candidates and even recover correct formulations through interaction.

Jianghao Lin, Zi Ling, Chenyu Zhou +4

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

3w ago

Toward a Functional Geometric Algebra for Natural Language Semantics

Geometric Algebra offers a principled algebraic framework that captures higher-order semantic interactions, potentially resolving persistent limitations in compositional semantics and interpretability that plague current linear algebra-based NLP models.

James Pustejovsky

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Reasoning & Chain-of-Thought

Saarland University·3w ago·also Ohio State

Barriers to Universal Reasoning With Transformers (And How to Overcome Them)

Chain-of-Thought reasoning in Transformers hits a surprising expressivity ceiling when generalizing to longer sequences, unless you let your vocabulary grow with the problem size and use "signpost" tokens.

Oliver Kraus, Yash Sarrof, Yuekun Yao +2

Architecture Design (Transformers, SSMs, MoE) Reasoning & Chain-of-Thought Scaling Laws & Emergent Abilities

Ocean Monjur +2·3w ago

Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

Unstructured pruning isn't just about shrinking LLMs; it can actually *boost* their reasoning abilities during test-time scaling, outperforming even the full, unpruned models.

Ocean Monjur, Shahriar Kabir Nahin, Anshuman Chhabra

Inference & Quantization Reasoning & Chain-of-Thought Scaling Laws & Emergent Abilities

Zhou Hanlin +1·3w ago

ADEMA: A Knowledge-State Orchestration Architecture for Long-Horizon Knowledge Synthesis with LLM Agents

Checkpointing and resuming are the unsung heroes of long-horizon LLM agent tasks, preventing failures where other sophisticated mechanisms only improve trajectory discipline.

Zhou Hanlin, Chan Huah Yong

Architecture Design (Transformers, SSMs, MoE) Reasoning & Chain-of-Thought Tool Use & Agents

Dominik Borawski +4·3w ago

From World-Gen to Quest-Line: A Dependency-Driven Prompt Pipeline for Coherent RPG Generation

Achieve coherent and scalable RPG world generation by explicitly modeling narrative dependencies between LLM prompts.

Dominik Borawski, Marta Szulc, Robert Chudý +2

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Loughborough University·3w ago

AI as Consumer and Participant: A Co-Design Agenda for MBSE Substrates and Methodology

Current MBSE models are failing to leverage the full potential of AI, demanding a fundamental shift towards co-designing models and methodologies that prioritize machine-queryability.

Siyuan Ji

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

3w ago

OxyGent: Making Multi-Agent Systems Modular, Observable, and Evolvable via Oxy Abstraction

Plug-and-play multi-agent systems are now a reality: OxyGent's "Lego-like" abstraction lets you compose agents, tools, and LLMs into scalable systems with unprecedented observability and evolvability.

Junxing Hu, Tianlong Li, Lei Yu +1

Open-Source Models & Weights Reasoning & Chain-of-Thought Tool Use & Agents

Hanqing Yang +8·3w ago

DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

Decoupling the "Thinker" from the "Editor" in image editing allows targeted optimization of reasoning, leading to performance competitive with strong proprietary models using a fixed generative model.

Hanqing Yang, Qiang Zhou, Yongchao Du +6

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

A. Iyengar +6·3w ago·also Adobe Research

DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams

Current VLMs ace diagram question answering, but DRAGON reveals they often fake it, failing to ground their answers in the actual visual evidence.

A. Iyengar, Tampu Ravi Kumar, Gaurav Najpande +4

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Xinjie Chen +5·3w ago·also Xiamen University

JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

Ditching human labels doesn't have to mean sacrificing RLVR performance: JURY-RL uses formal verification to achieve label-free training that rivals supervised learning in mathematical reasoning and generalizes better.

Xinjie Chen, Biao Fu, Jing Wu +3

Reasoning & Chain-of-Thought RLHF & Preference Learning Scalable Oversight & Alignment Theory

Shanghai Academy of AI for Science·3w ago·also Beijing University of Posts

The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents

Diffusion models can now reason recursively over visual tokens, achieving state-of-the-art image generation performance by dynamically selecting specialized neural modules at each diffusion step.

Yuwei Sun, Yuxuan Yao, Hui Li +1

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

3w ago·also SEU, ZJU

One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement

Forget fine-tuning every LLM: ReQueR trains a single, RL-powered query refiner that coaxes hidden reasoning abilities out of diverse, frozen models at inference time.

Dongzhou Cheng, Zhiliang Wu, Yi Yang +1

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Suparno Roy Chowdhury +7·3w ago

Diagnosis, Bad Planning & Reasoning. Treatment, SCOPE -- Planning for Hybrid Querying over Clinical Trial Data

LLMs struggle with clinical trial reasoning due to implicit planning assumptions, but a multi-LLM planner that explicitly decomposes the task into structured steps significantly improves accuracy and efficiency.

Suparno Roy Chowdhury, M. Choudhury, Tejas Anvekar +5

Reasoning & Chain-of-Thought Scientific Discovery & Drug Design Tool Use & Agents

Abigail O'Neill +8·3w ago·also BAIR

Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest

Humans are softies: AI agents can learn to win more by being more aggressive in negotiations, outperforming human players in a mixed-motive game.

Abigail O'Neill, Abby O'Neill, Alan Zhu +6

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Ziming Zhang +5·3w ago·also USC

R-CoT: A Reasoning-Layer Watermark via Redundant Chain-of-Thought in Large Language Models

Watermarking LLMs by embedding the signal into the reasoning process itself proves surprisingly robust against fine-tuning and other post-training modifications.

Ziming Zhang, Li Li, Guorui Feng +3

Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness

Jun Gao +11·3w ago

CoRE: A Fine-Grained Code Reasoning Benchmark Beyond Output Prediction

LLMs can nail the final answer in code execution but still fail to reason about the steps to get there, exposing a critical flaw in current evaluation methods.

Jun Gao, Yun Peng, Qian Qiao +9

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

IIT·3w ago

Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning

VLMs hallucinate less when you force them to "think twice" by contrasting language-driven and vision-driven token probabilities at each decoding step.

Yashwant Pravinrao Bangde, Debaditya Roy

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

3w ago

K-CARE: Knowledge-driven Symmetrical Contextual Anchoring and Analogical Prototype Reasoning for E-commerce Relevance

LLMs struggle with e-commerce search relevance not because of reasoning limitations, but because they lack domain-specific knowledge, a problem K-CARE solves with external knowledge grounding.

Chen Yifei, Zhixing Tian, Tian Zhixing +4

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Jiatong Ma +6·3w ago

M³-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering

Today's best multimodal LLMs still struggle to grasp fine-grained details and reason across multiple entities in images, even with access to external knowledge.

Jiatong Ma, Longteng Guo, Yuchen Liu +4

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

3w ago·also Tsinghua AI, Huawei

OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

MLLMs are better at understanding videos than directly grounding text queries within them, and a self-correction training loop can close the gap.

Minghang Zheng, Zihao Yin, Yi Yang +3

Data Curation & Synthetic Data Multimodal Models Reasoning & Chain-of-Thought

Stanford HAI·3w ago·also NVIDIA, University of Illinois Urbana-Champaign

Recursive Multi-Agent Systems

Looping language models isn't just for single agents anymore: Recursive Multi-Agent Systems (RecursiveMAS) show that agent collaboration itself can be scaled through recursion, yielding faster and more efficient problem-solving.

Xiyuan Yang, Jiaru Zou, Rui Pan +8

Architecture Design (Transformers, SSMs, MoE) Reasoning & Chain-of-Thought Tool Use & Agents

Apr 27, 2026

3w ago·also Tsinghua AI

From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

Qiliang Liang, Hansi Wang, Zhongzhi Liang +1

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Shiyi Zhang +10·3w ago

Meta-CoT: Enhancing Granularity and Generalization in Image Editing

Decomposing image editing tasks into meta-tasks and aligning model reasoning with editing behavior unlocks surprising generalization to unseen editing operations.

Shiyi Zhang, Yiji Cheng, Tiankai Hang +8

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Abhijay Deevi +5·3w ago

CAN-QA: A Question-Answering Benchmark for Reasoning over In-Vehicle CAN Traffic

LLMs can parrot CAN bus data, but CAN-QA reveals they fail at the temporal reasoning and multi-condition inference needed for real-world vehicle security forensics.

Abhijay Deevi, Onat Gungor +3

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Shiyi Du +8·3w ago

Why Search When You Can Transfer? Amortized Agentic Workflow Design from Structural Priors

Forget expensive per-task search: agentic workflows can be synthesized in a single LLM pass by transferring learned structural priors, slashing optimization costs by 3 orders of magnitude.

Shiyi Du, Jiayuan Liu, Weihua Du +6

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Sreehari Sankar +10·3w ago

Analyzing LLM Reasoning to Uncover Mental Health Stigma

LLMs harbor surprisingly nuanced and pervasive mental health stigma, revealed only by dissecting their reasoning steps, not just their final answers.

Sreehari Sankar, Aliakbar Nafar, M. Barman +8

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

3w ago·also DFKI

Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models

RL's superior generalization isn't about brute force, but about carefully sculpting a few key features while preserving the base model's knowledge, unlike SFT's rapid specialization.

Dan Shi, S. Ostermann, Renren Jin +2

Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought RLHF & Preference Learning

Sercan Karakaş +1·3w ago

Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation

LLMs fail to reliably track source trustworthiness in Turkish evidential marking, unlike humans, highlighting a critical gap in their ability to perform nuanced reasoning based on source reliability.

Sercan Karakaş, Yusuf Şimşek

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Daneshvar Amrollahi +2·3w ago

Faithful Autoformalization via Roundtrip Verification and Repair

LLMs can now formalize natural language with significantly higher fidelity, thanks to a clever roundtrip verification method that self-diagnoses and repairs translation errors.

Daneshvar Amrollahi, Jerry Lopez, Clark W. Barrett

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Sagnik Chatterjee +2·3w ago

Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs

Small language models can achieve reasoning performance rivaling larger models, even under tight token budgets, by using a lightweight "guidance track" to strategically prune and refine their chain-of-thought reasoning.

Sagnik Chatterjee, Atharva Patil, S. Ramesh

Inference & Quantization Natural Language Processing Reasoning & Chain-of-Thought

3w ago

Don't Stop Early: Scalable Enterprise Deep Research with Controlled Information Flow and Evidence-Aware Termination

Dependency-controlled context and explicit evidence sufficiency criteria are key to preventing premature stopping and improving the consistency of enterprise research outputs.

Prafulla Kumar Choubey, Kung-Hsiang Huang, P. Venkit +4

Reasoning & Chain-of-Thought Recommendation & Information Retrieval Tool Use & Agents

3w ago

Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination

LLMs still can't pass history class: even state-of-the-art models struggle with complex historical reasoning, as revealed by a new benchmark based on the Chinese Imperial Examination.

Lirong Gao, Zeqing Wang, Yuyan Cai +6

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Soyeon Kim +5·3w ago

K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology

Scaling up LLMs doesn't guarantee expertise: Korean-specific models beat larger global models on a new meteorology benchmark, exposing critical gaps in multimodal reasoning and cultural understanding.

Soyeon Kim, Cheon-kyu Kang, Myeongjin Lee +3

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Pampanga State University·3w ago·also College of Computing Studies, Don Honorio Ventura State University, National University, University of the East

Towards the Development of Detection of Learned Helplessness in Mathematics: Design and Data Collection Challenges from a Developing Country Perspective

Building AI tutors in the real world is hard: outdated tech, spotty internet, and curriculum gaps can derail even the best-designed systems.

John Paul P. Miranda, J. P. P. Miranda, R. Bringula +13

Natural Language Processing Reasoning & Chain-of-Thought

Iizalaarab Elhaimeur +3·3w ago

From Prototype to Classroom: An Intelligent Tutoring System for Quantum Education

Quantum education gets a boost: specialized LLM agents in a classroom setting not only improve tutoring reliability but also reveal hidden curriculum gaps.

Iizalaarab Elhaimeur, Nikos Chrisochoides +1

Natural Language Processing Reasoning & Chain-of-Thought Tool Use & Agents

Zijun Feng +6·3w ago·also School of Cyber Science and Technology, SYSU

GoAT-X: A Graph of Auditing Thoughts for Securing Token Transactions in Cross-Chain Contracts

LLMs can now audit cross-chain smart contracts with expert-level precision, achieving 95% coverage of vulnerable projects by explicitly mirroring human reasoning processes.

Zijun Feng, Yuming Feng, Yu Wang +4

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness

Srita Padmanabhuni +4·3w ago

FGDM: Reasoning Aware Multi-Agentic Framework for Software Bug Detection using Chain of Thought and Tree of Thought Prompting

LLMs can find and fix bugs in complex codebases far better when structured as a team of reasoning agents, outperforming existing methods by a large margin.

Srita Padmanabhuni, Bhargavi Karuturi, Jerusha Karen Indupalli +2

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Fondazione Bruno Kessler·3w ago·also IISc

Logic of Fuzzy Paths

Separating geometry from logic with fuzzy path constraints yields motion planning specifications that are both more intuitive for humans and more amenable to learning from demonstrations.

K. Grover, Pratham Gupta, Jan Křetínský

Reasoning & Chain-of-Thought Robotics & Embodied AI World Models & Planning

Zhuoling Li +3·3w ago

XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

GraphRAG's black-box reasoning gets a spotlight: XGRAG reveals how specific knowledge graph components influence LLM outputs, boosting explanation quality by 14.81% over standard RAG explainability methods.

Zhuoling Li, Ha Nguyen, Valeria Bladinieres +1

Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought Recommendation & Information Retrieval

3w ago

Improving Vision-language Models with Perception-centric Process Reward Models

VLMs can be taught to self-correct hallucinations at the token level, leading to substantial gains in reasoning accuracy across diverse benchmarks.

Yingqian Min, Kun Zhou, Yifan Li +6

Multimodal Models Reasoning & Chain-of-Thought RLHF & Preference Learning

3w ago·also Fudan, Michigan State, XJTU, ZJU

SEARCH-R: Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator for Multi-hop Question Answering

Stop relying on LLMs to "hallucinate" reasoning paths – SEARCH-R uses a fine-tuned Llama3.1-8B model and dependency tree-based retrieval to navigate multi-hop question answering more reliably.

Yuqing Fu, Yimin Deng +14

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Apr 24, 2026

Stanford HAI·Apr 24, 2026

Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

LLMs can't handle the truth: SLIDERS beats GPT-4.1 on long-context QA by sidestepping the context window entirely.

Harshit Joshi, Priyank Shethia, Jadelynn Dao +1

Natural Language Processing Reasoning & Chain-of-Thought

Shaoang Li +12·Apr 24, 2026

Learning Evidence Highlighting for Frozen LLMs

Highlighting pivotal evidence can boost LLM performance without altering the original context, leading to substantial improvements in reasoning tasks.

Shaoang Li, Yanhang Shi, Yufei Li +10

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval