Microsoft Research | Adobe Research | Mar 7, 2026 | arXiv:2603.07148

Agentic Planning with Reasoning for Image Styling via Offline RL

Subhojyoti Mukherjee, Stefano Petrangeli, B. Kveton, Trung Bui, Franck Dernoncourt, Arko Provo Mukherjee

AI Summary

The paper introduces a tool-based agentic planning framework for image styling that decomposes complex transformations into interpretable tool sequences using a compositional library of primitive transformations, structured context representation, and explicit per-step reasoning. To train this framework, the authors create three large-scale synthetic datasets (~10K trajectories each) with reasoning chains, plans, and quality scores, which are then used for offline RL post-training. Experiments with 4B and 8B parameter Qwen3-VL models show that the proposed method outperforms baselines in visual quality and instruction following, as validated by human evaluations.
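
To make the idea of an interpretable tool sequence concrete, here is a minimal sketch of a plan as a list of steps, each naming a primitive tool, its parameters, and a per-step rationale. Everything in it is an assumption for illustration: the `Step` type, the two tools, and the dictionary stand-in for an image are hypothetical and are not taken from the paper's tool library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for an image: a dict of scalar style attributes.
Image = Dict[str, float]


@dataclass
class Step:
    tool: str                 # name of a primitive in the (hypothetical) library
    params: Dict[str, float]  # keyword arguments passed to that primitive
    reasoning: str            # explicit per-step rationale, kept for inspection


def adjust_brightness(img: Image, delta: float) -> Image:
    """Return a copy of img with brightness shifted by delta."""
    out = dict(img)
    out["brightness"] = out.get("brightness", 0.0) + delta
    return out


def adjust_saturation(img: Image, delta: float) -> Image:
    """Return a copy of img with saturation shifted by delta."""
    out = dict(img)
    out["saturation"] = out.get("saturation", 0.0) + delta
    return out


# Illustrative tool library: each primitive changes one attribute only.
TOOLS: Dict[str, Callable[..., Image]] = {
    "adjust_brightness": adjust_brightness,
    "adjust_saturation": adjust_saturation,
}


def execute_plan(img: Image, plan: List[Step]) -> Image:
    """Apply each step's tool in order and return the final image."""
    for step in plan:
        img = TOOLS[step.tool](img, **step.params)
    return img


plan = [
    Step("adjust_brightness", {"delta": 0.2}, "Scene is underexposed."),
    Step("adjust_saturation", {"delta": -0.1}, "A muted look was requested."),
]
result = execute_plan({"brightness": 0.5, "saturation": 0.5}, plan)
```

Because every step records its tool, parameters, and rationale, the plan can be read and audited before or after it runs, which is the sense of "interpretable" used here.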

Key Contribution

Forget direct prompt editing: this agentic planning framework, powered by offline RL and synthetic data, masters complex image styling by breaking it down into interpretable tool sequences.

Abstract

Direct prompt-based editing often fails on complex transformations because vague, subjective prompts require a nuanced understanding of what should change in the image. Our core intuition is that compositional image editing tools, unlike direct prompting, benefit from structured agent-level planning with explicit reasoning, which leads to better results. This structured planning framework also enables efficient offline RL post-training on quality-scored trajectories. We present a tool-based agentic RL post-training framework that addresses these failures through structured planning with chain-of-thought reasoning. Our key contributions include: (1) A tool-based agentic planning methodology that combines a compositional library of orthogonal primitive transformations, structured context representation, and explicit per-step reasoning to decompose complex styling into interpretable tool sequences. (2) A synthetic data generation pipeline producing three large-scale datasets (each $\sim$10K trajectories) with reasoning chains, plans, and quality scores, as no existing datasets provide such supervision. Our datasets and code are publicly available at the HuggingFace repository. (3) Offline RL training methods for learning planners with reasoning, our core algorithmic contribution, which consistently improve over the Edit-Only baseline in visual quality and instruction following. (4) Comprehensive evaluation across 4B and 8B parameter Qwen3-VL models showing that our methods outperform the baselines in the majority of compositional tasks, validated by human evaluations.
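
The abstract does not specify the offline RL objective, so the following is only a sketch of one common way to learn from quality-scored trajectories: reward-weighted likelihood, where each logged trajectory's log-probability under the planner is weighted by a softmax of its quality score. This is a swapped-in, generic technique rather than the authors' method, and the function names and `temperature` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def trajectory_weights(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over quality scores: higher-scored trajectories get more weight."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_nll(log_probs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted negative log-likelihood of the logged trajectories.

    log_probs[i] is assumed to be the planner's log-probability of
    trajectory i (reasoning plus tool plan); no model is evaluated here.
    """
    return -sum(w * lp for w, lp in zip(weights, log_probs))


scores = [1.0, 2.0, 3.0]        # illustrative quality scores
log_probs = [-1.0, -1.0, -1.0]  # illustrative planner log-probabilities
weights = trajectory_weights(scores)
loss = weighted_nll(log_probs, weights)
```

A lower `temperature` concentrates the weight on the best-scored trajectories, while a higher one approaches plain supervised fine-tuning on all of them.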

Computer Vision | Tool Use & Agents | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
