May 21, 2026 · arXiv:2605.22126

AesFormer: Transform Everyday Photos into Beautiful Memories

AI Summary

This paper introduces Aesthetic Photo Reconstruction (APR), a novel task focused on improving the aesthetic quality of everyday photos by addressing structural flaws while preserving identity and semantics. To tackle this, the authors propose AesFormer, a two-stage framework that first plans aesthetic actions using an aesthetic action model (AesThinker) trained with GRPO-A for diverse exploration, and then executes these actions via an action-conditioned editor (AesEditor). The work also contributes AesRecon, a new benchmark dataset of 9,071 aligned (poor, good) image pairs, demonstrating AesFormer's superior performance in APR compared to existing methods.

Key Contribution

You can now automatically transform structurally flawed photos into aesthetically pleasing images, thanks to a new framework that plans and executes edits based on photographic principles.

Abstract

In everyday photography, aesthetically appealing moments are often captured with structural flaws (e.g., composition, camera viewpoint, or pose) that existing retouching and portrait enhancement methods cannot fix. We formulate Aesthetic Photo Reconstruction (APR) as improving a photo's aesthetic quality via structural reconstruction while preserving subject identity and scene semantics. Although recent advances in image editing models make APR feasible, they often lack aesthetic understanding, yielding edits that are semantically plausible yet aesthetically weak. To address this, we propose AesFormer, a two-stage framework that decouples aesthetic planning from image editing. In Stage 1, an aesthetic action model (AesThinker) analyzes the input along seven progressive photographic dimensions and outputs executable editing actions; we further apply GRPO-A to encourage broad exploration over diverse action plans beyond SFT. In Stage 2, an action-conditioned editor (AesEditor) performs structural edits guided by these actions. To support APR, we build a video-based corpus-mining pipeline (VCMP) and construct AesRecon, a benchmark of 9,071 strictly aligned (poor, good) image pairs. Experiments show that AesFormer substantially improves APR performance and is competitive with Nano Banana Pro. Code is available at https://github.com/PKU-ICST-MIPL/AesFormer_ICML2026.
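The two-stage decoupling described in the abstract (plan actions first, then run an action-conditioned edit) can be illustrated with a minimal sketch. This is not the authors' implementation: `Action`, `toy_planner`, `toy_editor`, and `reconstruct` are hypothetical names, images are stood in for by plain strings, and the seven photographic dimensions are represented only by placeholder indices because the abstract does not name them.

```python
from dataclasses import dataclass
from typing import Callable, List

# The abstract states seven photographic dimensions but does not name them,
# so they appear here only as placeholder indices 0..6 (an assumption).
NUM_DIMENSIONS = 7


@dataclass(frozen=True)
class Action:
    """Hypothetical container for one executable editing action."""
    dimension: int    # placeholder index of the photographic dimension
    instruction: str  # free-text editing instruction


# Stage 1 maps an image to a plan; Stage 2 maps (image, plan) to an edited image.
Planner = Callable[[str], List[Action]]
Editor = Callable[[str, List[Action]], str]


def toy_planner(image: str) -> List[Action]:
    """Stand-in for the planning stage: emit one action per dimension."""
    return [
        Action(dimension=d, instruction=f"adjust dimension {d} of {image}")
        for d in range(NUM_DIMENSIONS)
    ]


def toy_editor(image: str, actions: List[Action]) -> str:
    """Stand-in for the editing stage: record how many actions were applied."""
    return f"{image}+{len(actions)}edits"


def reconstruct(image: str, planner: Planner, editor: Editor) -> str:
    """Run planning first, then action-conditioned editing."""
    actions = planner(image)
    return editor(image, actions)


result = reconstruct("photo.jpg", toy_planner, toy_editor)
print(result)
```

Keeping the planner and editor behind separate callables mirrors the decoupling in the abstract: either stage could be swapped out without touching the other.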

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
