Fudan, HKU, HKUST, Sydney, ZJU | Apr 30, 2026 | arXiv:2604.28078

AesRM: Improving Video Aesthetics with Expert-Level Feedback

Yujin Han, Yujie Wei, Yefei He, Xinyu Liu, Tianle Li, Zichao Yu, Andi Han, Shiwei Zhang, Tingyu Weng, Difan Zou

AI Summary

This paper introduces a hierarchical rubric for evaluating video aesthetics that decomposes the concept into three dimensions, Visual Aesthetics (VA), Visual Fidelity (VF), and Visual Plausibility (VP), covering 15 fine-grained criteria. The authors create AesVideo-Bench, a large-scale expert-annotated preference dataset, and train a family of Video Aesthetic Reward Models (AesRM), AesRM-Base and AesRM-CoT, using a three-stage progressive training scheme. Experiments show that AesRM outperforms baselines on multiple aesthetics benchmarks, is more robust with lower position bias, and improves video aesthetics when used to align Wan2.2.

Key Contribution

Expert-level video aesthetics can be captured and improved by decomposing them into interpretable criteria and training reward models with chain-of-thought reasoning.

Abstract

Despite rapid advances in photorealistic video generation, real-world applications such as filmmaking require video aesthetics, e.g., harmonious colors and cinematic lighting, beyond visual fidelity. Prior work on visual aesthetics largely focuses on images, often reducing aesthetics to coarse definitions, e.g., visual pleasure, without a rigorous and systematic evaluation. To improve video aesthetics, we propose a hierarchical rubric that decomposes video aesthetics into three core dimensions, Visual Aesthetics (VA), Visual Fidelity (VF), and Visual Plausibility (VP), with 15 fine-grained criteria, e.g., shot composition. This framework enables a large-scale expert-annotated preference dataset and an evaluation benchmark, AesVideo-Bench, containing about 2500 video pairs with expert annotations on VA, VF, and VP. We then build a family of Video Aesthetic Reward Models (AesRM): AesRM-Base, which directly predicts pairwise preferences on these dimensions to provide efficient post-training rewards, and AesRM-CoT, which additionally generates CoT aligned with all 15 criteria to improve assessment interpretability. Specifically, we train AesRM with a three-stage progressive scheme: (1) Atomic Aesthetic Capability Learning, which strengthens AesRM's recognition of fundamental aesthetic concepts, e.g., accurately identifying centered composition; (2) Cold-Start, aligning the model with structured reasoning protocols; and (3) GRPO, further improving evaluation accuracy. To enhance AesRM-CoT, we additionally propose self-consistency-based CoT synthesis to improve CoT quality and design CoT-based process rewards during GRPO. Extensive experiments show AesRM outperforms baselines on multiple aesthetics benchmarks and is more robust, with lower position bias. Finally, we align Wan2.2 with AesRM and observe clear aesthetic gains over existing aesthetic reward models.
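The abstract reports that AesRM has lower position bias but does not say how that bias is measured. As a rough illustration only, one common way to quantify position bias for a pairwise judge is to present each video pair in both orders and count how often the preferred video changes. The sketch below assumes a hypothetical judge interface (a callable returning 0 or 1 for the preferred slot); the names `position_flip_rate`, `always_first`, and `by_score` are illustrative and are not taken from the paper.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

# Hypothetical judge interface (an assumption, not the paper's API):
# takes (first_video, second_video) and returns 0 if it prefers the
# video shown first, 1 if it prefers the video shown second.
Judge = Callable[[Hashable, Hashable], int]


def position_flip_rate(judge: Judge,
                       pairs: Sequence[Tuple[Hashable, Hashable]]) -> float:
    """Fraction of pairs whose preferred video changes when the
    presentation order is swapped. 0.0 means order-invariant."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    flips = 0
    for a, b in pairs:
        # Resolve the slot index back to the identity of the video.
        winner_ab = a if judge(a, b) == 0 else b
        winner_ba = b if judge(b, a) == 0 else a
        if winner_ab != winner_ba:
            flips += 1
    return flips / len(pairs)


def always_first(first: Hashable, second: Hashable) -> int:
    """Toy judge with maximal position bias: always picks slot 0."""
    return 0


def by_score(scores: Dict[Hashable, float]) -> Judge:
    """Toy judge that ignores position and compares fixed scores."""
    def judge(first: Hashable, second: Hashable) -> int:
        return 0 if scores[first] >= scores[second] else 1
    return judge


if __name__ == "__main__":
    pairs = [("v1", "v2"), ("v3", "v4")]
    scores = {"v1": 0.9, "v2": 0.4, "v3": 0.2, "v4": 0.7}
    print(position_flip_rate(always_first, pairs))
    print(position_flip_rate(by_score(scores), pairs))
```

Under this definition a judge that always favors the first slot flips on every pair, while a judge whose decision depends only on video content never flips (ties aside). The benchmark's actual protocol may differ.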

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
