Jun 4, 2026 · arXiv:2606.05635

ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions

Dehong Kong, Lina Lei, Lingtao Zheng, Chenyang Wu, Ailing Zhang, Xinran Qin, Teng Ma, Jiaqi Xu, Zhixin Wang, Zhikai Chen, Xuecheng Qi, Renjing Pei, Fan Li

AI Summary

This paper introduces ShotCrop$^3$, a novel approach for generating cinematic triple-shot compositions from a single human-centric image, addressing the gap in existing methods that focus solely on single aesthetic crops. By employing a three-stage training process that includes Chain-of-Thought supervised fine-tuning, semi-supervised learning with high-confidence pseudo labels, and Group Relative Policy Optimization, ShotCrop$^3$ enhances both reasoning and aesthetic capabilities. The method significantly outperforms GPT-5, achieving an average improvement of 2.82 times in shot localization accuracy, demonstrating its effectiveness for creative workflows requiring multi-shot compositions.

Key Contribution

ShotCrop$^3$ generates three distinct shots (establishing, medium, and close-up) from a single human-centric image, each paired with a brief shot description, and outperforms GPT-5 in shot localization accuracy by an average factor of 2.82.

Abstract

Prior work on aesthetic composition typically produces a single aesthetically pleasing crop, overlooking the narrative value of composing multiple shots from one scene. In practice, multi-shot composition is critical for downstream creative workflows: commercial posters often require multiple crops with different emphases (e.g., context, subject, and emotion/product details) to present key story beats. Therefore, we propose Triple-Shot Compositions (TSC), a composition task that generates a three-shot set (establishing, medium, and close-up) from a single human-centric image, each paired with a brief shot description to support visual narration. To learn TSC with limited expert annotations, we introduce ShotCrop, which undergoes a three-stage training process: it first applies Chain-of-Thought supervised fine-tuning to establish basic reasoning and aesthetic shot-cropping skills, then performs semi-supervised fine-tuning with high-confidence pseudo labels to further enhance aesthetic capability, and is finally optimized with Group Relative Policy Optimization for ShotCrop (GRPO-S) using a composite reward tailored for it. Specifically, our pseudo-labeling strategy combines MLLM-based scoring, aesthetic assessment, and CLIP similarity to retain high-confidence training signals. In addition, we present TSC-Bench, a benchmark of 1.2k expert-annotated test cases. Notably, ShotCrop achieves an average improvement of 2.82 times over GPT-5 in shot localization accuracy.
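The pseudo-labeling step described in the abstract can be illustrated with a minimal sketch. The abstract states only that three signals (MLLM-based scoring, aesthetic assessment, and CLIP similarity) are combined to retain high-confidence labels; it does not say how they are combined. The sketch below assumes a simple rule in which a candidate is kept only if every signal clears its own threshold. The `PseudoLabel` fields, the function name, and all threshold values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PseudoLabel:
    """One candidate crop with its three confidence signals (hypothetical fields)."""
    image_id: str
    shot_type: str                           # "establishing", "medium", or "close-up"
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2), normalized to [0, 1]
    mllm_score: float                        # assumed to be scaled to [0, 1]
    aesthetic_score: float                   # assumed to be scaled to [0, 1]
    clip_similarity: float                   # cosine similarity of crop and description


def keep_high_confidence(
    candidates: List[PseudoLabel],
    min_mllm: float = 0.8,        # assumed threshold, not from the paper
    min_aesthetic: float = 0.6,   # assumed threshold, not from the paper
    min_clip: float = 0.3,        # assumed threshold, not from the paper
) -> List[PseudoLabel]:
    """Keep a candidate only if all three signals clear their thresholds."""
    return [
        c
        for c in candidates
        if c.mllm_score >= min_mllm
        and c.aesthetic_score >= min_aesthetic
        and c.clip_similarity >= min_clip
    ]
```

Requiring all three signals to agree is a conservative choice: it trades pseudo-label volume for precision. A weighted sum of the scores would be an equally plausible reading of "combines".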

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
