The authors introduce LIVE-YT VC, a large-scale subjective video portrait region cropping database containing 1800 videos annotated by 90 human subjects, designed to facilitate research on intelligent video aspect ratio adaptation. To make the annotations temporally smoother, they also create LIVE-YT VC++, a post-processed version of the dataset in which a novel intra-frame temporal filter smooths the subjective annotations within each video. Experiments using the SmartVidCrop algorithm and fine-tuned video grounding models demonstrate the dataset's utility, and the authors propose it as a benchmark for future research on video cropping and aspect ratio transformation.
A new large-scale dataset of human-annotated video crops enables training models that adapt videos to different aspect ratios while preserving visual quality and meaning.
With the rise of mobile video consumption on handheld displays with diverse resolutions and orientation modes, adapting videos to different aspect ratios poses challenges. Static cropping and border padding often compromise visual quality, while warping may distort a video's intended meaning. Here we advocate for a more effective approach: temporally cropping significant regions within video frames, while minimizing distortion and preserving essential content. One barrier to solving this problem is the lack of a sufficiently large-scale database devoted to informing these tasks. Towards filling this gap, we introduce the LIVE-YouTube Video Cropping (LIVE-YT VC) database, featuring 1800 videos annotated by 90 human subjects. Built from videos sourced from the YouTube-UGC and LSVQ databases, this new resource is the largest publicly available subjective video portrait region cropping database. We also introduce a post-processed version of the database, called LIVE-YT VC++, in which a novel intra-frame temporal filter is applied to smooth the subjective annotations within each video. We demonstrate the usefulness of this new data resource using the SmartVidCrop algorithm and state-of-the-art video grounding models, in hopes of establishing our subjective dataset as a benchmark for future research. Our contributions provide a resource for advancing video aspect ratio transformation models, helping ensure that reshaped, mobile-friendly video content retains its quality and meaning. Since our labels resemble video saliency annotations, we also conducted an additional analysis to explore the similarity between our labels and video saliency predictions. Finally, we repurposed state-of-the-art video grounding models for aspect ratio change tasks and fine-tuned them on our dataset. As a service to the research community, we plan to open-source the project.
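The abstract does not describe how the intra-frame temporal filter works or how crops are scored. Purely as an illustration, the following sketch assumes each annotation is a per-frame horizontal crop center for a full-height 9:16 portrait window, and substitutes a plain centered moving average for the authors' filter. It also includes intersection-over-union (IoU), a common way to compare a predicted crop with a human one; its use here is an assumption, not something the abstract states. All function and class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CropBox:
    """Hypothetical crop window in pixels: [x0, x1) by [y0, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float


def portrait_crop_width(frame_h: int, aspect_w: int = 9, aspect_h: int = 16) -> float:
    """Width of a full-height crop with the target width:height aspect ratio."""
    return frame_h * aspect_w / aspect_h


def smooth_centers(centers: list[float], radius: int = 2) -> list[float]:
    """Centered moving average over up to 2*radius + 1 frames.

    This is a stand-in smoother, NOT the paper's intra-frame temporal filter.
    Near the clip boundaries the window is truncated to the frames that exist.
    """
    n = len(centers)
    out = []
    for t in range(n):
        lo, hi = max(0, t - radius), min(n, t + radius + 1)
        window = centers[lo:hi]
        out.append(sum(window) / len(window))
    return out


def centers_to_boxes(centers: list[float], frame_w: int, frame_h: int,
                     aspect_w: int = 9, aspect_h: int = 16) -> list[CropBox]:
    """Turn horizontal crop centers into full-height boxes kept inside the frame."""
    w = min(portrait_crop_width(frame_h, aspect_w, aspect_h), float(frame_w))
    boxes = []
    for cx in centers:
        # Clamp the left edge so the whole window stays within [0, frame_w].
        x0 = min(max(cx - w / 2, 0.0), frame_w - w)
        boxes.append(CropBox(x0, 0.0, x0 + w, float(frame_h)))
    return boxes


def iou(a: CropBox, b: CropBox) -> float:
    """Intersection-over-union of two crop boxes (0.0 if they do not overlap)."""
    iw = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    ih = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = iw * ih
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Made-up centers for a 1920x1080 landscape clip with a one-frame jump.
    raw = [960.0, 960.0, 1200.0, 960.0, 960.0]
    smoothed = smooth_centers(raw, radius=1)
    boxes = centers_to_boxes(smoothed, frame_w=1920, frame_h=1080)
    print(smoothed)
    print(boxes[0])
```

With `radius=1`, the sudden 240-pixel jump in the third frame is spread over three frames (the smoothed centers are `[960.0, 1040.0, 1040.0, 1040.0, 960.0]`), and each 1080-pixel-tall frame gets a 607.5-pixel-wide portrait window. A larger radius trades responsiveness to genuine subject motion for steadier crops.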