This paper introduces a data-driven approach for generating realistic head motion conditioned on gaze, using a conditional Variational Autoencoder (VAE) trained on a large dataset of in-the-wild facial videos. The authors developed an automatic pipeline to extract gaze and head motion data, enabling the VAE to learn the probabilistic correlation and temporal dynamics between the two. Human evaluation and quantitative comparisons indicate that the generated head motions are perceived as more natural and realistic than those of baseline methods, which in turn improves gaze-controlled facial video generation.
Realistic head movements that naturally complement a given gaze can now be generated automatically, thanks to a new data-driven approach learned from in-the-wild facial videos.
We present the first data-driven approach to model temporal gaze-head coordination from large-scale in-the-wild facial videos. To obtain training data for generalizable learning, we propose an automatic pipeline that extracts natural yet diverse gaze and head motions with off-the-shelf appearance-based gaze estimators. To capture the probabilistic correlation and temporal dynamics of gaze-head coordination, we build our model on a generative conditional Variational Autoencoder for plausible yet diverse gaze-conditioned head motion generation. We further apply our framework to gaze-controlled facial video generation, where we enable video generation with natural and realistic head motion correlated to the input gaze, an aspect that has not been emphasized before. Human evaluation and quantitative comparisons demonstrate our method's effectiveness and validate our design choices, with evaluators showing a statistically significant preference for our approach over baseline methods.
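For readers unfamiliar with conditional VAEs, the objective such a model typically optimizes can be sketched as follows. This is the standard conditional evidence lower bound, not necessarily the exact loss used in the paper, which the abstract does not state. The symbols are assumptions introduced here for illustration: $g$ is the input gaze sequence, $h$ the head motion sequence, $z$ a latent variable, $q_\phi$ the encoder, and $p_\theta$ the decoder and conditional prior.

```latex
% Standard conditional VAE lower bound (illustrative; symbols g, h, z are assumed names)
\log p_\theta(h \mid g) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid h, g)}\!\left[ \log p_\theta(h \mid z, g) \right]
  \;-\; \mathrm{KL}\!\left( q_\phi(z \mid h, g) \,\big\|\, p_\theta(z \mid g) \right)
```

Under this formulation, generation draws $z \sim p_\theta(z \mid g)$ and decodes $h$ from $p_\theta(h \mid z, g)$, so different samples of $z$ yield different head motions for the same gaze input. That is one way to obtain the "plausible yet diverse" behavior the abstract describes.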