This paper introduces Doki, a text-native interface for generative video authoring that aims to make video creation as intuitive as writing. Doki lets users define assets, structure scenes, create shots, refine edits, and add audio directly within a single text document. The authors evaluated its real-world use in a week-long deployment study with participants of varying expertise in video authoring.
Imagine writing a script that *is* the video editor: Doki lets you do just that, turning text into a powerful interface for generative video authoring.
Everyone can write their stories in freeform text; it's something we all learn in school. Yet storytelling via video requires one to learn specialized and complicated tools. In this paper, we introduce Doki, a text-native interface for generative video authoring, aligning video creation with the natural process of text writing. In Doki, writing text is the primary interaction: within a single document, users define assets, structure scenes, create shots, refine edits, and add audio. We articulate the design principles of this text-first approach and demonstrate Doki's capabilities through a series of examples. To evaluate its real-world use, we conducted a week-long deployment study with participants of varying expertise in video authoring. This work contributes a fundamental shift in generative video interfaces, demonstrating a powerful and accessible new way to craft visual stories.