Adobe Research | Mar 10, 2026 | arXiv:2603.09072

A Text-Native Interface for Generative Video Authoring

Xingyu Bruce Liu, Mira Dontcheva, Dingzeyu Li

AI Summary

This paper introduces Doki, a text-native interface for generative video authoring that aims to make video creation as intuitive as writing. In Doki, users define assets, structure scenes, create shots, refine edits, and add audio directly within a single text document. The authors demonstrate its capabilities through a series of examples and evaluate real-world use in a week-long deployment study with participants of varying video-authoring expertise.

Key Contribution

Imagine writing a script that *is* the video editor: Doki lets you do just that, turning text into a powerful interface for generative video authoring.

Abstract

Everyone can write their stories in freeform text format -- it's something we all learn in school. Yet storytelling via video requires one to learn specialized and complicated tools. In this paper, we introduce Doki, a text-native interface for generative video authoring, aligning video creation with the natural process of text writing. In Doki, writing text is the primary interaction: within a single document, users define assets, structure scenes, create shots, refine edits, and add audio. We articulate the design principles of this text-first approach and demonstrate Doki's capabilities through a series of examples. To evaluate its real-world use, we conducted a week-long deployment study with participants of varying expertise in video authoring. This work contributes a fundamental shift in generative video interfaces, demonstrating a powerful and accessible new way to craft visual stories.

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 28

Year: 2026

Venue: N/A
