Computer science
Video compression picture types
Video editing
Video processing
Inpainting
Computer vision
Artificial intelligence
Video production
Video post-processing
Frame rate
Stylized fact
Video tracking
Frame (networking)
Temporal resolution
Computer graphics (images)
Image (mathematics)
Multimedia
Physics
Quantum mechanics
Telecommunications
Economics
Macroeconomics
Authors
Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri
Source
Journal: Cornell University - arXiv
Date: 2024-01-01
Cited by: 14
Identifier
DOI: 10.48550/arxiv.2401.12945
Abstract
We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
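To make the abstract's central idea concrete, the joint space-time down- and up-sampling that lets one forward pass cover the full clip duration, here is a minimal PyTorch sketch. Everything in it (module names, channel widths, the absence of skip connections, attention layers, and diffusion conditioning) is an illustrative assumption, not the paper's actual Lumiere architecture.

# Hypothetical sketch of a Space-Time U-Net's core mechanism: pooling over
# time as well as space, so the network processes the whole video at a
# coarse temporal scale and then restores the full frame rate on the way up.
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    """3D conv over (T, H, W), optionally followed by joint space-time pooling."""
    def __init__(self, in_ch, out_ch, downsample=True):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.SiLU()
        # Halves T, H, and W together -- unlike models that keep the frame
        # count fixed inside the U-Net and rely on temporal super-resolution.
        self.pool = nn.AvgPool3d(kernel_size=2) if downsample else nn.Identity()

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.pool(self.act(self.conv(x)))

class TinySpaceTimeUNet(nn.Module):
    """Toy encoder/decoder; skip connections and conditioning omitted."""
    def __init__(self, ch=8):
        super().__init__()
        self.down1 = SpaceTimeBlock(3, ch)
        self.down2 = SpaceTimeBlock(ch, 2 * ch)
        # Trilinear upsampling restores frame count and spatial resolution.
        self.up = nn.Upsample(scale_factor=2, mode="trilinear",
                              align_corners=False)
        self.up1 = SpaceTimeBlock(2 * ch, ch, downsample=False)
        self.up2 = SpaceTimeBlock(ch, 3, downsample=False)

    def forward(self, x):
        h = self.down2(self.down1(x))  # coarse space-time representation
        h = self.up(self.up(h))
        return self.up2(self.up1(h))   # full-length, full-rate output

video = torch.randn(1, 3, 16, 64, 64)  # (batch, RGB, frames, height, width)
out = TinySpaceTimeUNet()(video)
print(out.shape)  # torch.Size([1, 3, 16, 64, 64])

Because the temporal axis is downsampled inside the network, the bottleneck sees the entire 16-frame clip at once rather than isolated keyframes, which is the property the abstract credits with improving global temporal consistency.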