Computer science
Artificial intelligence
Inference
Image (mathematics)
Video processing
Computer vision
Image fusion
Pattern recognition (psychology)
Authors
Nuha Aldausari,Arcot Sowmya,Nadine Marcus,Gelareh Mohammadi
Identifier
DOI:10.1007/978-3-031-22695-3_3
Abstract
Generative adversarial networks have produced synthesised results that are indistinguishable from real examples in domains such as image, audio, text and video. While state-of-the-art image models synthesise high-quality, diverse images in many domains, video synthesis is more challenging and suffers from poor generalisation; moreover, the generated videos lack diversity, especially when the network is trained on a limited dataset. In such cases, the model overfits the training examples and performs poorly at inference time. Dataset collection is in general a tedious task, and it is even more challenging for video data due to its size and limited accessibility; creating a video in the first place also requires more time and effort. In this paper, we expand a previously collected video dataset with a supporting image dataset. We then apply a multiscale fusion method to multiple conditioning images to facilitate the generation of diverse video samples. We combine the multiscale fusion model with an audio feature extractor, and the encoded features are fed to a video decoder that generates videos synchronised with the audio signal. We compare our multiscale fusion model with other image fusion models on the Flowers, VGGFace and Animal Faces datasets, and we compare the overall architecture with other audio-to-video models. Both experiments demonstrate the effectiveness of our model over the alternatives under evaluation metrics such as FID, FVD and LPIPS.
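The abstract describes a three-part pipeline: a multiscale fusion module over several conditioning images, an audio feature extractor, and a video decoder driven by both. The following PyTorch sketch illustrates that wiring only; every module name, layer size, and the fusion rule (a simple per-scale mean over the conditioning images) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of the pipeline described in the abstract. All
# architectural choices below are assumptions for illustration.
import torch
import torch.nn as nn

class MultiScaleImageFusion(nn.Module):
    """Encode each conditioning image at several scales and fuse per scale."""
    def __init__(self, channels=(32, 64, 128)):
        super().__init__()
        stages, in_ch = [], 3
        for out_ch in channels:
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1),
                nn.ReLU(inplace=True)))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)

    def forward(self, images):               # images: (B, N, 3, H, W)
        b, n = images.shape[:2]
        x = images.flatten(0, 1)              # (B*N, 3, H, W)
        fused = []
        for stage in self.stages:
            x = stage(x)
            feat = x.view(b, n, *x.shape[1:])
            fused.append(feat.mean(dim=1))    # fuse across the N images
        return fused                          # list of per-scale feature maps

class AudioEncoder(nn.Module):
    """Collapse a mel-spectrogram sequence into one conditioning vector."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mel):                   # mel: (B, T, n_mels)
        _, h = self.rnn(mel)
        return h.squeeze(0)                   # (B, dim)

class VideoDecoder(nn.Module):
    """Upsample the fused image features, modulated by audio, into frames."""
    def __init__(self, img_dim=128, audio_dim=128, frames=16):
        super().__init__()
        self.frames = frames
        self.proj = nn.Linear(audio_dim, img_dim)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(img_dim, 64, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3 * frames, 4, stride=2, padding=1),
            nn.Tanh())

    def forward(self, fused, audio):
        feat = fused[-1]                      # coarsest fused map
        gate = torch.sigmoid(self.proj(audio))[:, :, None, None]
        video = self.up(feat * gate)          # audio gates image features
        b, _, h, w = video.shape
        return video.view(b, self.frames, 3, h, w)

if __name__ == "__main__":
    imgs = torch.randn(2, 4, 3, 64, 64)       # 4 conditioning images each
    mel = torch.randn(2, 100, 80)             # 100 audio frames
    fused = MultiScaleImageFusion()(imgs)
    audio = AudioEncoder()(mel)
    video = VideoDecoder()(fused, audio)
    print(video.shape)                        # torch.Size([2, 16, 3, 32, 32])
```

Averaging per-scale features across the conditioning images is just one plausible fusion rule; the point of the sketch is that fusion happens at multiple resolutions before the audio-conditioned decoding step.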