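Computer science
Closed captioning
Transformer
Artificial intelligence
Encoder
Sentence
Natural language processing
Image (mathematics)
Quantum mechanics
Operating system
Physics
Voltage
Authors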
Dizhan Xue,Shengsheng Qian,Quan Fang,Changsheng Xu
Identifier
DOI:10.1145/3503161.3548022
Abstract
Image-guided Story Ending Generation (IgSEG) is a recently proposed form of story generation: given a multi-sentence story plot and an ending-related image, the task is to generate a story ending. Unlike existing image captioning or story ending generation tasks, IgSEG aims to generate a factual description that conforms to both the contextual logic and the relevant visual concepts. Existing methods for IgSEG ignore the relationships between the multimodal information and do not integrate multimodal features appropriately. In this work, we therefore propose Multimodal Memory Transformer (MMT), an end-to-end framework that models and fuses contextual and visual information to capture the multimodal dependency for IgSEG. First, we extract textual and visual features separately with modality-specific large-scale pretrained encoders. Second, we use a memory-augmented cross-modal attention network to learn cross-modal relationships and perform fine-grained feature fusion. Finally, a multimodal transformer decoder constructs attention among the multimodal features to learn the story dependency and generates informative, reasonable, and coherent story endings. In experiments on two benchmark datasets, extensive automatic and human evaluation results indicate that the proposed MMT significantly outperforms state-of-the-art methods.
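The abstract does not spell out how the memory-augmented cross-modal attention is built, so the following is only a minimal sketch of one common reading of that idea, not the authors' implementation. It assumes that text features act as queries, that image features act as keys and values, and that a few extra memory slots are appended to those keys and values. The function names, the dimensions, and the random stand-in features are all hypothetical; in a trained model the features would come from the pretrained encoders and the memory slots would be learned parameters.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_cross_attention(text_feats, image_feats, mem_keys, mem_values):
    """Each text vector attends over image vectors plus memory slots.

    Assumption: memory slots are extra key/value vectors appended to the
    image keys/values (scaled dot-product attention, single head).
    """
    keys = image_feats + mem_keys      # image regions first, then memory slots
    values = image_feats + mem_values
    scale = math.sqrt(len(keys[0]))
    fused = []
    for query in text_feats:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * value[d] for w, value in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return fused


def random_vectors(count, dim, rng):
    """Hypothetical stand-in for encoder outputs or learned parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
dim = 8
text_feats = random_vectors(4, dim, rng)   # stand-in for text encoder output
image_feats = random_vectors(6, dim, rng)  # stand-in for image encoder output
mem_keys = random_vectors(2, dim, rng)     # hypothetical memory slots
mem_values = random_vectors(2, dim, rng)

fused = memory_cross_attention(text_feats, image_feats, mem_keys, mem_values)
print(len(fused), len(fused[0]))  # one fused vector per text token: 4 8
```

Under these assumptions, each fused vector is a weighted mix of image features and memory values, so the memory slots give the attention a place to store information that is not present in the image itself. A decoder would then attend over such fused features when generating the ending.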