Keywords
Automatic summarization; computer science; task (project management); benchmark (surveying); metric (unit); information retrieval; pattern; consistency (knowledge base); modal verb; filter (signal processing); artificial intelligence; narrative; natural language processing; computer vision; social science; linguistics; operations management; chemistry; philosophy; management; geodesy; sociology; polymer chemistry; economics; geography
Authors
Jie Lin, Hao Hua, Ming Chen, Yikang Li, Jen-Hao Hsiao, Chiuman Ho, Jiebo Luo
Source
Journal: Cornell University - arXiv
Date: 2023-01-01
Citations: 1
Identifier
DOI: 10.48550/arxiv.2303.12060
Abstract
Video summarization aims to distill the most important information from a source video into either an abridged clip or a textual narrative. Traditionally, different methods have been proposed depending on whether the output is a video or text, ignoring the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task: generate both a shortened video clip and the corresponding textual summary from a long video, collectively referred to as a cross-modal summary. The generated video clip and text narrative should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset, VideoXum (X refers to different modalities), by reannotating ActivityNet. After filtering out videos that do not meet the length requirements, 14,001 long videos remain in the new dataset. Each video in the reannotated dataset has human-annotated video summaries and the corresponding narrative summaries. We then design a novel end-to-end model, VTSUM-BLIP, to address the challenges of the proposed task. Moreover, we propose a new metric, VT-CLIPScore, to evaluate the semantic consistency of cross-modal summaries. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.
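The abstract describes VT-CLIPScore as a metric for the semantic consistency between the shortened video clip and the narrative summary, computed in a CLIP-style shared vision-language embedding space. The abstract does not give the formula, so the following is a minimal sketch under stated assumptions: it presumes frame and sentence embeddings from a shared encoder are already available (here just NumPy arrays), and the function name and the symmetric mean-of-max aggregation are illustrative choices, not the authors' definition.

```python
import numpy as np

def clip_style_consistency(frame_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Sketch of a CLIP-style cross-modal consistency score.

    frame_embs: (n_frames, d) embeddings of the summary-video frames.
    text_embs:  (n_sents, d) embeddings of the narrative-summary sentences.
    Both are assumed to come from a shared vision-language embedding space
    (e.g. a CLIP-like encoder); here they are plain arrays for illustration.
    """
    # L2-normalize rows so that dot products become cosine similarities.
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sim = f @ t.T  # (n_frames, n_sents) pairwise cosine similarities

    # Symmetric mean-of-max: match each frame to its best sentence and
    # each sentence to its best frame, then average the two directions.
    return 0.5 * (sim.max(axis=1).mean() + sim.max(axis=0).mean())

# Toy usage with random vectors standing in for a shared 8-d embedding space.
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8))
sents = rng.normal(size=(2, 8))
score = clip_style_consistency(frames, sents)
```

A higher score indicates tighter alignment between the visual and textual summaries; identical embedding sets score exactly 1 under this aggregation.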