Keywords: Computer science; Artificial intelligence; Natural language processing; Semantics (computer science); Phrase; Benchmark
Authors
Guangli Wu,Zhijun Yang,Jing Zhang
Abstract
Temporal sentence grounding in videos (TSGV) aims to retrieve the segment of an untrimmed video that semantically matches a given query. Most previous methods learn either local or global query features and then perform cross-modal interaction, ignoring the complementarity between the two. In this paper, we propose a novel Multi-Level Interaction Network for temporal sentence grounding in videos. The network explores query semantics at both the phrase and sentence levels: phrase-level features interact with video features to highlight video segments relevant to the query phrases, while sentence-level features interact with video features to capture global localization information. We further design a stacked fusion gate module that effectively captures the temporal relationships and semantic information among video segments. This module introduces a gating mechanism that lets the model adaptively regulate the degree of fusion between video features and query features, further improving the accuracy of target-segment prediction. Extensive experiments on the ActivityNet Captions and Charades-STA benchmark datasets demonstrate that the proposed method outperforms state-of-the-art methods.
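The gating idea in the abstract can be illustrated with a minimal sketch. The function below implements one common form of gated fusion: a sigmoid gate computed from both modalities controls a convex combination of a video feature vector and a query feature vector. All names (`gated_fusion`, `w_v`, `w_q`, `b`) and the scalar-gate formulation are illustrative assumptions, not the authors' exact architecture.

```python
import math

def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(video_feat, query_feat, w_v, w_q, b):
    """Illustrative gated fusion (not the paper's exact design):
    a scalar gate g = sigmoid(w_v . v + w_q . q + b) decides how much
    query information flows into the fused representation."""
    g = sigmoid(
        sum(wv * v for wv, v in zip(w_v, video_feat))
        + sum(wq * q for wq, q in zip(w_q, query_feat))
        + b
    )
    # Convex combination of the two modalities, weighted by the gate.
    return [g * v + (1.0 - g) * q for v, q in zip(video_feat, query_feat)]
```

With zero weights and zero bias the gate is 0.5, so the fused vector is the average of the two inputs; a large positive bias drives the gate toward 1, letting the video features dominate. In practice such a gate would be applied per segment with learned projection matrices.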