Authors
Mingyao Zhou, Wenjing Chen, Hao Sun, Wei Xie
Identifier
DOI:10.1109/icassp48485.2024.10445735
Abstract
Since the goals of both Moment Retrieval (MR) and Highlight Detection (HD) are to quickly obtain the content a user needs from a video, several works have attempted to exploit the commonality between the two tasks to design transformer-based networks for joint MR and HD. Although these methods achieve impressive performance, they still face several problems: a) semantic gaps across different modalities; b) varying durations of query-relevant moments and highlights; c) smooth transitions among diverse events. To this end, we propose a Cross-modal Multiscale Difference-aware Network, named CMDNet. First, a clip-text alignment module is constructed to narrow the semantic gaps between modalities. Second, a multiscale difference perception module mines the differential information between adjacent clips and performs multiscale modeling to obtain discriminative representations. Finally, these representations are fed into the MR and HD task heads to retrieve relevant moments and estimate highlight scores precisely. Extensive experiments on three popular datasets demonstrate that CMDNet achieves state-of-the-art performance.
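The core idea of the multiscale difference perception module — taking differences between adjacent clip features at several temporal scales to expose event boundaries of different durations — can be illustrated with a minimal sketch. This is a hypothetical simplification in plain Python, not the authors' actual network; the function name, the pooling/upsampling scheme, and the scale set are our own assumptions.

```python
def multiscale_difference_features(clip_feats, scales=(1, 2, 4)):
    """Hypothetical sketch of multiscale difference perception.

    clip_feats: list of T feature vectors (lists of floats), one per video clip.
    For each temporal scale s: average-pool features over windows of s clips,
    take differences between adjacent pooled clips, upsample back to length T,
    and concatenate the per-scale difference features for every clip.
    """
    T, D = len(clip_feats), len(clip_feats[0])
    outputs = []
    for s in scales:
        n = T // s  # number of pooled clips at this scale (remainder truncated)
        # average-pool each window of s consecutive clips
        pooled = [
            [sum(clip_feats[i * s + j][d] for j in range(s)) / s for d in range(D)]
            for i in range(n)
        ]
        # difference between adjacent pooled clips; zero-pad the first step
        diff = [[0.0] * D] + [
            [pooled[i][d] - pooled[i - 1][d] for d in range(D)] for i in range(1, n)
        ]
        # nearest-neighbour upsample back to T clips
        up = []
        for row in diff:
            up.extend([row] * s)
        while len(up) < T:
            up.append(up[-1])
        outputs.append(up[:T])
    # concatenate all scales per clip: (T, D * len(scales))
    return [sum((outputs[k][t] for k in range(len(scales))), []) for t in range(T)]
```

For a linearly increasing feature sequence, small scales capture frame-to-frame change while larger scales smooth over short fluctuations and emphasize longer events — which is the intuition behind using such representations for both moment retrieval and highlight scoring.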