Computer science
Information retrieval
Granularity
Video retrieval
Benchmark (surveying)
Schema (genetic algorithms)
Frame (networking)
Selection (genetic algorithm)
Artificial intelligence
Natural language processing
Telecommunications
Geodesy
Geography
Operating system
Authors
L. Chen,Zhen Deng,Libo Liu,Shibai Yin
Identifier
DOI:10.1109/tcsvt.2024.3360530
Abstract
Video–text cross-modal retrieval (VTR) is more natural and challenging than image–text retrieval, and it has attracted increasing interest from researchers in recent years. To align VTR more closely with real-world scenarios, i.e., a weak semantic text description as the query, we propose a multilevel semantic interaction alignment (MSIA) model. We develop a two-stream network that decomposes video–text alignment into multiple dimensions. Specifically, in the video stream, to better align heterogeneous data, redundant video information is suppressed via the designed frame adaptation attention mechanism, and richer semantic interaction is achieved through a text-guided attention mechanism. Then, for text alignment in local video regions, we design a distinctive anchor frame strategy and a word selection method. Finally, a cross-granularity alignment approach is designed to learn more and finer-grained semantic features. With the above scheme, the alignment between video and weak semantic text descriptions is reinforced, further alleviating the alignment difficulties caused by weak semantic text descriptions. Experimental results on VTR benchmark datasets show the competitive performance of our approach in comparison with state-of-the-art methods. The code is available at: https://github.com/jiaranjintianchism/MSIA.
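The abstract's text-guided attention idea (using the sentence embedding to weight video frames so that frames relevant to the query dominate the pooled video representation) can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, shapes, and scaled dot-product scoring are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_frame_attention(frame_feats, text_feat):
    """Hypothetical sketch: pool frame features under text guidance.

    frame_feats: (num_frames, dim) per-frame embeddings
    text_feat:   (dim,) sentence embedding acting as the query
    Returns a (dim,) video representation and the frame weights.
    """
    dim = frame_feats.shape[1]
    # Scaled dot-product similarity between the text query and each frame.
    scores = frame_feats @ text_feat / np.sqrt(dim)
    weights = softmax(scores)            # attention distribution over frames
    video_repr = weights @ frame_feats   # text-conditioned weighted pooling
    return video_repr, weights
```

Frames similar to the text query receive higher weights, so redundant or irrelevant frames contribute less to the final video embedding, which is the intuition behind suppressing redundant video information.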