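Interpretability
Computer science
Feature (linguistics)
Benchmark (surveying)
Baseline (sea)
Context (archaeology)
Artificial intelligence
Video retrieval
Feature selection
Pattern recognition (psychology)
Fusion
Information retrieval
Paleontology
Philosophy
Linguistics
Oceanography
Geodesy
Geology
Biology
Geography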
Authors
Fan Hu, Aozhu Chen, Ziyue Wang, Fangming Zhou, Jianfeng Dong, Xirong Li
Identifier
DOI:10.1007/978-3-031-19781-9_26
Abstract
In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Unlike previous research that considers feature fusion at only one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing the convex combination of the features is preferred to modeling their correlations by computationally heavy multi-head self attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016–2020) justify LAFF as a new baseline for text-to-video retrieval.
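To make the idea of a convex combination of features concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: it assumes the features have already been projected to a common dimension, and the names `convex_fusion`, `rank_features`, and `attn_vector` are hypothetical. Each feature is scored by a dot product with an attention vector, the scores are passed through a softmax so the weights are non-negative and sum to one, and the fused feature is the weighted sum. The same weights give a simple ranking that could inform feature selection.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def convex_fusion(features, attn_vector):
    """Fuse k same-length feature vectors by a convex combination.

    features:    list of k vectors, each of length d (assumed already
                 projected to a common dimension; an assumption of this sketch)
    attn_vector: hypothetical attention parameters of length d; in a trained
                 model these would be learned
    Returns (fused_vector, weights).
    """
    # One scalar score per feature: dot product with the attention vector.
    scores = [sum(a * x for a, x in zip(attn_vector, f)) for f in features]
    # Softmax yields non-negative weights summing to 1 (a convex combination).
    weights = softmax(scores)
    dim = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


def rank_features(weights):
    """Return feature indices ordered from largest to smallest weight."""
    return sorted(range(len(weights)), key=lambda i: -weights[i])
```

Calling `convex_fusion` with an all-zero `attn_vector` gives every feature the same weight, so the fused vector reduces to the plain average of the inputs. Because only one scalar weight is produced per feature, the cost grows linearly with the number of features, in contrast to self-attention, which models pairwise interactions.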