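Computer science
Embedding
Probabilistic logic
Encoder
Information retrieval
Matching (statistics)
Semantics (computer science)
Coding (set theory)
Artificial intelligence
Modal verb
Natural language processing
Programming language
Operating system
Statistics
Set (abstract data type)
Chemistry
Polymer chemistry
Mathematics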
Authors
Bo Fang, Wenhao Wu, Chang Liu, Yu Zhou, Yuxin Song, Weiping Wang, Xiangbo Shu, Xiangyang Ji, Jingdong Wang
Identifier
DOI:10.1109/iccv51070.2023.01262
Abstract
With the explosive growth of web videos and emerging large-scale vision-language pre-training models, e.g., CLIP, retrieving videos of interest with text instructions has attracted increasing attention. A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities in specific granularities for semantic correspondence. Unfortunately, the intrinsic uncertainties of optimal entity combinations in appropriate granularities for cross-modal queries are understudied, which is especially critical for modalities with hierarchical semantics, e.g., video, text, etc. In this paper, we propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each lookup as a distribution matching procedure. Concretely, we add additional learnable tokens in the encoders to adaptively aggregate multi-grained semantics for flexible high-level reasoning. In the refined embedding space, we represent text-video pairs as probabilistic distributions where prototypes are sampled for matching evaluation. Comprehensive experiments on four benchmarks justify the superiority of our UATVR, which achieves new state-of-the-art results on MSR-VTT (50.8%), VATEX (64.5%), MSVD (49.7%), and DiDeMo (45.8%). The code is available at https://github.com/bofang98/UATVR.
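The idea of treating each text-video lookup as distribution matching can be illustrated with a small sketch. This is not the UATVR implementation (see the repository linked in the abstract for that); it is only a toy under loudly stated assumptions: each modality is modeled as a diagonal Gaussian given by a hypothetical `(mean, std)` pair, prototypes are drawn as `mean + std * noise`, and the match score is the average cosine similarity over all sampled prototype pairs. The function names, the prototype count `k`, and the averaging rule are all illustrative choices, not details taken from the paper.

```python
import math
import random


def sample_prototypes(mean, std, k, rng):
    """Draw k samples from a diagonal Gaussian as mean + std * noise."""
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
        for _ in range(k)
    ]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def distribution_similarity(text, video, k=8, seed=0):
    """Average cosine similarity over all pairs of sampled prototypes.

    `text` and `video` are hypothetical (mean, std) tuples; the averaging
    rule is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    text_protos = sample_prototypes(*text, k, rng)
    video_protos = sample_prototypes(*video, k, rng)
    scores = [cosine(t, v) for t in text_protos for v in video_protos]
    return sum(scores) / len(scores)


# Toy 3-dimensional embeddings: (mean, std) per modality.
query = ([1.0, 0.0, 0.5], [0.1, 0.1, 0.1])
matching_video = ([0.9, 0.1, 0.6], [0.1, 0.1, 0.1])
unrelated_video = ([-1.0, 0.8, -0.4], [0.1, 0.1, 0.1])

score_match = distribution_similarity(query, matching_video)
score_other = distribution_similarity(query, unrelated_video)
print(score_match > score_other)
```

In this toy setup, a larger `std` spreads the sampled prototypes further from the mean, so the score reflects uncertainty in the embedding rather than a single point estimate. With `std` set to zero the score reduces to the ordinary cosine similarity between the two means.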