Computer science
Filter (signal processing)
Modal verb
Key (lock)
Modality (human–computer interaction)
Artificial intelligence
Similarity (geometry)
Pattern recognition (psychology)
Margin (machine learning)
Machine learning
Computer vision
Image (mathematics)
Computer security
Chemistry
Polymer chemistry
Authors
Yongle Huang, Zedong Liu, Shijie Sun, Ningning Cui, Jianxin Li
Identifier
DOI: 10.1109/TNNLS.2025.3577292
Abstract
Effectively bridging the gap between the visual and textual modalities has long been a key challenge in cross-modal retrieval. Fine-grained matching approaches improve performance by precisely aligning salient region features in the visual modality with word embeddings in the textual modality. However, effectively and efficiently filtering out irrelevant features (e.g., irrelevant background regions and non-meaningful prepositions) in both modalities remains a significant challenge. Furthermore, capturing key cross-modal relationships while minimizing misalignment interference is crucial for effective cross-modal retrieval. In this work, we propose a novel approach called the selective filter and alignment network (SFAN) to tackle these challenges. First, we propose modality-specific selective filter modules (SFMs) that selectively and implicitly filter out redundant information within each modality. We then propose a state-space model (SSM)-based selective alignment module (SAM) that selectively captures key correspondences and reduces the disturbance from irrelevant associations. Finally, we apply a fusion operation to combine the embeddings from the SFM and SAM into the final embeddings used for similarity computation. Extensive experiments on the Flickr30k, MS-COCO, and MSR-VTT datasets show that SFAN learns robust patterns and outperforms state-of-the-art (SOTA) cross-modal retrieval methods by a wide margin.
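To make the three-stage pipeline in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation. The gating-based filter, the cross-attention stand-in for the SSM-based alignment (the paper's SAM uses state-space models, which are not reproduced here), the averaging fusion, and all names and dimensions are illustrative assumptions.

```python
# Hypothetical SFAN-style pipeline sketch; module designs are assumptions,
# not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveFilterModule(nn.Module):
    """Assumed SFM: a learned sigmoid gate that softly suppresses
    irrelevant region/word features within one modality."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x):            # x: (batch, tokens, dim)
        return x * self.gate(x)      # implicit, differentiable filtering

class SelectiveAlignmentModule(nn.Module):
    """Stand-in for the SSM-based SAM: plain cross-attention is used here
    as a proxy for selective cross-modal alignment."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query, context):
        out, _ = self.attn(query, context, context)
        return out

class SFANSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.sfm_img = SelectiveFilterModule(dim)   # modality-specific SFMs
        self.sfm_txt = SelectiveFilterModule(dim)
        self.sam = SelectiveAlignmentModule(dim)

    def forward(self, img_feats, txt_feats):
        # 1) filter redundant information within each modality
        v = self.sfm_img(img_feats)
        t = self.sfm_txt(txt_feats)
        # 2) selectively align each modality against the other
        v_aligned = self.sam(v, t)
        t_aligned = self.sam(t, v)
        # 3) fuse filtered and aligned embeddings, pool, and normalize
        v_emb = F.normalize((v + v_aligned).mean(dim=1), dim=-1)
        t_emb = F.normalize((t + t_aligned).mean(dim=1), dim=-1)
        return v_emb @ t_emb.T       # cosine similarity matrix

model = SFANSketch()
sims = model(torch.randn(8, 36, 256),   # e.g., 36 region features per image
             torch.randn(8, 20, 256))   # e.g., 20 word embeddings per caption
print(sims.shape)                        # (8, 8) image-text similarities
```

In retrieval, each row/column of the similarity matrix would be ranked to retrieve captions for an image or images for a caption; a contrastive or triplet loss over this matrix would be the usual training objective.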