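Adversarial system
Computer science
Modal verb
Frequency domain
Domain (mathematical analysis)
Artificial intelligence
Computer vision
Mathematics
Mathematical analysis
Chemistry
Polymer chemistry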
Authors
Yaguan Qian, Qiang Yu, Qiqi Bao, Shouling Ji, Wei Wang, Bin Wang, Zhaoquan Gu, Zhen Lei
Identifier
DOI:10.1109/tdsc.2025.3601232
Abstract
Vision-language pretraining (VLP) models have demonstrated outstanding performance in image-text understanding tasks but remain highly susceptible to transferable adversarial attacks. While ensemble-based guided attacks improve adversarial transferability by increasing the diversity of image-text pairs, they primarily rely on spatial-domain data augmentation, which can lead to overfitting to image details and limit the generalization capability of attacks. To address this limitation, this study proposes a frequency-domain adjustment-based adversarial attack method that modifies specific frequency components of input images to reduce detail interference and enhance the stability of adversarial examples. Additionally, a fine-grained feature extraction technique is introduced to optimize image-text alignment, further improving the transferability of cross-modal attacks. Experimental results demonstrate that the proposed method achieves superior attack transferability and generalization performance across two major VLP architectures, fusion models and alignment models, as well as multiple tasks on the Flickr30K and MSCOCO datasets.
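The abstract does not say which frequency components the method modifies or how. As a rough illustration only of what adjusting frequency components of an image can look like, the sketch below applies a generic low-pass filter with NumPy's FFT: it keeps a central block of low frequencies and zeroes the rest, which suppresses fine detail. The function name `low_pass_filter` and the parameter `keep_ratio` are hypothetical and are not taken from the paper; this is not the authors' attack.

```python
import numpy as np


def low_pass_filter(image: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Keep only the lowest spatial frequencies of a 2-D grayscale image.

    Hypothetical helper for illustration; keep_ratio is an assumed knob
    giving the fraction of each frequency axis that is retained.
    """
    h, w = image.shape
    # Move the zero-frequency (DC) term to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    cy, cx = h // 2, w // 2
    ry = max(1, int(h * keep_ratio / 2))
    rx = max(1, int(w * keep_ratio / 2))
    # Boolean mask selecting a block of low frequencies around the centre.
    mask = np.zeros((h, w), dtype=bool)
    mask[cy - ry: cy + ry + 1, cx - rx: cx + rx + 1] = True
    filtered = np.where(mask, spectrum, 0)
    # Undo the shift and return to the spatial domain.
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((32, 32))
    smooth = low_pass_filter(img, keep_ratio=0.25)
    print(img.std(), smooth.std())  # the filtered image varies less
```

A perturbation computed on the filtered image would depend less on pixel-level texture, which is the intuition the abstract gives for reducing detail interference.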