Computer Science
Artificial Intelligence
Computer Vision
Authors
Rui Shao, Tianxing Wu, Jianlong Wu, Liqiang Nie, Ziwei Liu
Identifier
DOI: 10.1109/TPAMI.2024.3367749
Abstract
Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. While various deepfake detection and text fake news detection methods have been proposed, they are only designed for single-modality forgery based on binary classification, let alone analyzing and reasoning subtle forgery traces across different modalities. In this paper, we highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM^4). DGM^4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content, which requires deeper reasoning of multi-modal media manipulation. To support a large-scale investigation, we construct the first DGM^4 dataset. Moreover, we propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities. HAMMER performs 1) manipulation-aware contrastive learning between two uni-modal encoders as shallow manipulation reasoning, and 2) modality-aware cross-attention by a multi-modal aggregator as deep manipulation reasoning. Dedicated manipulation detection and grounding heads are integrated from shallow to deep levels based on the interacted multi-modal information. To exploit more fine-grained contrastive learning for cross-modal semantic alignment, we further integrate a Manipulation-Aware Contrastive Loss with Local View and construct a more advanced model, HAMMER++. Finally, we build an extensive benchmark and set up rigorous evaluation metrics for this new research problem. Comprehensive experiments demonstrate the superiority of HAMMER and HAMMER++; several valuable observations are also revealed to facilitate future research in multi-modal media manipulation.
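
To make the two-level reasoning concrete, below is a minimal PyTorch-style sketch of the pipeline the abstract describes: manipulation-aware contrastive alignment between the outputs of two uni-modal encoders (shallow reasoning), followed by a cross-attention aggregator feeding detection and grounding heads (deep reasoning). All class names, dimensions, and the choice of nn.TransformerDecoder as the aggregator are illustrative assumptions rather than the authors' released implementation; the full model also includes an image (bounding-box) grounding head and, in HAMMER++, the Local View contrastive term, both omitted here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowReasoning(nn.Module):
    # Manipulation-aware contrastive alignment between uni-modal embeddings.
    def __init__(self, dim=256, temperature=0.07):
        super().__init__()
        self.temperature = temperature
        self.img_proj = nn.Linear(dim, dim)
        self.txt_proj = nn.Linear(dim, dim)

    def forward(self, img_cls, txt_cls):
        # Project the [CLS]-style embeddings into a shared space and normalize.
        z_i = F.normalize(self.img_proj(img_cls), dim=-1)
        z_t = F.normalize(self.txt_proj(txt_cls), dim=-1)
        # Batch similarity matrix; matched pairs on the diagonal are positives.
        logits = z_i @ z_t.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric image-to-text and text-to-image InfoNCE loss.
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2

class DeepReasoning(nn.Module):
    # Modality-aware cross-attention aggregator with detection/grounding heads.
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        block = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.aggregator = nn.TransformerDecoder(block, num_layers=layers)
        self.detect_head = nn.Linear(dim, 2)  # real/fake binary detection
        self.token_head = nn.Linear(dim, 2)   # per-token manipulation grounding

    def forward(self, txt_tokens, img_tokens):
        # Text tokens (queries) attend to image patch tokens (keys/values).
        fused = self.aggregator(tgt=txt_tokens, memory=img_tokens)
        return self.detect_head(fused[:, 0]), self.token_head(fused)

# Usage with random stand-ins for the encoder outputs (batch 4, 256-d tokens):
img_tokens = torch.randn(4, 197, 256)  # e.g. ViT patch tokens, [CLS] first
txt_tokens = torch.randn(4, 32, 256)   # e.g. BERT token embeddings
loss = ShallowReasoning()(img_tokens[:, 0], txt_tokens[:, 0])
binary_logits, token_logits = DeepReasoning()(txt_tokens, img_tokens)

The shallow stage only aligns global embeddings, so it can flag cross-modal inconsistency cheaply; the deep stage exchanges token-level information, which is what makes grounding of the specific manipulated words (and, in the paper, image regions) possible.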