Keywords
Computer science, Transformer, Encoder, Artificial intelligence, Leverage (statistics), Correlation, Pattern, Architecture, Pattern recognition (psychology), Natural language processing, Mathematics, Engineering, Geometry, Operating systems, Social science, Visual arts, Electrical engineering, Voltage, Sociology, Art
Authors
Pengfei Wei, H. F. Ouyang, Qintai Hu, Bi Zeng, Guang Feng, Qingpeng Wen
Identifier
DOI:10.1145/3652583.3658097
Abstract
Multimodal Named Entity Recognition (MNER) aims to leverage visual information to identify entity boundaries and categories in social media posts. Existing methods mainly adopt heterogeneous architectures, with ResNet (CNN-based) and BERT (Transformer-based) dedicated to modeling visual and textual features, respectively. However, current approaches still face two issues: (1) weak cross-modal correlations and poor semantic consistency, and (2) suboptimal fusion when visual objects and textual entities are inconsistent. To this end, we propose VEC-MNER, a Hybrid Transformer with Visual-Enhanced Cross-Modal Multi-level Interaction for MNER. Specifically, in contrast to heterogeneous architectures, we propose a new homogeneous Hybrid Transformer architecture, which naturally reduces heterogeneity between modalities. Moreover, we design a Correlation-Aware Alignment (CAA-Encoder) layer and a Correlation-Aware Deep Fusion (CADF-Encoder) layer, combined with contrastive learning, to achieve more effective implicit alignment and deep semantic fusion between modalities, respectively. We also construct a Correlation-Aware (CA) module that effectively reduces inter-modal heterogeneity and alleviates visual deviation. Experimental results demonstrate that our approach achieves state-of-the-art performance, with F1-scores of 74.89% and 87.51% on Twitter-2015 and Twitter-2017, respectively.
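The abstract gives no implementation details, so the following is only a rough, hypothetical PyTorch sketch of how a homogeneous hybrid pipeline of this kind could be wired together. The class names (VECMNERSketch, CorrelationAwareGate), layer counts, embedding dimensions, and the InfoNCE-style contrastive term are stand-ins for the paper's CAA-Encoder, CADF-Encoder, CA module, and contrastive objective; they are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of a VEC-MNER-style pipeline (not the authors' implementation).
# All module names and hyperparameters below are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrelationAwareGate(nn.Module):
    """Assumed CA module: down-weights visual tokens by a learned text-visual correlation score."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, text_repr, vis_tokens):
        # text_repr: (B, D) pooled text; vis_tokens: (B, Nv, D)
        expanded = text_repr.unsqueeze(1).expand(-1, vis_tokens.size(1), -1)
        gate = torch.sigmoid(self.score(torch.cat([expanded, vis_tokens], dim=-1)))
        return gate * vis_tokens  # suppress visually irrelevant regions

class VECMNERSketch(nn.Module):
    def __init__(self, vocab_size, num_labels, dim=256, patch_dim=768):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, dim)   # stand-in for a BERT text encoder
        self.vis_proj = nn.Linear(patch_dim, dim)        # stand-in for visual backbone features
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.caa_encoder = nn.TransformerEncoder(layer, num_layers=2)   # alignment stage
        self.cadf_encoder = nn.TransformerEncoder(layer, num_layers=2)  # deep fusion stage
        self.ca_gate = CorrelationAwareGate(dim)
        self.classifier = nn.Linear(dim, num_labels)

    def forward(self, token_ids, vis_feats):
        t = self.text_emb(token_ids)          # (B, Nt, D)
        v = self.vis_proj(vis_feats)          # (B, Nv, D)
        # Both modalities pass through the same (homogeneous) Transformer stack.
        aligned = self.caa_encoder(torch.cat([t, v], dim=1))
        t_al, v_al = aligned[:, :t.size(1)], aligned[:, t.size(1):]
        v_gated = self.ca_gate(t_al.mean(dim=1), v_al)
        fused = self.cadf_encoder(torch.cat([t_al, v_gated], dim=1))
        logits = self.classifier(fused[:, :t.size(1)])  # per-token entity logits
        # InfoNCE-style contrastive loss pulling paired text/image representations together.
        t_pool = F.normalize(t_al.mean(dim=1), dim=-1)
        v_pool = F.normalize(v_al.mean(dim=1), dim=-1)
        sim = t_pool @ v_pool.t() / 0.07
        targets = torch.arange(sim.size(0), device=sim.device)
        contrastive = F.cross_entropy(sim, targets)
        return logits, contrastive

# Example usage with random tensors standing in for token ids and visual patch features:
# model = VECMNERSketch(vocab_size=30522, num_labels=9)
# logits, c_loss = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 49, 768))
```

Routing both modalities through the same Transformer stack mirrors the homogeneous-architecture idea described in the abstract: text and visual tokens share one representation space, so the alignment and fusion stages operate on comparable features rather than bridging a CNN/BERT gap.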