Computer science
Artificial intelligence
Natural language processing
Bridging (networking)
Graph
Matching (statistics)
Pattern recognition (psychology)
Machine translation
Machine learning
Information retrieval
Theoretical computer science
Computer network
Statistics
Mathematics
Authors
Weixing Mai, Zhengxuan Zhang, Kuntao Li, Yun Xue, Fenghuan Li
Identifier
DOI:10.1109/tcss.2023.3303027
Abstract
Multimodal named entity recognition (MNER) aims to detect named entities and identify their entity types based on texts and attached images; it also generates inputs for other comprehensive tasks, such as multimodal machine translation, visual dialog, and multimodal sentiment analysis. Existing studies have limitations in text-image matching and in reducing multimodal semantic disparity. For one thing, current methods fail to resolve both overall and local text-image matching issues in a self-guided way. For another, the static graphs constructed in MNER models struggle to bridge the semantic gap between different modalities. In this work, a dynamic graph construction framework (DGCF) is proposed to address the above limitations. A similarity vector-based text-image matching inference strategy is designed to obtain the overall and local matching relations between text and image, where the overall matching determines the retained proportion of visual information. Then, a multimodal dynamic graph interaction module is developed. Within each layer of the module, the local matching relations and part-of-speech (POS)-based multihead attention are integrated to construct a dynamic cross-modal graph and a semantic graph. Lastly, a CRF layer is used to predict entity labels. Extensive experiments are performed on two benchmark datasets. The experimental results reveal that our model is a competitive alternative and achieves state-of-the-art performance.
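The abstract's "overall matching determines the retained proportion of visual information" idea can be illustrated with a minimal sketch. All names below are hypothetical and this is not the authors' exact method: a cosine similarity between pooled text and image embeddings is mapped to a gate in [0, 1] that scales the visual features before fusion.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine between pooled text and image embeddings (basis of a
    # similarity-vector matching score); epsilon avoids division by zero.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def gate_visual_features(text_emb, image_emb, image_feats):
    """Scale visual features by an overall text-image matching score.

    A low similarity shrinks the retained proportion of visual
    information, mimicking the self-guided overall matching described
    in the abstract (illustrative only).
    """
    score = cosine_similarity(text_emb, image_emb)
    # Map cosine in [-1, 1] to a gate in [0, 1].
    gate = (score + 1.0) / 2.0
    return gate * image_feats, gate

# Usage: identical text/image embeddings -> gate ~ 1, features fully kept.
text_emb = np.array([1.0, 0.0])
image_emb = np.array([1.0, 0.0])
gated, g = gate_visual_features(text_emb, image_emb, np.ones(3))
```

In the paper the matching is learned end to end rather than computed from a fixed cosine, but the gating pattern (a scalar relevance score modulating how much visual signal enters the graph) is the same in spirit.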