Computer science
Artificial intelligence
Graphics
Identification (biology)
Convolutional neural network
Pattern recognition (psychology)
Natural language processing
Information retrieval
Theoretical computer science
Botany
Biology
Authors
Guang Han,Min Lin,Ziyang Li,Haitao Zhao,Sam Kwong
Identifier
DOI:10.1109/tmm.2023.3344354
Abstract
Text-to-image person re-identification (ReID) is a common subproblem in the fields of person re-identification and image-text retrieval. Recent approaches generally follow a dual-stream network structure, extracting image and text features separately. In this design there is no deep interaction between images and text, which makes it difficult for the network to learn highly semantic feature representations. In addition, for both image and text data, feature extraction is modeled in a rigid way, for example by using a Transformer to extract sequence embeddings; this type of modeling disregards the inherent relationships among multimodal input embeddings. A more flexible approach to mining multimodal data is proposed, which uniformly treats the data as graphs. In this way, the extraction and interaction of multimodal information are accomplished through message passing between graph nodes. First, a unified multimodal feature extraction and fusion network based on the graph convolutional network is proposed, which enables multimodal information to progress from 'local' to 'global'. Second, an asymmetric multilevel alignment module, which focuses on more accurate 'local' information from a 'global' perspective, is proposed to progressively divide the multimodal information at each level. Last, a cross-modal representation matching strategy based on similarity distribution and mutual information is proposed to achieve cross-modal alignment. The proposed algorithm is simple and efficient, and test results on three public datasets (CUHK-PEDES, ICFG-PEDES and RSTPReID) show that it achieves SOTA-level performance.
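The abstract describes two ideas that a small sketch can make concrete: fusing image-patch and text-token embeddings by message passing over a joint graph, and aligning the two modalities by matching their similarity distributions. The PyTorch snippet below is a minimal illustration under assumed design choices (a dense similarity-based adjacency, an identity target distribution for in-batch pairs); the class and function names are hypothetical and are not the authors' implementation.

```python
# Hypothetical sketch of graph-based cross-modal fusion and a
# similarity-distribution matching loss. Names and design choices
# (CrossModalGraphLayer, similarity_distribution_loss, dense adjacency,
# identity target) are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalGraphLayer(nn.Module):
    """One message-passing step over a joint image/text node graph."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, img_nodes: torch.Tensor, txt_nodes: torch.Tensor):
        # img_nodes: (B, Ni, D) patch embeddings; txt_nodes: (B, Nt, D) token embeddings.
        nodes = torch.cat([img_nodes, txt_nodes], dim=1)  # (B, Ni+Nt, D)
        # Dense adjacency from pairwise similarity (a simple fully connected choice).
        adj = torch.softmax(
            nodes @ nodes.transpose(1, 2) / nodes.size(-1) ** 0.5, dim=-1
        )
        # Aggregate neighbor messages, transform, and add a residual connection.
        nodes = nodes + F.relu(self.proj(adj @ nodes))
        ni = img_nodes.size(1)
        return nodes[:, :ni], nodes[:, ni:]


def similarity_distribution_loss(img_feat, txt_feat, tau: float = 0.07):
    """KL-style matching of image-to-text and text-to-image similarity distributions."""
    img_feat = F.normalize(img_feat, dim=-1)   # (B, D) global image features
    txt_feat = F.normalize(txt_feat, dim=-1)   # (B, D) global text features
    sim = img_feat @ txt_feat.t() / tau        # (B, B) similarity logits
    log_p_i2t = F.log_softmax(sim, dim=1)
    log_p_t2i = F.log_softmax(sim.t(), dim=1)
    # With an in-batch one-to-one pairing, the target distribution is the identity.
    target = torch.eye(sim.size(0), device=sim.device)
    return 0.5 * (
        F.kl_div(log_p_i2t, target, reduction="batchmean")
        + F.kl_div(log_p_t2i, target, reduction="batchmean")
    )
```

Stacking several such layers would let node features move from 'local' (individual patches and tokens) toward 'global' (fused cross-modal context), which is the progression the abstract refers to; the mutual-information term mentioned in the abstract is not sketched here.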