Computer science
Closed captioning
Transformer
Encoder
Generator (circuit theory)
Image (mathematics)
Key (lock)
Artificial intelligence
Attention network
Pattern recognition (psychology)
Computer vision
Power (physics)
Physics
Computer security
Quantum mechanics
Voltage
Operating system
Authors
Hashem Parvin, Ahmad Reza Naghsh-Nilchi, Hossein Mahvash Mohammadi
Identifier
DOI:10.1016/j.engappai.2023.106545
Abstract
Image captioning generates a human-like description for a query image and has attracted considerable attention recently. The most widely used model for image description is the encoder–decoder structure, where the encoder extracts the visual information of the image and the decoder generates its textual description. Transformers have significantly enhanced the performance of image captioning models. However, a single attention structure in transformers cannot capture more complex relationships between key and query vectors. Furthermore, attention weights are assigned to all candidate vectors under the assumption that every vector is relevant. In this paper, a new double-attention framework is presented that improves the encoder–decoder structure for the image captioning problem. To this end, a local generator module and a global generator module are designed to predict textual descriptions collaboratively. The proposed approach improves Self-Attention (SA) in two respects. First, a Masked Self-Attention module is presented to attend to the most relevant information. Second, to avoid a single shallow attention distribution and to model deeper internal relations, a Hybrid Weight Distribution (HWD) module is proposed that extends SA to exploit the relations between key and query vectors efficiently. Experiments on the Flickr30k and MS-COCO datasets show that the proposed approach achieves strong performance on different evaluation measures compared with state-of-the-art frameworks.
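For readers unfamiliar with the baseline mechanism the abstract builds on, the sketch below shows standard scaled dot-product self-attention over image-region features with an optional mask. It is only a minimal illustration of vanilla SA, not the paper's Masked Self-Attention or HWD modules, whose exact formulations are not given in the abstract; all function names, shapes, and the masking convention here are assumptions.

```python
# Minimal sketch of scaled dot-product self-attention with an optional mask.
# Illustrative only: the paper's Masked Self-Attention and HWD modules modify
# this baseline in ways the abstract does not specify.
import numpy as np

def self_attention(x, w_q, w_k, w_v, mask=None):
    """x: (n, d) region features; w_q, w_k, w_v: (d, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # query, key, value vectors
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (n, n) pairwise relevance
    if mask is not None:
        scores = np.where(mask, scores, -1e9)     # suppress masked positions
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)     # softmax over key positions
    return weights @ v                            # attended features, (n, d_k)

# Example usage: 5 hypothetical image-region features of dimension 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
mask = np.ones((5, 5), dtype=bool)                # all positions visible
print(self_attention(x, w_q, w_k, w_v, mask).shape)  # (5, 8)
```

In this baseline, every key receives some weight for every query, which is exactly the assumption the paper's masking and hybrid weighting are designed to relax.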