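Closed captioning
Computer science
Transformer
Sentence
Artificial intelligence
Computer vision
Image (mathematics)
Engineering
Electrical engineering
Voltage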
Authors
Zihang Chen, Junjue Wang, Ailong Ma, Yanfei Zhong
Identifier
DOI:10.1109/lgrs.2022.3192062
Abstract
Image captioning in remote sensing can help us understand the inner attributes of objects and the outer relations between different objects. However, existing image captioning algorithms lack the ability to form a global representation and cannot capture object relations over long distances. In addition, these algorithms generate captions randomly, without considering specific demands. To this end, we propose a pure transformer architecture with a caption type controller for remote sensing image captioning. Specifically, a multi-scale vision transformer is adopted for image representation, where global and detailed content can be captured with multi-head self-attention layers. A transformer decoder is then introduced to successively translate the image features into comprehensive sentences. An optional block called the caption type controller is designed to account for the type of caption through caption type matrix sets according to the demands, embedding the learnable sentence feature with the required type. Comparison and ablation experiments conducted on the Remote Sensing Image Captioning Dataset (RSICD) demonstrate that the proposed framework outperforms current state-of-the-art image captioning methods. Experiments conducted on the FloodNet caption dataset further illustrate that the proposed method can effectively generate specific types of captions.
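The two mechanisms named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the attention is single-head with identity projections, and `apply_type_matrix`, the type names `"attribute"` and `"relation"`, and all numeric values are hypothetical stand-ins chosen only to show the idea of selecting a per-type matrix and applying it to a sentence feature.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Single-head scaled dot-product self-attention with identity
    # query/key/value projections (a simplification, not the paper's layer).
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * tokens[j][i] for j in range(len(tokens))) for i in range(d)]
        )
    return outputs, weights

def apply_type_matrix(sentence_feature, type_matrices, caption_type):
    # Hypothetical controller: look up the matrix for the requested caption
    # type and multiply it with the sentence feature.
    matrix = type_matrices[caption_type]
    return [sum(r * x for r, x in zip(row, sentence_feature)) for row in matrix]

# Three toy "patch" features; every patch attends to every other patch,
# which is what gives self-attention its long-range reach.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = self_attention(patches)

# Assumed type matrices: fixed here, learnable in a trained model.
type_matrices = {
    "attribute": [[1.0, 0.0], [0.0, 0.0]],
    "relation": [[0.0, 0.0], [0.0, 1.0]],
}
sentence_feature = [0.5, 2.0]
attribute_feature = apply_type_matrix(sentence_feature, type_matrices, "attribute")
relation_feature = apply_type_matrix(sentence_feature, type_matrices, "relation")
```

Each row of `weights` is a distribution over all patches, so distant patches contribute to every output token. Switching `caption_type` changes the feature passed on to decoding without touching the image encoder, which is the sense in which such a block can remain optional.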