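Closed captioning
Computer science
Transformer
Artificial intelligence
Computer vision
Feature (linguistics)
Feature extraction
Sentence
Image (mathematics)
Linguistics
Philosophy
Physics
Quantum mechanics
Voltage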
Authors
Runyan Du, Wei Cao, Wenkai Zhang, Zhi Guo, Xian Sun, Shuoke Li, Jihao Li
Identifier
DOI:10.1109/jstars.2023.3305889
Abstract
With the growth of remote sensing images, understanding image content automatically has attracted many researchers' interest in deep learning for remote sensing imagery. Inspired by natural image captioning, models with a CNN-RNN backbone supplemented by attention have been widely used in remote sensing image captioning. However, it is inefficient for the current attention layer to simultaneously mine the hidden foreground from the background of a remote sensing image and perform interactive feature learning. Meanwhile, the new mainstream language model has recently surpassed the traditional LSTM in sentence generation. To solve these problems, in this paper we propose a novel idea: making flat remote sensing images stereoscopic by separating the foreground and background. Based on this hierarchical image information, we design a novel Deformable Transformer equipped with deformable scaled dot-product attention to learn multi-scale features from the foreground and background through its powerful interactive learning ability. Evaluations are conducted on four classic remote sensing image captioning datasets. Compared with state-of-the-art methods, our Transformer variant achieves higher captioning accuracy.
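For readers unfamiliar with the term, the sketch below shows standard scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, which the abstract's "deformable scaled dot-product attention" presumably builds on. It is not the authors' deformable variant, whose details the abstract does not give. The function names and the list-of-lists layout are assumptions made for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Standard (non-deformable) attention; hypothetical helper, not the paper's code.

    queries: list of n_q vectors of length d_k
    keys:    list of n_k vectors of length d_k
    values:  list of n_k vectors of length d_v
    Returns a list of n_q vectors of length d_v.
    """
    d_k = len(keys[0])
    scale = 1.0 / math.sqrt(d_k)
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by 1/sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        outputs.append(out)
    return outputs
```

Each output row is a weighted average of the value vectors, with weights set by how strongly the query matches each key.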