Keywords: Closed captioning; Computer science; Channel (broadcasting); Image (mathematics); Feature (linguistics); Artificial intelligence; Inference; Grid; Key (lock); Pattern recognition (psychology); Data mining; Computer vision; Machine learning; Computer network; Linguistics; Philosophy; Geometry; Mathematics; Computer security; Identification
DOI:10.1109/icassp48485.2024.10446104
Abstract
Rich image and text features can substantially improve the training of image captioning models. However, rich features also bring in a large amount of unnecessary information. In this work, to fully explore and exploit the key information in images and text, we treat the combination of image and text features as a data screening problem: candidate feature combinations are dynamically screened through a series of inference strategies so as to select the optimal image and text features. First, to strengthen the model's prior knowledge, we design three input features: grid image features, region image features, and text features. Second, we build a multi-source dynamic interaction network with three feature enhancement channels: a global scene enhancement channel, a regional feature enhancement channel, and a multimodal semantic enhancement channel. Finally, the model uses a dynamic selection mechanism to choose the most appropriate enhanced features to feed to the decoder. We validate the effectiveness of the approach by comparing against a baseline model, and an in-depth analysis of each module shows that the method makes fuller use of the available resources to achieve better results.
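The dynamic selection step described above can be read as a learned gate over the three enhancement channels: each channel's output is scored, the scores are normalized with a softmax, and the channel outputs are blended by those weights before entering the decoder. The following is a minimal sketch of that idea in plain Python; the gating vector `gate_w` and the dot-product scoring rule are hypothetical illustrations, not the paper's actual parameterization.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scalars
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_select(channels, gate_w):
    """Blend enhancement-channel features by softmax gate weights.

    channels: list of three feature vectors (global scene, regional,
              multimodal semantic) -- one per enhancement channel
    gate_w:   learned gating vector (hypothetical; same length as a feature)
    Returns the fused feature vector and the per-channel weights.
    """
    # Score each channel by a dot product with the gating vector (assumed form)
    scores = [sum(f * w for f, w in zip(feat, gate_w)) for feat in channels]
    weights = softmax(scores)
    d = len(channels[0])
    fused = [sum(weights[k] * channels[k][i] for k in range(len(channels)))
             for i in range(d)]
    return fused, weights

# Toy example: the third channel dominates because it scores highest
channels = [[1.0] * 4, [0.0] * 4, [2.0] * 4]
fused, weights = dynamic_select(channels, gate_w=[1.0] * 4)
```

In a trained model the gate would be conditioned on the decoding state rather than a fixed vector, but the soft blending shown here is what lets the selection remain differentiable end to end.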