Computer science
Modal verb
Image fusion
Artificial intelligence
Fusion
Computer vision
Image (mathematics)
Linguistics
Philosophy
Chemistry
Polymer chemistry
Authors
Ram Manohar Oruganti,Shagan Sah,Suhas Pillai,Raymond Ptucha
Identifier
DOI:10.1109/icip.2016.7533033
Abstract
Current research in computer vision and machine learning has demonstrated strong capabilities in detecting and recognizing objects in natural images. The promising results in these areas have inspired research toward solving more complex multi-modal learning problems in the image/video domain, such as automatic annotation, segmentation, labelling, and generic understanding. Although solutions have been provided for one or more of these problems, the approaches have been application-specific. This paper introduces an end-to-end trainable Fusion-based Recurrent Multi-Modal (FRMM) model to address multi-modal applications. FRMM allows each input modality to be independent in terms of architecture, parameters, and length of input sequences. FRMM image description models seamlessly blend convolutional neural network feature descriptors with sequential language data in a recurrent framework. For training and testing we used the Flickr30K and MSCOCO datasets, demonstrating state-of-the-art description results.
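The abstract describes blending CNN image descriptors with a word-embedding sequence inside a recurrent framework, with each modality keeping its own parameters. The minimal NumPy sketch below illustrates that general idea only: all dimensions, weight names, and the simple additive fusion step are illustrative assumptions, not the paper's actual FRMM architecture or fusion mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative, not from the paper)
img_dim, embed_dim, hidden_dim, vocab = 512, 128, 256, 1000

# Per-modality projections: each modality has independent parameters
W_img = rng.standard_normal((hidden_dim, img_dim)) * 0.01    # image branch
W_emb = rng.standard_normal((hidden_dim, embed_dim)) * 0.01  # language branch

# Shared recurrent fusion stage and output projection
W_h = rng.standard_normal((hidden_dim, hidden_dim)) * 0.01
W_out = rng.standard_normal((vocab, hidden_dim)) * 0.01

def frmm_step(h, fused):
    """One recurrent update over the fused multi-modal input."""
    return np.tanh(W_h @ h + fused)

def describe(img_feat, word_embs):
    """Fuse a CNN image descriptor with a word-embedding sequence,
    returning per-step vocabulary logits."""
    h = np.zeros(hidden_dim)
    logits = []
    img_proj = W_img @ img_feat      # image modality projected once
    for e in word_embs:              # language sequence of any length
        fused = img_proj + W_emb @ e  # assumed additive fusion of modalities
        h = frmm_step(h, fused)
        logits.append(W_out @ h)
    return np.stack(logits)

img_feat = rng.standard_normal(img_dim)      # e.g. a CNN feature descriptor
words = rng.standard_normal((5, embed_dim))  # a 5-token partial caption
out = describe(img_feat, words)
print(out.shape)  # (5, 1000): one vocabulary distribution per step
```

Because each modality is projected by its own weight matrix before fusion, the branches can differ in architecture and input length, matching the modality-independence property the abstract claims.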