Concepts
Modality
Computer science
Artificial intelligence
Architecture
Transformer
Machine learning
Multimodal learning
Feature (linguistics)
Modality (human–computer interaction)
Feature learning
Feature extraction
Deep learning
Pattern recognition (psychology)
Engineering
Electrical engineering
Philosophy
Art
Sociology
Visual arts
Voltage
Linguistics
Social science
Authors
Kenan E. Ak, Gwang-Gook Lee, Xu Yan, Mingwei Shen
Identifiers
DOI:10.1109/icip49359.2023.10223098
Abstract
People navigate a world that involves many different modalities and make decisions based on what they observe. Many of the classification problems we face in the modern digital world are also multimodal in nature: textual information on the web rarely occurs alone and is often accompanied by images, sounds, or videos. The use of transformers in deep learning tasks has proven to be highly effective. However, the relationship between different modalities remains unclear. This paper investigates ways to simultaneously utilize self-attention over both text and vision modalities. We propose a novel architecture that combines the strengths of both modalities. We show that combining a text model with a fixed image model leads to the best classification performance. Additionally, we incorporate a late fusion technique to enhance the architecture's ability to capture multiple modalities. Our experiments demonstrate that our proposed method outperforms state-of-the-art baselines on the Food101, MM-IMDB, and FashionGen datasets.
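The abstract names two concrete design choices: pairing a trainable text model with a fixed (frozen) image model, and merging the modalities via late fusion before classification. The sketch below is a minimal PyTorch illustration of that general pattern, not the authors' implementation: all module names, dimensions, the toy image backbone, and the mean-pooling choice are assumptions made for a self-contained example.

```python
# Minimal late-fusion sketch (NOT the paper's code). It shows:
# (1) a trainable text transformer alongside a *frozen* image encoder, and
# (2) late fusion: each modality is encoded separately and the pooled
#     features are concatenated just before the classifier head.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, vocab_size=30522, d_text=256, d_image=512, num_classes=101):
        super().__init__()
        # Trainable text branch: embedding + small transformer encoder.
        self.embed = nn.Embedding(vocab_size, d_text)
        layer = nn.TransformerEncoderLayer(d_model=d_text, nhead=8, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Stand-in image branch; in practice this would be a pretrained
        # vision backbone (e.g. a ViT or CNN). It is frozen below.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_image),
        )
        for p in self.image_encoder.parameters():
            p.requires_grad = False  # "fixed image model": no gradient updates
        # Late fusion: concatenate the pooled features of both branches.
        self.head = nn.Linear(d_text + d_image, num_classes)

    def forward(self, token_ids, images):
        t = self.text_encoder(self.embed(token_ids)).mean(dim=1)  # mean-pool tokens
        with torch.no_grad():  # keep the image branch fixed
            v = self.image_encoder(images)
        return self.head(torch.cat([t, v], dim=-1))

model = LateFusionClassifier()
logits = model(torch.randint(0, 30522, (2, 32)), torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 101])
```

Freezing the image branch keeps its pretrained features stable while only the text encoder and fusion head are trained, which is one plausible reading of the "fixed image model" result reported in the abstract.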