Modality
Concatenation (mathematics)
Computer science
Artificial intelligence
Sensor fusion
Context (archaeology)
Fusion
Machine learning
Mathematics
Social science
Linguistics
Biology
Combinatorics
Philosophy
Sociology
Paleontology
Authors
Gaurav Sahu, Olga Vechtomova
Source
Journal: Cornell University - arXiv
Date: 2019-11
Citations: 6
Identifier
DOI: 10.48550/arxiv.1911.03821
Abstract
Effective fusion of data from multiple modalities, such as video, speech, and text, is challenging due to the heterogeneous nature of multimodal data. In this paper, we propose adaptive fusion techniques that aim to model context from different modalities effectively. Instead of defining a deterministic fusion operation, such as concatenation, for the network, we let the network decide "how" to combine a given set of multimodal features more effectively. We propose two networks: 1) Auto-Fusion, which learns to compress information from different modalities while preserving the context, and 2) GAN-Fusion, which regularizes the learned latent space given context from complementing modalities. A quantitative evaluation on the tasks of multimodal machine translation and emotion recognition suggests that our lightweight, adaptive networks can better model context from other modalities than existing methods, many of which employ massive transformer-based networks.
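To make the Auto-Fusion idea concrete, the sketch below shows one plausible reading in PyTorch: concatenate the per-modality feature vectors, compress them into a shared latent code, and train that code to reconstruct the concatenation, so the network itself decides what to keep from each modality. The layer sizes, activation choice, and all names (AutoFusion, compress, reconstruct) are illustrative assumptions, not the authors' exact architecture; the adversarial regularizer of GAN-Fusion is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoFusion(nn.Module):
    """Hypothetical sketch of an Auto-Fusion-style module: a learned,
    context-preserving compression of concatenated multimodal features.
    Dimensions and layers are assumptions, not the paper's configuration."""

    def __init__(self, input_dim: int, latent_dim: int):
        super().__init__()
        # Compress the concatenated features into a fused latent code.
        self.compress = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.Tanh())
        # Reconstruct the concatenation from the code; good reconstruction
        # means the code retained context from every modality.
        self.reconstruct = nn.Linear(latent_dim, input_dim)

    def forward(self, *modality_features: torch.Tensor):
        # Deterministic concatenation is only the starting point; the
        # network learns "how" to combine it when squeezing it into z.
        x = torch.cat(modality_features, dim=-1)
        z = self.compress(x)
        x_hat = self.reconstruct(z)
        recon_loss = F.mse_loss(x_hat, x)  # auxiliary context-preservation loss
        return z, recon_loss

# Usage with made-up feature sizes: 300-d text, 128-d audio, 512-d video.
text, audio, video = torch.randn(8, 300), torch.randn(8, 128), torch.randn(8, 512)
fusion = AutoFusion(input_dim=300 + 128 + 512, latent_dim=256)
z, loss = fusion(text, audio, video)  # z feeds the downstream task head
```

Under this reading, the fused code z replaces plain concatenation as the input to the downstream task (e.g., translation or emotion recognition), with the reconstruction loss added to the task loss, which is consistent with the abstract's claim of a lightweight alternative to large transformer-based fusion.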