Computer science
Classification
Pattern
Margin (machine learning)
Artificial intelligence
Social media
Process (computing)
Machine learning
Sample (material)
Restriction
Event (particle physics)
Graph
Natural language processing
World Wide Web
Theoretical computer science
Chemistry
Engineering
Sociology
Physics
Operating system
Mechanical engineering
Quantum mechanics
Chromatography
Social science
Authors
Mahdi Abavisani,Liwei Wu,Shengli Hu,Joel Tetreault,Alejandro Jaimes
Identifier
DOI:10.1109/cvpr42600.2020.01469
Abstract
Recent developments in image classification and natural language processing, coupled with the rapid growth in social media usage, have enabled fundamental advances in detecting breaking events around the world in real time. Emergency response is one such area that stands to gain from these advances. By processing billions of texts and images a minute, events can be automatically detected to enable emergency response workers to better assess rapidly evolving situations and deploy resources accordingly. To date, most event detection techniques in this area have focused on image-only or text-only approaches, limiting detection performance and impacting the quality of information delivered to crisis response teams. In this paper, we present a new multimodal fusion method that leverages both images and texts as input. In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities on a sample-by-sample basis. In addition, we employ a multimodal graph-based approach to stochastically transition between embeddings of different multimodal pairs during training, which better regularizes the learning process and mitigates limited training data by constructing new matched pairs from different samples. We show that our method outperforms unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
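The abstract describes the fusion mechanism only at a high level. The sketch below is a minimal, hypothetical illustration of sample-wise gated cross-modal fusion of the kind described (a learned, per-sample gate that down-weights uninformative components of a weak modality); it assumes image and text embeddings of equal dimension, and the module and variable names (CrossAttentionFusion, img_emb, txt_emb) are placeholders, not the authors' implementation.

```python
# Minimal sketch of per-sample gated fusion of an image embedding and a
# text embedding. Hypothetical illustration only, not the paper's code.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Each gate looks at both modalities to decide, per sample,
        # how much of its own modality to keep.
        self.img_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.txt_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_emb, txt_emb: (batch, dim) pooled unimodal features
        joint = torch.cat([img_emb, txt_emb], dim=-1)
        # Per-sample gates suppress uninformative or misleading components.
        img_kept = self.img_gate(joint) * img_emb
        txt_kept = self.txt_gate(joint) * txt_emb
        return self.fuse(torch.cat([img_kept, txt_kept], dim=-1))


if __name__ == "__main__":
    fusion = CrossAttentionFusion(dim=256)
    img = torch.randn(8, 256)   # e.g. pooled CNN image features
    txt = torch.randn(8, 256)   # e.g. pooled text-encoder features
    out = fusion(img, txt)
    print(out.shape)            # torch.Size([8, 256])
```

In this toy version, a modality judged uninformative for a given sample receives gate values near zero, so the fused representation leans on the stronger modality; the paper's actual cross-attention module and graph-based pair mixing are more involved than this sketch.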