Computer Science
Fusion
Artificial Intelligence
Human-Computer Interaction
Linguistics
Philosophy
Authors
Bo Hu, Kai Zhang, Yanghai Zhang, Yuyang Ye
Source
Journal: Proceedings of the ... AAAI Conference on Artificial Intelligence
[Association for the Advancement of Artificial Intelligence (AAAI)]
Date: 2025-04-11
Volume/Issue: 39 (16): 17267-17275
Identifier
DOI: 10.1609/aaai.v39i16.33898
Abstract
In recent years, deep multimodal learning has seen significant advancements. However, there remains a lack of multimodal fusion methods capable of dynamically adjusting the weighting of information both within and across modalities based on input samples. In the domain of multimodal intent recognition, the text modality often contains the most relevant information for intent detection, while the audio and visual modalities provide comparatively less critical information. There is a significant variation in the density of important information across different modalities and samples. To address this challenge, we propose a Dynamic Attention Allocation Fusion (DAF) method with an adaptive network structure that dynamically allocates attention both within individual modalities and across multiple modalities. This approach enables the model to focus more effectively on the most informative modalities and their respective internal features. Furthermore, we introduce a multi-view contrastive learning framework based on DAF (MVCL-DAF). This framework uses distinct and isolated modules to process information from various modalities, taking inspiration from the way the human brain processes multimodal information. Each modality independently infers intent using its respective module, while DAF integrates the multimodal information to produce a comprehensive global intent prediction. The text modality, functioning as the primary modality due to its rich semantic content, guides the other modules in the multi-view contrastive learning process. Extensive experiments demonstrate that our approach significantly outperforms existing state-of-the-art methods.
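The abstract describes, but does not detail, how DAF allocates attention both within individual modalities and across modalities on a per-sample basis. The snippet below is a minimal PyTorch sketch of that general idea only: token-level attention pools each modality, then a second attention layer weights the pooled modalities for every input sample. All class names, dimensions, and the single-head linear scoring scheme are illustrative assumptions and not the paper's actual DAF architecture or the MVCL-DAF contrastive framework.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicAttentionFusionSketch(nn.Module):
    """Hypothetical sketch: per-sample attention within and across modalities.

    Names and scoring layers are assumptions for illustration; the paper's
    DAF module may differ substantially.
    """
    def __init__(self, dim: int):
        super().__init__()
        # Scores individual tokens/frames inside one modality (intra-modal attention).
        self.token_scorer = nn.Linear(dim, 1)
        # Scores whole modalities against each other (inter-modal attention).
        self.modality_scorer = nn.Linear(dim, 1)

    def forward(self, modality_feats):
        # modality_feats: list of tensors, each of shape (batch, tokens_m, dim)
        pooled = []
        for feats in modality_feats:
            # Attention over tokens within a single modality, recomputed per sample.
            w = F.softmax(self.token_scorer(feats), dim=1)       # (B, T_m, 1)
            pooled.append((w * feats).sum(dim=1))                # (B, dim)
        stacked = torch.stack(pooled, dim=1)                     # (B, M, dim)
        # Attention over modalities, also recomputed for every input sample,
        # so e.g. text can dominate for one sample and audio for another.
        mw = F.softmax(self.modality_scorer(stacked), dim=1)     # (B, M, 1)
        return (mw * stacked).sum(dim=1)                         # fused (B, dim)

if __name__ == "__main__":
    # Hypothetical text / audio / video features with different sequence lengths.
    batch, dim = 4, 256
    text, audio, video = (torch.randn(batch, n, dim) for n in (32, 50, 16))
    fusion = DynamicAttentionFusionSketch(dim)
    fused = fusion([text, audio, video])
    print(fused.shape)  # torch.Size([4, 256])
```

In this sketch the fused vector would feed a global intent classifier, while per-modality predictions and the text-guided contrastive objective described in the abstract are omitted.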