Computer science
Modality
Leverage (statistics)
Artificial intelligence
Coding (social sciences)
Feature learning
Speech recognition
Audiovisual
Multimodal learning
Machine learning
Pattern recognition (psychology)
Natural language processing
Multimedia
Statistics
Chemistry
Polymer chemistry
Mathematics
Authors
Chao Sun, Min Chen, Jialiang Cheng, Huagen Liang, Chuanbo Zhu, Jincai Chen
Identifier
DOI:10.1145/3581783.3613805
Abstract
Audio and vision are important senses for high-level cognition, and their strong correlation makes audio-visual coding a crucial factor in many multimodal tasks. However, audio-visual coding faces two challenges. First, the heterogeneity of multimodal data often leads to misalignment of cross-modal features within the same sample, which reduces representation quality. Second, most self-supervised learning frameworks are built on instance semantics, and the generated pseudo labels introduce additional classification noise. To address these challenges, we propose a Supervised Cross-modal Contrastive Learning Framework for Audio-Visual Coding (SCLAV). Our framework includes an audio-visual coding network composed of an inter-modal attention interaction module and an intra-modal self-integration module, which leverage complementary and hidden multimodal information for better representations. Additionally, we introduce a supervised cross-modal contrastive loss to minimize the distance between audio and vision features of the same instance, and use weak labels of the multimodal data to eliminate feature-oriented classification noise. Extensive experiments on the AVE and XD-Violence datasets demonstrate that SCLAV outperforms state-of-the-art methods, even with limited computational resources.
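The abstract describes a supervised cross-modal contrastive loss that pulls together audio and visual features of the same instance and uses weak labels to define positives. The paper's exact formulation is not given here, so the following is a minimal illustrative sketch of one common label-aware variant: for each audio embedding, the positives are all visual embeddings in the batch that share its (weak) label, including its own pair, and the remaining visual embeddings act as negatives. The function name and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def supervised_cross_modal_contrastive_loss(audio, visual, labels, temperature=0.1):
    """Label-aware contrastive loss between two modalities (illustrative sketch).

    audio, visual: (N, D) batches of embeddings, row i of each coming from
    the same instance. labels: (N,) weak class labels. Positives for anchor
    i are all visual rows j with labels[j] == labels[i]; the rest are negatives.
    """
    # L2-normalize so dot products are cosine similarities
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)

    sim = a @ v.T / temperature  # (N, N) temperature-scaled similarity matrix
    # row-wise log-softmax over all visual candidates
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))

    # positive mask: pairs that share a weak label (diagonal always included)
    pos_mask = (labels[:, None] == labels[None, :]).astype(float)

    # mean log-likelihood of positives per anchor, averaged over the batch
    loss = -(pos_mask * log_prob).sum(axis=1) / pos_mask.sum(axis=1)
    return loss.mean()

# toy usage: four instances, two weak classes
rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1])
audio = rng.normal(size=(4, 8))
visual = audio + 0.01 * rng.normal(size=(4, 8))  # nearly aligned pairs
loss = supervised_cross_modal_contrastive_loss(audio, visual, labels)
```

Compared with instance-level contrastive learning, grouping positives by weak label means semantically matching but non-paired samples are no longer treated as negatives, which is how the abstract's "feature-oriented classification noise" from pseudo labels is reduced.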