Authors
Peijun Bao, Wenhan Yang, Boon Poh Ng, M. H. Er, Alex C. Kot
Source
Journal: Proceedings of the ... AAAI Conference on Artificial Intelligence
[Association for the Advancement of Artificial Intelligence (AAAI)]
Date: 2023-06-26
Volume/Issue: 37 (1): 215-222
Citations: 6
Identifiers
DOI: 10.1609/aaai.v37i1.25093
Abstract
This paper explores, for the first time, audio-visual event localization in an unsupervised manner. Previous methods tackle this problem in a supervised setting and require segment-level or video-level event category ground truth to train the model. However, building large-scale multi-modality datasets with category annotations is labor-intensive and thus not scalable to real-world applications. To this end, we propose cross-modal label contrastive learning to exploit multi-modal information among unlabeled audio and visual streams as self-supervision signals. At the feature representation level, multi-modal representations are collaboratively learned from audio and visual components using self-supervised representation learning. At the label level, we propose a novel self-supervised pretext task, i.e., label contrasting, to self-annotate videos with pseudo-labels for localization model training. Note that irrelevant background would hinder the acquisition of high-quality pseudo-labels and thus lead to an inferior localization model. To address this issue, we further propose an expectation-maximization algorithm that optimizes the pseudo-label acquisition and the localization model in a coarse-to-fine manner. Extensive experiments demonstrate that our unsupervised approach performs reasonably well compared to state-of-the-art supervised methods.
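The abstract does not spell out the exact form of the cross-modal contrastive objective, but self-supervised learning between paired audio and visual streams is commonly implemented as a symmetric InfoNCE-style loss, where audio/visual embeddings from the same video are treated as positive pairs and all others in the batch as negatives. The sketch below illustrates that general idea in NumPy; the function name, shapes, and temperature value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cross_modal_info_nce(audio, visual, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss between paired modalities.

    audio, visual: (N, D) embedding arrays; row i of each is assumed to come
    from the same video, so positives lie on the diagonal of the similarity
    matrix. This is a generic sketch, not the paper's exact objective.
    """
    # L2-normalize so dot products are cosine similarities
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)

    logits = a @ v.T / temperature  # (N, N) pairwise similarity matrix

    def nll_diag(l):
        # Row-wise log-softmax; the positive pair sits on the diagonal.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Symmetric: audio-to-visual and visual-to-audio retrieval directions.
    return 0.5 * (nll_diag(logits) + nll_diag(logits.T))
```

When the two inputs are correctly paired row-by-row, the diagonal similarities dominate and the loss is small; shuffling one modality against the other breaks the pairing and drives the loss up, which is the signal exploited for self-supervision.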