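Computer science
Closeness
Artificial intelligence
Similarity (geometry)
Audio-visual
Task (project management)
Frame (networking)
Modality (human-computer interaction)
Event (particle physics)
Audio signal processing
Scale (ratio)
Pattern recognition (psychology)
Audio signal
Machine learning
Computer vision
Speech recognition
Image (mathematics)
Multimedia
Speech coding
Economics
Quantum mechanics
Telecommunications
Management
Mathematical analysis
Mathematics
Physics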
Authors
Peng Wu, Xiaotao Liu, Jing Liu
Identifier
DOI:10.1109/tmm.2022.3147369
Abstract
Violence detection in videos holds great promise for practical applications, given the massive volume of video that has emerged in recent years. Most previous works treat violence detection as a simple video classification task and rely on a single modality, e.g., the visual signal, from small-scale datasets. Such solutions are inadequate. To mitigate this problem, we study weakly supervised violence detection on large-scale audio-visual violence data. We first introduce two complementary tasks, i.e., coarse-grained violent frame detection and fine-grained violent event detection, which advance simple violent video classification to frame-level violent event localization, with the aim of accurately locating violent events in untrimmed videos. We then propose a novel network that takes audio-visual data as input and contains three parallel branches that capture different relationships among video snippets and further integrate features: the similarity branch and the proximity branch capture long-range dependencies using a similarity prior and a proximity prior, respectively, and the score branch dynamically captures the closeness of the predicted scores. On both the coarse-grained and fine-grained tasks, our approach outperforms other state-of-the-art approaches on two public datasets. Experimental results also show the positive effect of audio-visual input and relationship modeling.
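The abstract does not give formulas for the similarity and proximity priors, so the following is only a minimal sketch of how such priors are commonly expressed as snippet-to-snippet weight matrices. It is not the authors' network. The function names, the use of cosine similarity, the exponential decay over temporal distance, and the `sigma` parameter are all assumptions made for illustration.

```python
import math


def row_normalize(matrix):
    """Scale each row to sum to 1 (all-zero rows are left unchanged)."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else list(row))
    return out


def similarity_adjacency(features):
    """Assumed similarity prior: clipped cosine similarity between snippet features."""
    norms = [math.sqrt(sum(v * v for v in f)) for f in features]
    n = len(features)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(features[i], features[j]))
            denom = norms[i] * norms[j]
            # Negative similarities are clipped to zero (an illustrative choice).
            adj[i][j] = max(dot / denom, 0.0) if denom > 0 else 0.0
    return row_normalize(adj)


def proximity_adjacency(n, sigma=1.0):
    """Assumed proximity prior: weight decays with temporal distance |i - j|."""
    adj = [[math.exp(-abs(i - j) / sigma) for j in range(n)] for i in range(n)]
    return row_normalize(adj)


def aggregate(adj, features):
    """One propagation step: each snippet becomes a weighted mix of all snippets."""
    dim = len(features[0])
    return [
        [sum(adj[i][j] * features[j][d] for j in range(len(features)))
         for d in range(dim)]
        for i in range(len(adj))
    ]


# Three toy snippet features: the first two are identical, the third is orthogonal.
snippets = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
by_similarity = aggregate(similarity_adjacency(snippets), snippets)
by_proximity = aggregate(proximity_adjacency(len(snippets)), snippets)
```

In this toy setting the similarity matrix links snippets with similar content regardless of their position in the video, while the proximity matrix links temporal neighbours regardless of content. This is one way to read the abstract's description of the two branches as complementary.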