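Computer science
Audio-visual
Gaze
Artificial intelligence
Salience
Visualization
Fixation (population genetics)
Eye movement
Audio analyzer
Modality (human-computer interaction)
Sensory cue
Computer vision
Set (abstract data type)
Visual attention
Human visual system model
Speech recognition
Perception
Audio signal processing
Audio signal
Speech coding
Psychology
Image (mathematics)
Multimedia
Population
Demography
Neuroscience
Sociology
Programming language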
Authors
Xiongkuo Min,Guangtao Zhai,Chunjia Hu,Ke Gu
Identifier
DOI:10.1109/vcip.2015.7457921
Abstract
In this paper, we propose to predict human fixations by incorporating both audio and visual cues. Traditional visual attention models generally make full use of the stimuli's visual features while discarding all audio information. But in the real world, we human beings not only direct our gaze according to visual saliency but may also be attracted by salient audio. Psychological experiments show that audio may have some influence on visual attention, and that subjects tend to be attracted to the sound sources. Therefore, we propose to fuse both audio and visual information to predict fixations. In our framework, we first localize the moving-sounding objects through multimodal analysis and generate an audio attention map, in which a greater value denotes a higher likelihood that a position is the sound source. Then we calculate the spatial and temporal attention maps using only the visual modality. Finally, the audio, spatial and temporal attention maps are fused, generating our final audio-visual saliency map. We gather a set of videos and collect eye-tracking data under audio-visual test conditions. Experimental results show that we can achieve better performance when considering both audio and visual cues.
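The abstract does not say how the three attention maps are fused, so the following is only a minimal sketch of one plausible rule: min-max normalize each map, then take a weighted average. The names `normalize` and `fuse_maps`, the `weights` parameter, and the toy 2x2 maps are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence

Map = List[List[float]]  # a 2-D attention map stored as rows of floats


def normalize(m: Map) -> Map:
    """Min-max scale a map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / (hi - lo) for v in row] for row in m]


def fuse_maps(audio: Map, spatial: Map, temporal: Map,
              weights: Sequence[float] = (1.0, 1.0, 1.0)) -> Map:
    """Weighted average of the normalized maps (an assumed fusion rule,
    not the one used in the paper)."""
    maps = [normalize(m) for m in (audio, spatial, temporal)]
    total = sum(weights)
    rows, cols = len(audio), len(audio[0])
    return [[sum(w * m[r][c] for w, m in zip(weights, maps)) / total
             for c in range(cols)]
            for r in range(rows)]


# Toy maps: all three cues agree on the top-right position.
audio = [[0.0, 4.0], [0.0, 0.0]]
spatial = [[1.0, 3.0], [1.0, 1.0]]
temporal = [[0.0, 2.0], [2.0, 0.0]]
saliency = fuse_maps(audio, spatial, temporal)
```

In this toy case the top-right position receives the highest fused saliency because all three maps peak there, while the bottom-left position is supported only by the temporal map and scores lower.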