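Conversation
Emotion recognition
Psychology
Speaker recognition
Speech recognition
Computer science
Linguistics
Artificial intelligence
Cognitive psychology
Natural language processing
Communication
Philosophy
Authors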
Siyuan Shen,Feng Liu,Hanyang Wang,Aimin Zhou
Identifier
DOI:10.1109/taffc.2025.3558222
Abstract
Emotion recognition in conversation (ERC) has attracted increasing attention for perceiving user emotion in practical conversational applications. Because conversational utterances are spoken alternately by different speakers, most studies leverage speaker information based on gold speaker labels. In this work, we challenge the existing paradigm of relying on available speaker labels by considering a more realistic scenario in which the speaker identity of each utterance is unknown during inference. We propose Progressive Contrastive Deep Supervision for multimodal emotion recognition in conversation (PCDS), which incorporates speaker diarization and emotion recognition into one unified framework. To facilitate joint task learning, we inject speaker and emotion bias into the network progressively via contrastive deep supervision, with a task-irrelevant contrast serving as the intermediate transition. To obtain explicit speaker dependency, we propose a speaker contrast and clustering (SCC) module that gives the model the ability to partition speakers into groups even when neither the speaker labels nor the number of speakers is known a priori. Experiments on two ERC benchmarks, IEMOCAP and MELD, demonstrate the effectiveness of the proposed method. We also show that progressive contrastive deep supervision helps reconcile the underlying tension between speaker diarization and emotion recognition. Source code is available on GitHub (https://github.com/Cross-Innovation-Lab/PCDS/).