Computer science
Cluster analysis
Artificial intelligence
Natural language processing
Sentence
Authors
Kaihui Guo,Wenhua Xu,Tianyang Liu
Identifier
DOI:10.1109/itaic58329.2023.10409027
Abstract
We propose ClusCSE, an unsupervised sentence embedding framework. Contrastive learning has been widely researched for learning universal sentence embeddings in natural language processing. Contrastive methods typically apply well-designed transformations to raw sentences to construct positive pairs and combine different raw sentences to construct negative pairs. Following this paradigm, unsup-SimCSE advanced the state of the art in unsupervised sentence embeddings by taking dropout as a minimal data augmentation strategy. Its training objective maximizes the similarity of positive pairs while minimizing the similarity of negative pairs. However, even different raw sentences can be highly semantically similar, so simply reducing the similarity of all negative pairwise embeddings is impractical: sentence embeddings learned by unsup-SimCSE may encode false relationships between different sentences. To alleviate this, we introduce online clustering into unsup-SimCSE and propose ClusCSE. Instead of only comparing individual sentences, ClusCSE also enforces consistency between cluster assignments, making the embeddings aware of groups of similar sentences. Evaluations on semantic textual similarity tasks demonstrate that ClusCSE outperforms unsup-SimCSE, achieving a 1.19% higher average Spearman's correlation on BERT-base.
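The contrastive objective the abstract describes can be illustrated with a minimal NumPy sketch of the InfoNCE loss used by unsup-SimCSE: each sentence is encoded twice with different dropout noise to form a positive pair, and the other sentences in the batch serve as negatives. The encoder is abstracted away here; the toy embeddings, the noise scale, and the temperature value are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def info_nce_loss(z1, z2, tau=0.05):
    """InfoNCE loss: z1[i] and z2[i] are two dropout-noised encodings
    of sentence i (the positive pair); every other row of z2 acts as
    a negative for z1[i]."""
    sim = cosine_sim(z1, z2) / tau                  # (N, N) logits
    sim = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    exp = np.exp(sim)
    # cross-entropy with the diagonal (positive pair) as the target
    probs = np.diag(exp) / exp.sum(axis=1)
    return float(-np.log(probs).mean())

# Hypothetical embeddings: positives are near-duplicates of each other
# (simulating two dropout passes), negatives are other batch rows.
rng = np.random.default_rng(0)
z1 = rng.normal(size=(4, 8))
z2 = z1 + 0.01 * rng.normal(size=(4, 8))           # simulated dropout noise
loss = info_nce_loss(z1, z2)
```

The abstract's criticism applies to the denominator: the loss pushes every off-diagonal pair apart regardless of semantic similarity, which motivates adding the cluster-assignment consistency term in ClusCSE.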