Computer Science
Natural Language Processing
Artificial Intelligence
Vocabulary
Generative Grammar
Discriminative
Leverage (statistics)
Language Model
Syntax
Linguistics
Philosophy
Authors
Yile Wang, Yue Zhang, Peng Li, Yang Liu
Identifier
DOI: 10.1109/taslp.2023.3331096
Abstract
Pre-training serves as a foundation of recent NLP models, where language modeling tasks are performed over large text corpora. Typical models like BERT and GPT take the corpus as a whole and treat each word equally during language modeling. However, recent works show that the frequency bias naturally present in raw corpora may limit the power of the language model. In this paper, we propose a multi-stage training strategy that gradually increases the training vocabulary by modifying the training data. Specifically, we leverage the syntactic structure as a bridge for infrequent words, replacing them with their corresponding syntactic labels, and then recover their original lexical surface for further training. Such a strategy results in an easy-to-hard curriculum learning process, in which the model learns the most common words and some basic syntactic concepts before recognizing a large number of uncommon words through their specific usages and the previously learned category knowledge. Experimental results show that this method can improve the performance of both discriminative and generative pre-trained language models on benchmarks and various downstream tasks.
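To make the data-side idea in the abstract concrete, here is a minimal sketch (not the authors' implementation) of building such an easy-to-hard curriculum. It assumes a hypothetical function build_curriculum_stages and input sentences that are already annotated with syntactic labels; early stages keep only the most frequent words and replace the rest with their labels, while the final stage restores the full lexical surface.

from collections import Counter
from typing import List, Tuple

TaggedSentence = List[Tuple[str, str]]  # (word, syntactic label) pairs

def build_curriculum_stages(
    tagged_sentences: List[TaggedSentence],
    stage_vocab_sizes: List[int],
) -> List[List[List[str]]]:
    """For each stage, keep only the top-k most frequent words and replace
    all other words with their syntactic labels, producing easy-to-hard
    training corpora; the final stage restores the original words."""
    # Count word frequencies over the whole corpus and rank them.
    freq = Counter(w for sent in tagged_sentences for w, _ in sent)
    ranked = [w for w, _ in freq.most_common()]

    stages = []
    for k in stage_vocab_sizes:
        allowed = set(ranked[:k])  # vocabulary kept at this stage
        stage_corpus = [
            [w if w in allowed else f"[{label}]" for w, label in sent]
            for sent in tagged_sentences
        ]
        stages.append(stage_corpus)

    # Final stage: recover the original lexical surface for further training.
    stages.append([[w for w, _ in sent] for sent in tagged_sentences])
    return stages

if __name__ == "__main__":
    corpus = [
        [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB")],
        [("the", "DET"), ("ocelot", "NOUN"), ("pounced", "VERB")],
    ]
    for i, stage in enumerate(build_curriculum_stages(corpus, [2])):
        print(f"stage {i}:", stage)

In this toy run, the first stage yields sentences such as ["the", "[NOUN]", "[VERB]"] for infrequent words like "ocelot", and the last stage returns the unmodified sentences, mirroring the gradual vocabulary growth described in the abstract.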