Computer science
Transformer
Sentence
Inference
Artificial intelligence
Machine translation
Language understanding
Natural language processing
Machine learning
Engineering
Voltage
Electrical engineering
Authors
Huadong Wang, Xin Shen, Mei Tu, Yimeng Zhuang, Zhiyuan Liu
Identifier
DOI: 10.1109/taslp.2022.3199648
Abstract
Recently, the attention mechanism has boosted the performance of many neural network models in Natural Language Processing (NLP). Among the various attention mechanisms, Multi-Head Attention (MHA) is a powerful and popular variant. MHA, an essential component of the Transformer, helps the model attend to different feature subspaces independently. Despite its success, we conjecture that the different heads of the existing MHA may not collaborate properly. To validate this assumption and further improve the performance of the Transformer, we study the collaboration problem for MHA in this paper. First, we propose the Single-Layer Collaboration (SLC) mechanism, which helps each attention head refine its attention distribution based on feedback from the other heads. Furthermore, we extend SLC to the cross-layer Multi-Head Dense Collaboration (MHDC) mechanism. MHDC helps each MHA layer learn its attention distributions while taking into account knowledge from the other MHA layers. Both SLC and MHDC are implemented as lightweight modules with very few additional parameters. When equipped with these modules, our new framework, the Collaborative TransFormer (CollFormer), significantly outperforms the vanilla Transformer on a range of NLP tasks, including machine translation, sentence semantic relatedness, natural language inference, sentence classification, and reading comprehension. In addition, we carry out extensive quantitative experiments to analyze the properties of MHDC in different settings. The experimental results validate the effectiveness and universality of both MHDC and CollFormer.
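The abstract does not give the exact formulation of SLC or MHDC, but the underlying idea of attention heads sharing information before committing to their attention distributions can be sketched. The following minimal NumPy example is a hypothetical illustration, not the authors' implementation: standard multi-head attention in which each head's pre-softmax attention logits are refined by a learned mixture over all heads' logits (the `mix` matrix, the function names, and the weight shapes are assumptions made purely for illustration).

```python
# Minimal sketch of multi-head attention with a toy cross-head "collaboration"
# step. Each head's attention logits are replaced by a learned mixture of all
# heads' logits before the softmax. This is an illustrative assumption, not
# the paper's SLC/MHDC mechanism, which is not specified in the abstract.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def collaborative_mha(x, Wq, Wk, Wv, Wo, mix, n_heads):
    """x: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model); mix: (n_heads, n_heads)."""
    seq, d_model = x.shape
    d_head = d_model // n_heads

    # Project the input and split it into heads -> (n_heads, seq, d_head).
    def split(w):
        return (x @ w).reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(Wq), split(Wk), split(Wv)

    # Per-head scaled dot-product logits -> (n_heads, seq, seq).
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)

    # Toy collaboration: each head's logits become a learned mixture of all
    # heads' logits (mix is row-stochastic in this sketch).
    logits = np.einsum('hg,gij->hij', mix, logits)

    attn = softmax(logits, axis=-1)   # (n_heads, seq, seq)
    heads = attn @ v                  # (n_heads, seq, d_head)

    # Concatenate the heads and apply the output projection.
    out = heads.transpose(1, 0, 2).reshape(seq, d_model) @ Wo
    return out

# Usage with random weights.
rng = np.random.default_rng(0)
seq, d_model, n_heads = 5, 16, 4
x = rng.standard_normal((seq, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
mix = softmax(rng.standard_normal((n_heads, n_heads)), axis=-1)
print(collaborative_mha(x, Wq, Wk, Wv, Wo, mix, n_heads).shape)  # (5, 16)
```

Setting `mix` to the identity matrix recovers vanilla multi-head attention, so the mixing matrix can be read as a lightweight, few-parameter way for heads to exchange feedback, in the spirit of the collaboration the abstract describes.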