Computer science
Security token
Transformer
Machine translation
Artificial intelligence
Mechanism (biology)
Algorithm
Distortion (music)
Voltage
Epistemology
Bandwidth (computing)
Amplifier
Philosophy
Physics
Quantum mechanics
Computer security
Computer network
Authors
Linqing Liu,Xiaolong Xu
Identifier
DOI:10.1016/j.knosys.2023.110784
Abstract
The self-attention mechanism is a feature-processing mechanism for structured data in deep learning models. It is widely used in transformer-based models and has demonstrated superior performance in fields such as machine translation, speech recognition, text-to-text conversion, and computer vision. Although the self-attention mechanism operates mainly on the surface structure of structured data, it also involves attention between basic data units and the self-attention of those units in the deeper structure of the data. In this paper, we investigate the forward attention flow and the backward gradient flow in the self-attention module of the transformer model, based on the sequence-to-sequence data structure used in machine translation tasks. We find that this combination produces a "gradient distortion" phenomenon at the level of individual tokens, the basic data units. We regard this phenomenon as a defect and propose a series of theoretical solutions to address it. Through experiments, we select the most robust solution as the Unevenness-Reduced Self-Attention (URSA) module, which replaces the original self-attention module. The experimental results demonstrate that the "gradient distortion" phenomenon exists both theoretically and numerically, and that the URSA module enables the self-attention mechanism to achieve consistent, stable, and effective optimization across different models, tasks, corpora, and evaluation metrics. The URSA module is both simple and highly portable.
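For context, the sketch below shows the standard scaled dot-product self-attention whose forward attention flow and backward gradient flow the abstract refers to, together with one simple way to inspect per-token gradient norms after a backward pass. It does not reproduce the URSA module; the tensor sizes, the random projection weights, the surrogate sum loss, and the gradient-norm inspection are illustrative assumptions, not the paper's exact setup or diagnostic.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sequence: batch of 1, 5 tokens, model dimension 8 (hypothetical sizes).
B, T, D = 1, 5, 8
x = torch.randn(B, T, D, requires_grad=True)

# Hypothetical projection weights for queries, keys, and values.
Wq, Wk, Wv = (torch.randn(D, D) for _ in range(3))

# Standard scaled dot-product self-attention (forward attention flow).
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.transpose(-2, -1) / (D ** 0.5)   # (B, T, T) token-to-token scores
attn = F.softmax(scores, dim=-1)                # each row sums to 1 over key tokens
attn.retain_grad()                              # keep gradients on this non-leaf tensor
out = attn @ V                                  # (B, T, D) attended token representations

# A scalar surrogate loss so the backward gradient flow can be run.
loss = out.sum()
loss.backward()

# Per-token gradient norms: unevenness across tokens is the kind of
# token-level imbalance the paper calls "gradient distortion".
print("input grad norm per token:     ", x.grad.norm(dim=-1))
print("attention grad norm per query: ", attn.grad.norm(dim=-1))
```

Comparing these per-token norms across positions is one rough way to visualize how unevenly gradients are distributed over tokens in an ordinary self-attention module; the paper's URSA module is proposed to reduce this unevenness.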