Computation
Transpose
Computer Science
Pipeline (software)
Transformer
Parallel Computing
Computational Science
Computer Hardware
Algorithm
Programming Language
Electrical Engineering
Voltage
Physics
Engineering
Eigenvector
Quantum Mechanics
Authors
Fengbin Tu, Zihan Wu, Yiqi Wang, Ling Liang, Liu Liu, Yufei Ding, Leibo Liu, Shaojun Wei, Yuan Xie, Shouyi Yin
Identifier
DOI:10.1109/isscc42614.2022.9731645
Abstract
Transformer models have achieved state-of-the-art results in many fields, such as natural language processing and computer vision, but their large number of matrix multiplications (MM) results in substantial data movement and computation, causing high latency and energy consumption. In recent years, computing-in-memory (CIM) has been demonstrated as an efficient MM architecture, but a Transformer's attention mechanism raises new challenges for CIM in both memory access and computation (Fig. 29.3.1): 1a) Unlike conventional static MM with pre-trained weights, the attention layers introduce dynamic MM ($QK^T$, $A'V$), whose weights and inputs are both generated at runtime, leading to redundant off-chip memory access for intermediate data. 1b) A CIM pipeline architecture can mitigate the above problem, but introduces a new challenge: since the K generation direction does not match the conventional CIM write direction, the $QK^T$ pipeline needs a large transpose buffer with extra overhead. 2) Compared with fully connected (FC) layers, attention layers dominate a Transformer's computation and require >8b precision to maintain accuracy, so previous analog CIMs [1]–[2] with ≤8b precision support cannot be used directly. Reducing the amount of computation in the attention layers is critical for improving efficiency.
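The static-versus-dynamic MM distinction the abstract draws can be made concrete with a minimal single-head attention sketch. This example is not from the paper; the names (X, Wq, Wk, Wv) and the shapes (sequence length 128, model dimension 64) are hypothetical, chosen only to show that the FC-style projections use fixed pre-trained weights, while $QK^T$ and $A'V$ multiply two operands that are both produced at runtime.

```python
# Minimal sketch (assumed shapes, not from the paper) of why attention's
# matrix multiplications are "dynamic" from a CIM perspective.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Static MM: Wq/Wk/Wv are pre-trained weights, so they could be written
    # into CIM macros once before inference.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Dynamic MM #1: QK^T. K is generated at runtime, which is why a naive
    # QK^T CIM pipeline needs a transpose buffer before writing K.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    A = softmax(S)          # attention probabilities (A' in the abstract)
    # Dynamic MM #2: A'V. Both operands are runtime intermediates.
    return A @ V

# Hypothetical input: 128 tokens, 64-dimensional model.
X = np.random.randn(128, 64).astype(np.float32)
Wq, Wk, Wv = (np.random.randn(64, 64).astype(np.float32) for _ in range(3))
out = attention(X, Wq, Wk, Wv)
print(out.shape)  # (128, 64)
```

In this sketch the two runtime-generated products ($QK^T$ and $A'V$) grow with sequence length and, without a pipeline, their intermediates would have to round-trip through off-chip memory, which is the redundant data movement the abstract targets.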