Keywords
Computer science, Dataflow, Bottleneck, Domain (mathematical analysis), Parallel computing, Computer hardware, Logarithm, Edge device, Artificial neural network, Floating point, Energy (signal processing), Operand, Computer engineering, Algorithm, Embedded system, Artificial intelligence, Mathematics, Operating system, Cloud computing, Mathematical analysis, Statistics
Authors
Yang Wang, Dazheng Deng, Leibo Liu, Shaojun Wei, Shouyi Yin
Source
Journal: IEEE Transactions on Circuits and Systems I: Regular Papers
Publisher: Institute of Electrical and Electronics Engineers
Date: 2022-06-22
Volume/Issue: 69 (10): 4042-4055
Citations: 12
Identifier
DOI: 10.1109/tcsi.2022.3184115
Abstract
Training deep neural networks (DNNs) on edge devices improves model adaptivity for unfamiliar datasets while avoiding privacy disclosure and heavy communication cost. Nevertheless, besides the feed-forward (FF) pass used in inference, DNN training also requires back-propagation (BP) and weight-gradient (WG) computation, which introduces power-consuming floating-point arithmetic, hardware underutilization, and an energy bottleneck from excessive memory access. This paper proposes a DNN training processor named PL-NPU that addresses these challenges with three innovations. First, a posit-based logarithm-domain processing element (PE) adapts to varying training data requirements with a low bit-width format and reduces energy by transforming complicated arithmetic into simple logarithm-domain operations. Second, a reconfigurable inter-intra-channel-reuse dataflow dynamically adjusts the PE mapping with a regrouping omega network to improve operand reuse for higher hardware utilization. Third, a pointed-stake-shaped codec unit adaptively compresses small values into a variable-length data format while compressing large values into a fixed-length 8b posit format, reducing the memory access that dominates the training energy bottleneck. Simulated in 28nm CMOS technology, the proposed PL-NPU achieves a maximum frequency of 1040MHz at 343mW and 5.28mm². The peak energy efficiency is 3.87TFLOPS/W at 0.6V and 60MHz. Compared with the state-of-the-art training processor, PL-NPU reaches 3.75× higher energy efficiency and offers a 1.68× speedup when training ResNet18.
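The first innovation rests on a standard log-domain identity: the product of two operands equals the sum of their log-magnitudes (with signs tracked separately), so a costly multiplier can be replaced by an adder. The sketch below illustrates only that general principle; the plain log2 encoding and the function names are illustrative assumptions, not the paper's actual posit encoding or PE microarchitecture.

```python
import math

def to_log_domain(x):
    # Split a value into (sign, log2 of magnitude). Zero is represented
    # with sign 0 and a -inf exponent. Illustrative encoding only; the
    # PL-NPU paper uses a posit-based format, not raw floats.
    if x == 0.0:
        return (0, float("-inf"))
    return (1 if x > 0 else -1, math.log2(abs(x)))

def log_mul(a, b):
    # In the log domain, multiplication collapses to a sign product
    # plus an exponent addition -- the key cost saving of a log-domain PE.
    sa, la = a
    sb, lb = b
    return (sa * sb, la + lb)

def from_log_domain(v):
    sign, l = v
    return 0.0 if sign == 0 else sign * (2.0 ** l)

def log_domain_dot(xs, ws):
    # Tiny dot product: each multiply is a log-domain add, while the
    # accumulation still happens in the linear domain, as is typical
    # for log-domain multiply-accumulate designs.
    acc = 0.0
    for x, w in zip(xs, ws):
        acc += from_log_domain(log_mul(to_log_domain(x), to_log_domain(w)))
    return acc

print(log_domain_dot([0.5, -1.25, 3.0], [2.0, 0.8, -0.5]))  # -> -1.5
```

In real hardware the exponent addition is integer arithmetic on the encoded representation and the linear-domain accumulation is a dedicated datapath; Python floats stand in for both here.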