Computer science
Multiplication (music)
Single-precision floating-point format
Floating point
Low latency (capital markets)
Throughput
Deep learning
Parallel computing
Latency (audio)
Floating-point unit
Multiplication algorithm
Multiplier (economics)
Overhead (engineering)
Computer hardware
Computer engineering
Algorithm
Arithmetic
Artificial intelligence
Mathematics
Computer network
Telecommunications
Combinatorics
Binary number
Economy
Wireless
Macroeconomics
Operating system
Authors
Jing Zhang, Libo Huang, Hongbing Tan, Ling Yang, Zhong Zheng, Qianming Yang
Identifier
DOI: 10.1145/3583781.3590269
Abstract
Low-precision formats have been proposed and applied to deep learning algorithms to speed up training and inference. This paper proposes a novel multiple-precision multiplication unit (MU) for deep learning. The proposed MU supports four floating-point (FP) precisions, FP8-E4M3, FP8-E5M2, FP16, and FP32, as well as 8-bit fixed-point (FIX) numbers. The MU can execute four parallel FP8 and eight parallel FIX8 multiplications simultaneously in one cycle, four parallel FP16 multiplications fully pipelined with a latency of one cycle, or one FP32 multiplication with a latency of one cycle. The simultaneous execution of FIX8 and FP8 meets the requirements of specific deep learning algorithms. Thanks to the low-precision-combination (LPC) and vectorization design method, multiplication at any precision achieves 100% utilization of the multiplier resources, and the MU can adopt a lower clock delay to achieve better performance across all data types. Compared with existing multiple-precision units designed for deep learning, this MU supports more types of low-precision formats with lower area overhead and exhibits at least 8× higher throughput at FIX8.
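The abstract names the supported formats but does not restate their encodings. As a minimal sketch, the following assumes the commonly used bit layouts and biases for these formats (FP8-E4M3: 1/4/3 bits with bias 7; FP8-E5M2: 1/5/2 with bias 15; FP16: 1/5/10 with bias 15; FP32: 1/8/23 with bias 127) and ignores NaN/Inf handling; these details are conventions, not taken from the paper.

```python
# Sketch of the bit layouts commonly associated with the precisions the MU
# supports. Field widths and biases follow the usual conventions and are
# assumptions here; special values (NaN/Inf) are ignored for brevity.

FORMATS = {
    #             exp bits, mantissa bits, exponent bias
    "FP8-E4M3": (4, 3, 7),
    "FP8-E5M2": (5, 2, 15),
    "FP16":     (5, 10, 15),
    "FP32":     (8, 23, 127),
}

def decode(bits: int, fmt: str) -> float:
    """Decode an unsigned integer holding the raw bit pattern of `fmt`."""
    e_bits, m_bits, bias = FORMATS[fmt]
    sign = -1.0 if (bits >> (e_bits + m_bits)) & 1 else 1.0
    exp = (bits >> m_bits) & ((1 << e_bits) - 1)
    man = bits & ((1 << m_bits) - 1)
    if exp == 0:                       # subnormal: no implicit leading 1
        return sign * man * 2.0 ** (1 - bias - m_bits)
    return sign * (1 + man / (1 << m_bits)) * 2.0 ** (exp - bias)

if __name__ == "__main__":
    # 0x3C in E4M3: sign=0, exp=0b0111 (= bias), mantissa=0b100 -> 1.5
    print(decode(0x3C, "FP8-E4M3"))    # 1.5
    # 0x3E00 in FP16 also encodes 1.5 (exp=15=bias, mantissa=0b1000000000)
    print(decode(0x3E00, "FP16"))      # 1.5
```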
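The abstract credits the LPC and vectorization method with 100% multiplier utilization at every precision but does not detail the circuit. As a rough illustration of why one set of narrow sub-multipliers can serve several precisions, the sketch below shows the standard decomposition of a 16×16 integer product into four 8×8 partial products; how the MU actually maps FP mantissas onto its sub-multipliers is not specified here, so this is a generic identity, not the authors' design.

```python
# Generic illustration: a wide multiplication decomposes into narrow partial
# products, so the same 8x8 sub-multipliers can serve either one 16x16 product
# or several independent 8x8 products. This is not the paper's circuit.

def mul8(a: int, b: int) -> int:
    """Stand-in for one 8x8 hardware sub-multiplier (operands < 2**8)."""
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b

def mul16_from_8x8(a: int, b: int) -> int:
    """Compose a 16x16 product from four 8x8 partial products."""
    a_hi, a_lo = a >> 8, a & 0xFF
    b_hi, b_lo = b >> 8, b & 0xFF
    return ((mul8(a_hi, b_hi) << 16)
            + ((mul8(a_hi, b_lo) + mul8(a_lo, b_hi)) << 8)
            + mul8(a_lo, b_lo))

if __name__ == "__main__":
    import random
    for _ in range(1000):
        a, b = random.randrange(1 << 16), random.randrange(1 << 16)
        assert mul16_from_8x8(a, b) == a * b
    print("16x16 product reconstructed from 8x8 sub-products")
```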