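Dot product
Computer science
Overhead (engineering)
Parallel computing
Product (mathematics)
Architecture
Path (computing)
Dual (grammatical number)
Computer hardware
Mathematics
Operating system
Geometry
Literature
Art
Visual arts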
Authors
Hongbing Tan, Libo Huang, Zhong Zheng, Hui Guo, Qianmin Yang, Li Shen, Gang Chen, Liquan Xiao, Nong Xiao
Identifier
DOI:10.1109/tcad.2023.3316994
Abstract
The dot-product $\sum_{i=1}^{N} A_{i}\times B_{i}$ is one of the most frequently used operations in a wide variety of high-performance computing (HPC) and artificial intelligence (AI) applications. However, for large-scale algorithms such as GEMM and FFT, independent additions are needed to accumulate the results of length-limited dot-products into the final result, which increases latency and overhead. Hence, we propose a dot-product-dual-accumulate (DPDAC) architecture capable of performing $\left(\sum_{i=1}^{N=1,2,4} A_{i}\times B_{i} + \sum_{j=1}^{M=1,2} C_{j}\right)$ on a wide range of formats. The proposed architecture supports both single-path and dual-path execution. The single path is designed for performing DP FMA or DPDAC of lower formats, while the dual path supports parallel operations for single-precision (SP) addition and a 2-term SP or TF32 dot-product, or a 4-term HP or BF16 dot-product. Moreover, the proposed architecture supports numerical precision conversion, allowing numbers to be converted to higher or lower formats. The proposed DPDAC has been demonstrated to significantly reduce overhead in comparison to discrete designs that use multiple single-mode FP units to achieve the same functionality. Furthermore, compared with state-of-the-art multiple-precision designs, the proposed architecture has been shown to support a wider range of formats and a greater variety of operations at lower cost.
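The operation named in the abstract can be illustrated with a small functional model. The following Python sketch is not the authors' hardware design: the function names `dpdac` and `long_dot` are hypothetical, Python floats are double precision throughout, and no format-specific rounding, dual-path scheduling, or precision conversion is modeled. It only shows the arithmetic shape $\sum_{i=1}^{N} A_{i}\times B_{i} + \sum_{j=1}^{M} C_{j}$ with $N \in \{1,2,4\}$ and $M \in \{1,2\}$, and how feeding the running total back in as an accumulate term could replace the separate additions mentioned for long dot-products.

```python
from typing import Sequence


def dpdac(a: Sequence[float], b: Sequence[float], c: Sequence[float]) -> float:
    """Hypothetical functional model of one DPDAC operation.

    Returns sum(a[i] * b[i]) + sum(c[j]), with N = len(a) in {1, 2, 4}
    and M = len(c) in {1, 2}, as stated in the abstract.
    """
    if len(a) != len(b) or len(a) not in (1, 2, 4):
        raise ValueError("a and b must have equal length N in {1, 2, 4}")
    if len(c) not in (1, 2):
        raise ValueError("c must have length M in {1, 2}")
    return sum(x * y for x, y in zip(a, b)) + sum(c)


def long_dot(a: Sequence[float], b: Sequence[float], n: int = 4) -> float:
    """Assumed usage pattern: chain length-limited dot-products.

    The running total is passed back as the accumulate operand, so no
    standalone addition is needed between chunks.
    """
    if len(a) != len(b) or len(a) % n != 0:
        raise ValueError("inputs must have equal length divisible by n")
    acc = 0.0
    for k in range(0, len(a), n):
        acc = dpdac(a[k:k + n], b[k:k + n], [acc])
    return acc


if __name__ == "__main__":
    # 4-term dot-product plus two accumulate operands
    print(dpdac([1, 2, 3, 4], [5, 6, 7, 8], [10, 20]))
    # 8-term dot-product built from two chained 4-term operations
    print(long_dot([1.0] * 8, [2.0] * 8))
```

In this sketch the second accumulate slot is left unused by `long_dot`; a model with $M = 2$ could, for example, merge a partial sum from another chunk in the same operation.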