Rounding
Computer science
Scalability
Latency (audio)
Deep learning
Architecture
Efficient energy use
Energy consumption
Normalization (sociology)
Computer architecture
Artificial intelligence
Engineering
Electrical engineering
Operating system
Database
Art
Sociology
Visual arts
Telecommunications
Anthropology
Authors
Jing Zhang, Libo Huang, Hongbing Tan, Zheng Zhong, Hui Guo
Identifiers
DOI: 10.1145/3583781.3590318
Abstract
The BFloat16 (BF16) format has recently driven the development of deep learning due to its higher energy efficiency and lower memory consumption than the traditional format. This paper presents a scalable BF16 dot-product (DoP) architecture for high-performance deep-learning computing. A novel 4-term DoP unit is proposed as the fundamental module of the architecture, performing a 4-term DoP operation in three cycles. DoP units with more terms are constructed by extending the fundamental unit: early exponent comparison is performed to hide latency, and intermediate normalization and rounding are omitted to improve accuracy and further reduce latency. Compared with the discrete design, the proposed architecture reduces latency by 22.8% for the 4-term DoP, and a larger proportion of latency is saved as the size of the DoP operation increases. Compared with existing designs for BF16, the proposed architecture at 64 terms exhibits better normalized energy efficiency and higher throughput, with at least 1.88× and 20.3× improvement, respectively.
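To make the accuracy argument in the abstract concrete, the sketch below simulates, in Python, the numerical difference between a "discrete" BF16 dot product (every multiply and add is rounded back to BF16) and a fused one that accumulates in wider precision and rounds only once at the end, as the proposed unit does. This is a software illustration of the rounding behavior only, not the paper's hardware datapath; the function names (`to_bf16`, `dop4_fused`, `dop4_discrete`) and the sample operands are hypothetical.

```python
import numpy as np

def to_bf16(x) -> np.float32:
    """Round a float32 value to the nearest BFloat16 (round-to-nearest-even),
    keeping the result stored in a float32 container."""
    bits = np.float32(x).view(np.uint32)
    lsb = (bits >> np.uint32(16)) & np.uint32(1)            # LSB of the surviving 7-bit mantissa
    rounded = (bits + np.uint32(0x7FFF) + lsb) & np.uint32(0xFFFF0000)
    return rounded.view(np.float32)

def dop4_fused(a, b) -> np.float32:
    """4-term DoP with wide internal accumulation and a single final
    normalization/rounding step, mirroring the fused unit's idea."""
    acc = np.float32(0.0)
    for ai, bi in zip(a, b):
        acc += np.float32(ai) * np.float32(bi)               # products/sums kept in float32
    return to_bf16(acc)                                      # round once at the end

def dop4_discrete(a, b) -> np.float32:
    """Baseline discrete design: every intermediate product and sum is
    rounded back to BF16 before the next operation."""
    acc = to_bf16(0.0)
    for ai, bi in zip(a, b):
        acc = to_bf16(acc + to_bf16(np.float32(ai) * np.float32(bi)))
    return acc

# Hypothetical operands, quantized to BF16 as the unit's inputs would be.
a = [to_bf16(v) for v in (1.001, -0.998, 3.7, 0.0003)]
b = [to_bf16(v) for v in (2.5, 2.5, -0.67, 1.0)]
print("fused:   ", dop4_fused(a, b))
print("discrete:", dop4_discrete(a, b))
```

With inputs like these, the two results can differ in the last mantissa bits because the discrete path loses information at every intermediate rounding, which is the accuracy benefit the abstract attributes to omitting intermediate normalization and rounding.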