Computer Science
Deep Learning
Bottleneck
Artificial Neural Network
Lossless Compression
Deep Neural Networks
Inference
Artificial Intelligence
Volume (thermodynamics)
Data Compression
Parallel Computing
Machine Learning
Embedded Systems
Quantum Mechanics
Physics
Authors
Sarunya Pumma, Abhinav Vishnu
Identifier
DOI: 10.1109/mlhpc54614.2021.00006
Abstract
As the architectures and capabilities of deep neural networks evolve, they become more demanding to train and use. Deep Learning Recommendation Model (DLRM), a new neural network for recommendation systems, introduces challenging requirements for deep neural network training and inference. The DLRM model is typically too large to fit in a single GPU's memory. Unlike other deep neural networks, DLRM requires model parallelism for the bottom part of the model and data parallelism for the top part when running on multiple GPUs. Because of this hybrid-parallel scheme, all-to-all communication is used to join the bottom and top parts together. We have observed that this all-to-all communication is costly and is a bottleneck in DLRM training and inference. In this paper, we propose a novel approach to reduce the communication volume by using DLRM's properties to compress the transferred data without information loss. We demonstrate the benefits of our method by training DLRM MLPerf on eight AMD Instinct™ MI100 accelerators. The experimental results show 59% and 38% improvements in the time-to-solution of DLRM MLPerf training for FP32 and mixed precision, respectively.
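To make the hybrid-parallel layout in the abstract concrete, the sketch below is a minimal single-process simulation, not the authors' implementation: it assumes a contiguous split of the global batch, FP32 embeddings, and illustrative sizes (`world_size`, `local_batch`, `tables_per_rank`, `emb_dim`), and it emulates the all-to-all with plain indexing to show which data crosses the network and why its volume becomes the bottleneck the paper compresses.

```python
# A single-process simulation (illustrative assumption, not the authors' code) of
# the hybrid-parallel exchange described in the abstract: the embedding tables
# (bottom part) are sharded across ranks (model-parallel), the top MLP runs
# data-parallel, and an all-to-all redistributes embedding vectors so each rank
# ends up with all embeddings for its own minibatch slice. In real training this
# exchange would be a collective such as torch.distributed.all_to_all_single on
# GPUs; here it is emulated with indexing so the sketch runs anywhere.
import torch

world_size = 4        # number of simulated GPUs / ranks
local_batch = 2       # samples per rank (data-parallel top MLP)
tables_per_rank = 3   # embedding tables owned by each rank (model-parallel bottom)
emb_dim = 8           # embedding vector width

# Bottom part: every rank looks up *its own* tables for the *entire global batch*.
# send[src][dst] is the chunk rank `src` must ship to rank `dst`: the embeddings
# of rank-src's tables for rank-dst's batch slice.
send = [
    [torch.randn(local_batch, tables_per_rank, emb_dim) for _dst in range(world_size)]
    for _src in range(world_size)
]

# All-to-all: rank dst receives one chunk from every src (emulated by indexing).
recv = [[send[src][dst] for src in range(world_size)] for dst in range(world_size)]

# Top part: each rank concatenates the received chunks along the table axis,
# yielding the dense interaction input of its data-parallel top MLP.
for rank in range(world_size):
    top_input = torch.cat(recv[rank], dim=1)   # (local_batch, world*tables, emb_dim)
    print(f"rank {rank}: top-MLP input shape {tuple(top_input.shape)}")

# Per-iteration all-to-all traffic per rank; this transferred embedding data is
# what the paper proposes to compress losslessly.
bytes_sent = sum(chunk.numel() for chunk in send[0]) * 4   # FP32 = 4 bytes/element
print(f"each rank sends {bytes_sent} bytes of embedding data per iteration")
```

The sketch only reproduces the data movement pattern; the paper's lossless compression of the transferred data is specific to DLRM's properties and is not modeled here.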