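Computer science
Memory hierarchy
Parallel computing
Matrix multiplication
Memory bandwidth
Benchmarking
CUDA
Latency (audio)
Cache
Quantum mechanics
Telecommunications
Quantum
Physics
Business
Marketing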
Authors
Matt Martineau, Patrick Atkinson, Simon McIntosh-Smith
Identifier
DOI: 10.1007/978-3-030-10549-5_35
Abstract
The V100 GPU is the newest server-grade GPU produced by NVIDIA and introduces a number of new hardware and API features. This paper details the results of benchmarking the V100 GPU and demonstrates that it is a significant generational improvement, with higher memory bandwidth, higher cache bandwidth, and lower latency. A major new addition is the Tensor core units, which have been marketed as deep learning acceleration features and which enable the computation of a $4\times 4\times 4$ half-precision matrix-multiply-accumulate operation in a single clock cycle. This paper confirms that the Tensor cores offer considerable performance gains for half-precision general matrix multiplication; however, programming them requires fine control of the memory hierarchy that is typically unnecessary for other applications.
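To make the $4\times 4\times 4$ matrix-multiply-accumulate operation concrete, the following is a minimal CPU-side sketch in Python of the arithmetic it describes, $D = AB + C$ on 4×4 matrices (64 multiply-adds). It is not the CUDA API and not the authors' benchmark code. The helper names `to_half` and `mma_4x4x4` are hypothetical, and the choice to round only the inputs `A` and `B` to half precision while accumulating in a wider type is an assumption made for illustration.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value.

    The result is returned as an ordinary Python float. The "e" format
    code of the struct module is the 16-bit binary float.
    """
    return struct.unpack("e", struct.pack("e", x))[0]


def mma_4x4x4(a, b, c):
    """Return D = A * B + C for 4x4 matrices given as lists of rows.

    Assumption for this sketch: A and B are rounded to half precision,
    while products and sums are kept in Python floats, standing in for
    a wider accumulator.
    """
    a16 = [[to_half(v) for v in row] for row in a]
    b16 = [[to_half(v) for v in row] for row in b]
    return [
        [c[i][j] + sum(a16[i][k] * b16[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


# Example inputs: A has entries 0.1 * (i + j), B is the identity, C is all ones.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[0.1 * (i + j) for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]

d = mma_4x4x4(a, identity, c)
```

Because `B` is the identity here, each entry of `d` is the half-rounded entry of `a` plus one, which makes the rounding applied to the inputs directly visible. A larger matrix multiplication would be tiled into many such 4×4 blocks, and deciding where those tiles reside is the kind of memory-hierarchy control the abstract refers to.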