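Supercomputer
Titan (rocket family)
Computer science
Parallel computing
Reliability (semiconductor)
Reliability engineering
Astrobiology
Physics
Engineering
Quantum mechanics
Power (physics)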
Authors
George Ostrouchov, Don Maxwell, Rizwan A. Ashraf, Christian Engelmann, Mallikarjun Shankar, James H. Rogers
Identifier
DOI: 10.1109/sc41405.2020.00045
Abstract
The Cray XK7 Titan was the top supercomputer system in the world for a long time and remained critically important throughout its nearly seven-year life. It was an interesting machine from a reliability viewpoint as most of its power came from 18,688 GPUs whose operation was forced to execute three rework cycles, two on the GPU mechanical assembly and one on the GPU circuit boards. We write about the last rework cycle and a reliability analysis of over 100,000 years of GPU lifetimes during Titan’s 6-year-long productive period. Using time between failures analysis and statistical survival analysis techniques, we find that GPU reliability is dependent on heat dissipation to an extent that strongly correlates with detailed nuances of the cooling architecture and job scheduling. We describe the history, data collection, cleaning, and analysis and give recommendations for future supercomputing systems. We make the data and our analysis codes publicly available.
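The abstract names statistical survival analysis without showing what such an estimate looks like. As a rough illustration only, the sketch below computes a Kaplan-Meier survival curve in plain Python. It is not the authors' published analysis code; the function name `kaplan_meier` and the six lifetimes are hypothetical, made-up values chosen to keep the arithmetic easy to follow.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct failure time.

    durations: time each unit spent in service (any consistent unit)
    observed:  True if the unit failed at that time, False if it was
               censored (removed or still working when observation ended)
    """
    failures = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # failures plus censorings leaving the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical lifetimes in years (illustrative only, not Titan data).
durations = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0]
observed = [True, True, False, True, False, False]

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"t={t:.1f}  S(t)={s:.4f}")
```

Censored units (here, those marked `False`) still count toward the number at risk until they leave, which is why survival methods suit a fleet where many units are replaced or retired before failing.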