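Multiprocessor system on chip
Mean time between failures
Computer science
Embedded system
Failure rate
Software
Avionics
Multiprocessing
Overhead (engineering)
Reliability
Single-event upset
Static random-access memory
Reliability engineering
System on a chip
Engineering
Operating system
Computer hardware
Aerospace engineering
Software engineering
Authors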
Dimitris Agiakatsikas,Nikos Foutris,Aitzan Sari,Vasileios Vlagkoulis,Ioanna Souvatzoglou,Mihalis Psarakis,Ruiqi Ye,John Goodacre,Mikel Luján,Maria Kastriotou,Carlo Cazzaniga,Chris Frost
Identifier
DOI:10.1109/tr.2023.3312548
Abstract
The AMD UltraScale+ XCZU9EG, a multiprocessor system-on-chip (MPSoC) with integrated programmable logic (PL), is vulnerable to the effects of atmospheric radiation due to its large SRAM count. This article explores the effectiveness of the MPSoC's embedded soft-error mitigation mechanisms through accelerated atmospheric-like neutron radiation testing and dependability analysis. We test the device on a broad range of workloads, such as multithreaded software for pose estimation and weather prediction and a software/hardware codesign image classification application running on the AMD deep-learning processing unit (DPU). We found that for a one-node MPSoC system in New York City at 40 k feet (e.g., avionics), software applications demonstrate a mean time to failure (MTTF) of over 121 months, evidencing effective upset recovery. However, specific workloads, such as the DPU, displayed an MTTF of 4 months, which is attributed to the high failure rate of its PL accelerator. Yet, we show the DPU's MTTF can be extended to 87 months with no extra overhead by ignoring the failure rate of tolerable errors since these do not affect the DPU results.
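The two DPU figures in the abstract (4 months with all errors counted, 87 months with tolerable errors ignored) can be related with a short calculation. The sketch below is illustrative only and is not taken from the paper: it assumes a constant failure rate, so that MTTF = 1 / rate, and it assumes tolerable and critical failures are independent, so that their rates add. The function and variable names are hypothetical.

```python
# Illustrative sketch, not the authors' method.
# Assumption: constant failure rates (exponential model), so MTTF = 1 / rate.
# Assumption: tolerable and critical failures are independent, so rates add.

def rate_from_mttf(mttf_months: float) -> float:
    """Return the failure rate (failures per month) for a given MTTF."""
    return 1.0 / mttf_months

# DPU figures quoted in the abstract, in months.
mttf_all = 4.0        # every error counted as a failure
mttf_critical = 87.0  # tolerable errors ignored

rate_all = rate_from_mttf(mttf_all)
rate_critical = rate_from_mttf(mttf_critical)

# Under the additive-rate assumption, the remainder is the tolerable-error rate.
rate_tolerable = rate_all - rate_critical
tolerable_share = rate_tolerable / rate_all

print(f"All failures:       {rate_all:.4f} per month")
print(f"Critical failures:  {rate_critical:.4f} per month")
print(f"Tolerable failures: {rate_tolerable:.4f} per month")
print(f"Share of failures that are tolerable: {tolerable_share:.1%}")
```

Under these assumptions, roughly 95% of the observed DPU failures would fall in the tolerable class, which is consistent with the abstract's claim that ignoring them extends the MTTF substantially at no extra overhead.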