Abstract
Neuromorphic accelerators are an emerging class of artificial intelligence (AI) hardware inspired by the brain's architecture and dynamics. Unlike traditional AI accelerators--such as graphics processing units (GPUs), tensor processing units (TPUs), and application-specific integrated circuits (ASICs)--that rely on dense matrix operations and synchronous execution, neuromorphic systems use spiking neural networks (SNNs) to enable ultra-low-power, event-driven computation. These advantages have led to their growing adoption in applications ranging from edge computing and autonomous vehicles to image recognition and time-series data processing. However, as these systems approach real-world deployment, they face considerable reliability and safety challenges. One key obstacle is aggressive technology scaling, which reduces device sizes and operating voltages, making these systems increasingly susceptible to faults. These include permanent faults--such as manufacturing defects and silicon aging--as well as temporary faults arising from external particle strikes or excessive read operations. For example, emerging non-volatile memory (NVM) devices such as Phase-Change Memories (PCMs) are used as computational units for in-memory computation. These devices require high-voltage operations--often driven by on-chip charge pumps built with complementary metal-oxide-semiconductor (CMOS) technology--to read and program memory cells. Such operations can accelerate NVM wear-out, increasing the risk of stuck-SET and stuck-RESET faults. Moreover, resistance drift may compromise data integrity, and aging of the peripheral CMOS circuitry can lead to read-disturbance errors. Collectively, these issues threaten the accuracy and reliability of neuromorphic systems, particularly when faults occur in synaptic cells, producing erroneous values during inference. Failing to detect and mitigate such faults can lead to catastrophic outcomes.
For instance, in an autonomous vehicle, a single undetected synaptic fault could cause the system to misinterpret sensor data--potentially mistaking a pedestrian for a shadow--thereby triggering dangerous decisions and serious safety violations. Because neuromorphic computation is event-driven and sparse, such faults may not manifest immediately, silently altering the system's behavior in critical ways. This makes verification of neuromorphic computing systems essential--not only to confirm that individual neurons spike correctly, but also to ensure that spike propagation through the network is accurate and reliable. While neuromorphic hardware holds great promise for next-generation AI, its unique fault modes and operational characteristics demand equally innovative and rigorous approaches to functional testing and fault tolerance. To address these challenges, this dissertation presents two complementary methodologies. The first proposes a methodology for online built-in self-test (BIST) of analog in-memory accelerators for deep neural networks (DNNs) to validate correct operation with respect to their functional specifications. The DNN of interest is realized in hardware that performs in-memory computing using NVM cells as computational units. Assuming a functional fault model, we develop methods to generate pseudorandom and structured test patterns to detect hardware faults. We also develop a test-sequencing strategy that combines these different classes of tests to achieve high fault coverage. The testing methodology is applied to a broad class of DNNs trained to classify images from the MNIST, Fashion-MNIST, and CIFAR-10 datasets. The goal is to expose hardware faults that may lead to incorrect classification of images. We achieve an average fault coverage of 94% across these architectures, some of which are large and complex. The second work focuses on reliable, fault-tolerant, and safe system performance over time.
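To make the BIST idea concrete, the following is a minimal illustrative sketch--not the dissertation's actual implementation. It models an NVM crossbar as a matrix-vector multiply, injects a hypothetical stuck-at fault into one synaptic cell, and applies pseudorandom test patterns whose responses are compared against a golden functional model. The function names (`crossbar_mvm`, `bist`), the pattern count, and the detection tolerance are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def crossbar_mvm(weights, x, stuck_cell=None, stuck_value=0.0):
    """Matrix-vector multiply on an NVM crossbar; optionally inject a
    stuck-SET/stuck-RESET-style fault at one synaptic cell (assumed model)."""
    w = weights.copy()
    if stuck_cell is not None:
        w[stuck_cell] = stuck_value  # faulty cell ignores its programmed weight
    return w @ x

def bist(weights, n_patterns=64, tol=1e-6, fault=None):
    """Apply pseudorandom test patterns and compare each response against
    the golden (fault-free) functional model; True means a fault was caught."""
    for _ in range(n_patterns):
        x = rng.random(weights.shape[1])                 # pseudorandom pattern
        golden = weights @ x                             # expected response
        observed = crossbar_mvm(weights, x, **(fault or {}))
        if np.max(np.abs(observed - golden)) > tol:
            return True                                  # discrepancy detected
    return False

W = rng.standard_normal((4, 8))
print(bist(W))                                           # False: fault-free
print(bist(W, fault={"stuck_cell": (2, 5), "stuck_value": 0.0}))  # True
```

In the actual methodology, pseudorandom patterns are complemented by structured patterns and a test-sequencing strategy; this sketch shows only the core compare-against-golden-model step.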
In this work, we develop a fault detection and isolation (FDI) framework for neuromorphic systems that monitors the correctness of a neuromorphic program's execution using model-based redundancy: a software-based monitor detects discrepancies between the behavior of neurons mapped to hardware and the behavior predicted by a corresponding mathematical model. We identify properties of the spike trains generated by neurons that can be used for fault detection and build machine learning (ML) models to forecast these properties. Predictions from these models, which describe the nominal behavior of neurons, are combined with real-time observations to form the basis for FDI. Experiments using CARLsim, a high-fidelity SNN simulator, show that the proposed approach achieves high fault coverage using models that operate with low computational overhead in real time. This dissertation advances the field of neuromorphic system verification by introducing novel methodologies for both functional self-testing and runtime fault detection. The proposed techniques provide scalable, low-overhead solutions for validating functional correctness, detecting hardware-induced faults, and enhancing the overall reliability and fault resilience of neuromorphic computing platforms. By addressing critical challenges in fault modeling, detection strategies, and online monitoring, this work establishes a strong foundation for the safe, robust, and dependable deployment of neuromorphic systems in real-world, safety-critical environments.
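The model-based-redundancy loop described above can be sketched as follows. This is an assumed, simplified illustration rather than the dissertation's implementation: the monitored spike-train property is the firing rate, the "ML forecaster" is stood in for by a moving average, and the class name (`NeuronMonitor`), window length, and deviation threshold are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def firing_rate(spike_times, window):
    """Spike-train property used for monitoring: mean firing rate (Hz)."""
    return len(spike_times) / window

class NeuronMonitor:
    """Model-based redundancy: compare the observed firing rate of a
    hardware-mapped neuron against a forecast of its nominal rate."""
    def __init__(self, threshold=0.3):
        self.threshold = threshold   # relative deviation that triggers FDI
        self.history = []            # nominal rates observed so far

    def forecast(self):
        # Moving-average forecaster standing in for the trained ML model
        return float(np.mean(self.history[-5:]))

    def step(self, spike_times, window=1.0):
        rate = firing_rate(spike_times, window)
        if len(self.history) >= 5:
            predicted = self.forecast()
            if abs(rate - predicted) > self.threshold * max(predicted, 1e-9):
                return True          # discrepancy -> fault flagged
        self.history.append(rate)    # nominal observation updates the model
        return False

monitor = NeuronMonitor()
for _ in range(10):                  # nominal behavior: 20 spikes per window
    spikes = np.sort(rng.uniform(0.0, 1.0, size=20))
    monitor.step(spikes)
faulty = np.sort(rng.uniform(0.0, 1.0, size=2))  # stuck-RESET-like silence
print(monitor.step(faulty))          # True: observed rate collapsed
```

The real framework forecasts richer spike-train properties and runs alongside the hardware in real time; the sketch captures only the observe-predict-compare structure that underlies FDI.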