Mispronunciation Detection and Diagnosis (MDD) is a key component of Computer-Assisted Pronunciation Training (CAPT) systems. Mainstream MDD systems are built as DNN-HMM-based automatic speech recognition (ASR) systems, which require a large amount of labeled data for training. In this paper, the self-supervised pre-trained model wav2vec2.0 is applied to the MDD task. Self-supervised pre-training learns general representations from a large amount of unlabeled data, so that only a small amount of labeled data is needed when fine-tuning on downstream tasks. To exploit the prior text information, audio features are combined with text features through an attention mechanism, and information from both is used in the decoding process. Experiments on the publicly available L2-Arctic and TIMIT datasets yield satisfactory results.