Deep learning has shown significant advantages in improving the efficiency of rolling bearing fault diagnosis. However, the stability and generalization ability of such models are often weakened by complex, variable working conditions and shifting data distributions, which leads to poor diagnostic accuracy. To address these problems, this paper proposes a domain-adaptive fault diagnosis method based on a multi-layer convolution-guided transformer (MCG-transformer). First, to handle the non-uniform distribution of information in vibration signals, a time-frequency heterogeneous patch division strategy is proposed, and a DSC module is used to extract local time-frequency features efficiently. Second, a multi-layer transformer structure is constructed in which a convolutional attention mechanism enhances the model’s ability to capture global dependencies and multi-scale fault features. Third, the classification loss and the transfer loss are jointly optimized to achieve end-to-end transfer training on unlabeled target domains, which balances training efficiency and diagnostic performance. Finally, experiments on the Case Western Reserve University (CWRU) and Jiangnan University (JNU) bearing datasets verify the effectiveness of the proposed method. The results show that the method outperforms existing mainstream models in fault diagnosis tasks under complex working conditions.
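
The joint optimization of classification loss and transfer loss can be illustrated with a minimal sketch. The abstract does not say which transfer loss is used, so the sketch assumes a linear-kernel maximum mean discrepancy (the squared distance between the mean feature vectors of the two domains). The function names and the `trade_off` weight are hypothetical and are not taken from the paper. The classification term is computed on labeled source samples, while the transfer term uses only source-domain and target-domain features.

```python
import math


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of labeled source samples."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum - row[y]
    return total / len(labels)


def mean_vector(features):
    """Per-dimension mean of a batch of feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def linear_mmd(source_feats, target_feats):
    """Assumed transfer loss: squared distance between domain feature means."""
    mu_s = mean_vector(source_feats)
    mu_t = mean_vector(target_feats)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


def joint_loss(src_logits, src_labels, src_feats, tgt_feats, trade_off=1.0):
    """Classification loss plus a weighted transfer loss (hypothetical weighting)."""
    cls = cross_entropy(src_logits, src_labels)
    transfer = linear_mmd(src_feats, tgt_feats)
    return cls + trade_off * transfer
```

In an end-to-end setup, a single scalar of this form would be minimized with respect to the shared feature extractor and the classifier. A larger `trade_off` places more weight on aligning the two domains relative to fitting the source labels.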