Abstract

This paper introduces a multi-scale feature fusion deep learning method for industrial process fault diagnosis, based on spatio-temporal capsules and classifier optimization. In the feature extraction phase, a multi-scale residual convolutional network first extracts features at multiple scales. The extracted fault features are then passed to a spatio-temporal capsule network, which captures further temporal and spatial information. Once feature extraction is complete, the traditional softmax classifier is replaced with eXtreme Gradient Boosting (XGBoost), which speeds up the final diagnosis and avoids the long diagnosis times caused by complex models. The proposed network accounts for the nonlinearity, temporal dependence, and high dimensionality of the original data. The residual structure addresses the model degradation that arises as networks grow deeper, the long short-term memory (LSTM) and capsule structures minimize the loss of useful information during feature extraction, and the XGBoost algorithm provides good classification performance. This ‘offline training, online diagnosis’ scheme avoids lengthy training at diagnosis time and improves fault diagnosis efficiency. Experiments on chemical engineering processes, including the Tennessee Eastman (TE) process and an industrial coking furnace, show that the proposed method significantly improves fault diagnosis accuracy.
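As an illustration of the multi-scale residual idea mentioned in the abstract, the following is a minimal sketch and not the authors' architecture. It assumes a single one-dimensional process variable, fixed moving-average kernels in place of learned weights, and simple averaging to fuse the branches. The function names (`conv1d_same`, `multi_scale_residual_block`) and the sample data are hypothetical.

```python
def conv1d_same(signal, kernel):
    """Cross-correlate a 1-D signal with an odd-length kernel, zero-padding
    the edges so the output has the same length as the input."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(v, 0.0) for v in values]


def multi_scale_residual_block(signal, kernels):
    """Run parallel convolutions at several kernel sizes, average the
    ReLU-activated branches, and add the input back (skip connection)."""
    branches = [relu(conv1d_same(signal, k)) for k in kernels]
    fused = [sum(vals) / len(branches) for vals in zip(*branches)]
    return [x + f for x, f in zip(signal, fused)]


# Hypothetical input: one process variable sampled over 8 time steps.
signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0]
# Assumed fixed moving-average kernels at three scales (3, 5, 7 samples);
# in a trained network these weights would be learned.
kernels = [[1.0 / n] * n for n in (3, 5, 7)]
features = multi_scale_residual_block(signal, kernels)
```

Because of the skip connection, a block whose convolution branches contribute nothing (all-zero weights) passes its input through unchanged. This is the usual intuition for why residual structures help deeper networks avoid degradation.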