Abstract: Data imbalance remains a significant obstacle to the practical application of intelligent fault diagnosis in accessory gearboxes. Although data augmentation is an effective remedy, deep generative models are difficult to train when only a few samples are available. To address this issue, this paper proposes a hierarchical contextual feature fusion synthetic minority over-sampling technique (HCFF-SMOTE). First, a novel skip-connected encoder-decoder architecture is developed, in which the skip connections enhance the model's ability to learn features from limited labeled data. The encoder employs a multi-scale convolutional neural network to hierarchically extract multi-level features. The decoder integrates an HCFF mechanism that combines channel attention, spatial attention, and fusion attention to adaptively capture dependencies across these multi-level features, thereby enhancing the fine-grained feature representation. After training, the encoder maps fault samples into the deep feature space. New features are generated in this space using the synthetic minority over-sampling technique (SMOTE) and are then reconstructed by the decoder to synthesize realistic and diverse fault samples. Extensive experiments show that HCFF-SMOTE outperforms state-of-the-art methods, achieving up to 10.76% higher accuracy relative to the imbalanced dataset at a fault sample proportion of 2.5%, which demonstrates its robustness and effectiveness under extreme data imbalance.
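The encode, interpolate, decode pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `smote_latent`, the neighbor count `k`, and the linear `encode`/`decode` stand-ins (a random projection and its pseudo-inverse) are assumptions made purely for illustration, in place of the trained skip-connected encoder and HCFF decoder. Only the SMOTE step itself follows the standard rule of interpolating between a minority sample and one of its nearest neighbors.

```python
import numpy as np


def smote_latent(z, n_new, k=5, seed=0):
    """Create n_new synthetic latent vectors by SMOTE-style interpolation.

    z: (n, d) array of minority-class (fault) features, e.g. encoder outputs.
    """
    n = z.shape[0]
    if n < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = np.random.default_rng(seed)
    k = min(k, n - 1)  # a sample has at most n - 1 neighbors

    # Pairwise squared Euclidean distances between latent vectors.
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbor
    nn_idx = np.argsort(d2, axis=1)[:, :k]  # (n, k) nearest-neighbor indices

    base = rng.integers(0, n, size=n_new)   # seed sample for each new point
    pick = rng.integers(0, k, size=n_new)   # which of its k neighbors to use
    neigh = nn_idx[base, pick]
    lam = rng.random((n_new, 1))            # interpolation weight in [0, 1)
    return z[base] + lam * (z[neigh] - z[base])


# Hypothetical stand-ins for the trained encoder and decoder (assumption):
# a fixed random linear projection and its pseudo-inverse.
def encode(x, w):
    return x @ w


def decode(z, w):
    return z @ np.linalg.pinv(w)


rng = np.random.default_rng(42)
x_fault = rng.normal(size=(8, 16))   # 8 toy "fault" signals of length 16
w = rng.normal(size=(16, 4))         # toy projection to a 4-D feature space

z_fault = encode(x_fault, w)                    # map faults to feature space
z_new = smote_latent(z_fault, n_new=20, k=3)    # oversample in feature space
x_syn = decode(z_new, w)                        # reconstruct synthetic samples
print(x_syn.shape)  # (20, 16)
```

Because each synthetic feature is a convex combination of two real fault features, it stays inside the region spanned by the observed faults. Whether the reconstructed samples are realistic then depends on the decoder, which in this sketch is only a linear placeholder.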