Abstract The fault signals of rotating machinery exhibit complex characteristics such as nonlinearity, high noise levels, and dynamic variation, which pose significant challenges to fault diagnosis. Traditional multimodal fusion techniques often fail to fully capture the distinctive features of different data modalities, resulting in low diagnostic accuracy and poor robustness under complex working conditions. This paper proposes FuseCT, a multimodal feature fusion network that integrates one-dimensional vibration signals with two-dimensional time-frequency images obtained via the Generalized Linear Chirplet Transform (GLCT) and processes them in a two-branch structure. The network extracts vibration-signal features through a CNN-BiLSTM-1DCBAM module, extracts GLCT time-frequency features with an optimized SpectraFocus module, and finally fuses the multimodal features via a self-attention mechanism. The method is experimentally validated on the gearbox dataset from Southeast University and the extra-large bearing dataset from Nanjing Tech University. The results demonstrate that FuseCT achieves higher accuracy and better generalization in rotating machinery fault diagnosis than traditional single-modal and multimodal methods.
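The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of the two-branch idea it describes: one branch for the 1-D vibration signal (CNN followed by a BiLSTM, with the 1DCBAM attention block omitted), one branch for the GLCT time-frequency image (a plain CNN standing in for the paper's SpectraFocus module), and multi-head self-attention to fuse the two branch embeddings. All layer sizes, the class count, and the stand-in modules are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VibrationBranch(nn.Module):
    """1-D branch: CNN feature extractor followed by a BiLSTM.
    The paper's 1DCBAM attention block is omitted; sizes are assumptions."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8, padding=28), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.bilstm = nn.LSTM(32, out_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, x):            # x: (B, 1, L)
        f = self.cnn(x)              # (B, 32, L')
        f = f.permute(0, 2, 1)       # (B, L', 32)
        out, _ = self.bilstm(f)      # (B, L', out_dim)
        return out.mean(dim=1)       # temporal average -> (B, out_dim)

class TimeFrequencyBranch(nn.Module):
    """2-D branch for GLCT time-frequency images; a plain CNN stands in for
    the SpectraFocus module, whose internals are not given in the abstract."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.proj = nn.Linear(32 * 4 * 4, out_dim)

    def forward(self, x):            # x: (B, 3, H, W)
        f = self.cnn(x).flatten(1)
        return self.proj(f)          # (B, out_dim)

class FuseCTSketch(nn.Module):
    """Two-branch fusion: the branch embeddings form a 2-token sequence that is
    mixed by multi-head self-attention before classification."""
    def __init__(self, num_classes=5, dim=128):
        super().__init__()
        self.vib = VibrationBranch(dim)
        self.tf = TimeFrequencyBranch(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, sig, img):
        tokens = torch.stack([self.vib(sig), self.tf(img)], dim=1)  # (B, 2, dim)
        fused, _ = self.attn(tokens, tokens, tokens)                # self-attention fusion
        return self.head(fused.mean(dim=1))                        # (B, num_classes)

if __name__ == "__main__":
    # Hypothetical shapes: a 2048-sample vibration segment and a 64x64 GLCT image.
    model = FuseCTSketch(num_classes=5)
    logits = model(torch.randn(2, 1, 2048), torch.randn(2, 3, 64, 64))
    print(logits.shape)  # torch.Size([2, 5])
```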