Surface electromyography (sEMG) is simple to acquire and responds quickly, so it is frequently used as a control source for gesture recognition in domains such as human–computer interfaces and prosthetic control. First, we propose a method that decomposes the sEMG signal into time–frequency information using the smooth wavelet packet transform (SWPT). SWPT is faster than previous methods such as the continuous wavelet transform (CWT) and the wavelet packet transform (WPT), requiring only 12% of the processing time of CWT and 66% of that of WPT. Second, to increase the recognition accuracy of hand gestures, we built a network model that combines a convolutional neural network (CNN), long short-term memory (LSTM), and a convolutional block attention module (CBAM), and that fuses in accelerometer (ACC) data. Evaluated on the public dataset Ninapro DB5, this approach achieved an average accuracy of 92.159%, significantly outperforming other similar studies.
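
To make the decomposition step concrete, the following is a minimal sketch only, not the implementation used in this work. The abstract does not specify how SWPT is constructed, so the sketch assumes an undecimated wavelet packet decomposition (no downsampling between levels) with Haar filters, periodic boundary handling, and averaging normalization. The function names `haar_split` and `undecimated_wavelet_packet` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def haar_split(x: List[float], dilation: int) -> Tuple[List[float], List[float]]:
    """Split one node into low-pass and high-pass parts without downsampling.

    Assumption: Haar filters dilated by `dilation` samples, periodic boundary,
    and a 1/2 normalization so that low[i] + high[i] == x[i].
    """
    n = len(x)
    low = [(x[i] + x[(i - dilation) % n]) / 2.0 for i in range(n)]
    high = [(x[i] - x[(i - dilation) % n]) / 2.0 for i in range(n)]
    return low, high


def undecimated_wavelet_packet(signal: List[float], levels: int) -> List[List[float]]:
    """Return 2**levels sub-bands, each with the same length as the input.

    Every node (not only the low-pass one) is split at each level, which is
    what makes this a packet decomposition. Sub-bands are returned in natural
    (splitting) order, not sorted by frequency.
    """
    nodes = [[float(v) for v in signal]]
    for level in range(levels):
        dilation = 2 ** level  # filter taps spread apart instead of downsampling
        next_nodes: List[List[float]] = []
        for node in nodes:
            low, high = haar_split(node, dilation)
            next_nodes.extend([low, high])
        nodes = next_nodes
    return nodes


if __name__ == "__main__":
    # A short hypothetical single-channel window, not real sEMG data.
    window = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
    bands = undecimated_wavelet_packet(window, levels=2)
    print(len(bands), "sub-bands of length", len(bands[0]))
```

Because no downsampling takes place, every sub-band keeps the time resolution of the input window, so the stacked sub-bands can be treated as a two-dimensional time–frequency map for a convolutional front end. With the normalization assumed here, summing all sub-bands sample by sample returns the original window.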