ABSTRACT Emotion recognition for coal mine workers who speak different dialects is challenging because acoustic characteristics vary considerably across dialects. Traditional methods do not separate dialect-specific features from emotional features, so they generalize poorly. To address this, this paper proposes a recognition framework based on speech feature disentanglement, which improves robustness by decoupling the emotional features shared across dialects from dialect-specific features. Specifically, the speech signal is transformed into high-resolution time-frequency feature maps, and a Siamese Neural Network (SNN) disentangles them into public feature maps shared across dialects and private feature maps specific to a dialect or speaker. The public encoder maximizes the mutual information between samples of the same emotion class to learn dialect-independent emotional representations, while the private encoder extracts dialect-related, individual features, which reduces the interference of language differences with emotion recognition. Additionally, dialect corpus information is incorporated into the task. Experimental results show that the method significantly improves emotion recognition accuracy in multilingual coal mine environments and enhances the system's environmental adaptability.
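As an illustration of the public/private objective described above, the following is a minimal sketch and not the paper's implementation. It assumes that the mutual information between same-class samples is approximated by an InfoNCE-style contrastive loss, a common lower-bound estimator that the abstract does not name, and that public and private embeddings are kept apart by a squared-cosine penalty, which is likewise an assumption. The function names `same_class_contrastive_loss` and `orthogonality_penalty`, the `temperature` parameter, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def same_class_contrastive_loss(embeddings, labels, temperature=0.1):
    """InfoNCE-style loss (assumed stand-in for the mutual information term).

    For each anchor, samples with the same emotion label are positives and
    all remaining samples are negatives. A lower loss means that same-class
    samples lie closer together in the public embedding space.
    """
    total, count = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in logits if labels[j] == labels[i]]
        if not positives:
            continue
        # Numerically stable log-sum-exp over all other samples.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        for j in positives:
            total += log_denom - logits[j]
            count += 1
    return total / count if count else 0.0


def orthogonality_penalty(public, private):
    """Mean squared cosine similarity between each sample's public and
    private embedding (assumed separation term; zero when orthogonal)."""
    return sum(cosine(p, q) ** 2 for p, q in zip(public, private)) / len(public)


if __name__ == "__main__":
    # Toy public embeddings for four utterances (hypothetical values).
    labels = ["angry", "angry", "calm", "calm"]
    clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    scattered = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
    print("clustered loss:", same_class_contrastive_loss(clustered, labels))
    print("scattered loss:", same_class_contrastive_loss(scattered, labels))
```

In this sketch the contrastive term would be applied to the public encoder's output, so that utterances with the same emotion in different dialects are pulled together, while the penalty discourages the private encoder from duplicating that shared information. How the two terms are weighted against the classification loss is not specified in the abstract and is left open here.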