ABSTRACT Medical images typically contain complex structures and abundant detail, and exhibit variations in texture, contrast, and noise across imaging modalities. Different types of images contain both local and global features whose expression and importance vary, making accurate classification highly challenging. Convolutional neural network (CNN)-based approaches are limited by the local receptive field of the convolutional kernel, which restricts their ability to capture global contextual information effectively. Transformer-based models can compensate for this limitation by modeling long-range dependencies, but they struggle to extract fine-grained local features from images. To address these issues, we propose a novel architecture, the Interactive CNN and Transformer for Cross Attention Fusion Network (IFC-Net). This model leverages the strengths of CNNs for efficient local feature extraction and of transformers for capturing global dependencies, enabling it to preserve both local detail and global contextual relationships. Additionally, we introduce a cross-attention fusion module that adaptively adjusts the feature fusion strategy, facilitating efficient integration of local and global features and enabling dynamic information exchange between the CNN and transformer branches. Experimental results on four benchmark datasets, ISIC2018, COVID-19, and two liver cirrhosis ultrasound datasets (linear array and convex array), demonstrate that the proposed model achieves superior classification performance, outperforming both CNN-only and transformer-only architectures.
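
The bidirectional cross-attention fusion described above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's exact implementation: the module and variable names (CrossAttentionFusion, local_feats, global_feats) and the gated fusion at the end are hypothetical, and it assumes both branches emit token sequences of the same embedding width.

import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Illustrative sketch: CNN tokens attend to transformer tokens and
    vice versa, then an adaptive gate blends the two enriched streams."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Local (CNN) tokens query global (transformer) tokens ...
        self.local_to_global = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # ... and global tokens query local tokens.
        self.global_to_local = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_local = nn.LayerNorm(dim)
        self.norm_global = nn.LayerNorm(dim)
        # Per-token gate that adaptively weights the fusion (assumed design).
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, local_feats: torch.Tensor, global_feats: torch.Tensor) -> torch.Tensor:
        # local_feats, global_feats: (batch, tokens, dim)
        enriched_local, _ = self.local_to_global(
            query=local_feats, key=global_feats, value=global_feats
        )
        enriched_global, _ = self.global_to_local(
            query=global_feats, key=local_feats, value=local_feats
        )
        # Residual connections keep each branch's original information.
        enriched_local = self.norm_local(local_feats + enriched_local)
        enriched_global = self.norm_global(global_feats + enriched_global)
        # Adaptive fusion: per-token convex combination of the two streams.
        g = self.gate(torch.cat([enriched_local, enriched_global], dim=-1))
        return g * enriched_local + (1 - g) * enriched_global


if __name__ == "__main__":
    # e.g., a 14x14 CNN feature map flattened to 196 tokens of width 256,
    # fused with 196 transformer tokens of the same width.
    cnn_tokens = torch.randn(2, 196, 256)
    vit_tokens = torch.randn(2, 196, 256)
    fused = CrossAttentionFusion(dim=256)(cnn_tokens, vit_tokens)
    print(fused.shape)  # torch.Size([2, 196, 256])

The symmetric query/key swap is what makes the exchange bidirectional: each branch injects its complementary context into the other, while the gate lets the network decide, token by token, how much local versus global evidence to keep.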