Remote sensing (RS) scene classification aims to assign RS images with similar scene characteristics to the same category. Many RS images have complex backgrounds, rich content, and targets at multiple scales, exhibiting both large intra-class diversity and high inter-class similarity. Discriminative feature representations that highlight the differences between classes are therefore key to RS scene classification. Existing methods represent scenes by extracting either global context or discriminative part-level features from RS images. However, global-based methods often miss the salient details that distinguish similar RS scenes, while part-based methods tend to ignore the relationships between local ground objects, weakening the discriminative feature representation. In this paper, we combine global context and part-level discriminative features within a unified framework, called CGINet, for accurate RS scene classification. Specifically, we develop a light context-aware attention block (LCAB) that explicitly models global context, enlarging the receptive field and enriching contextual information. We also devise a co-enhanced loss module (CELM) that encourages the model to actively locate discriminative parts for feature enhancement. Notably, CELM is used only during training and is not activated at inference, so it adds no inference-time computational cost. Benefiting from LCAB and CELM, CGINet produces more discriminative features and thereby better classification performance. Comprehensive experiments on four benchmark datasets show that the proposed method achieves consistent performance gains over state-of-the-art RS scene classification methods.
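To make the two ideas concrete, the sketch below illustrates (i) a channel-attention block conditioned on globally pooled context and (ii) an auxiliary loss head that is active only in training mode, mirroring how CELM incurs no inference cost. This is a minimal PyTorch illustration under assumptions of our own: the SE-style global pooling, the linear auxiliary head, and the names `ContextAttention` and `TrainOnlyAuxLoss` are hypothetical, not the paper's actual LCAB or CELM designs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAttention(nn.Module):
    """Hypothetical context-aware attention block (not the paper's LCAB).

    Pools a global context vector and uses it to reweight channels, so
    every spatial location is modulated by image-level information.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> global context vector of shape (B, C)
        context = x.mean(dim=(2, 3))
        # Channel weights broadcast back over the spatial dimensions.
        weights = self.fc(context).unsqueeze(-1).unsqueeze(-1)
        return x * weights


class TrainOnlyAuxLoss(nn.Module):
    """Hypothetical training-only auxiliary head (not the paper's CELM).

    Computes an extra classification loss during training but returns a
    zero scalar at inference, so deployment cost is unchanged.
    """

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.aux_head = nn.Linear(channels, num_classes)

    def forward(self, features: torch.Tensor, labels: torch.Tensor = None) -> torch.Tensor:
        # features: (B, C) pooled descriptors
        if self.training and labels is not None:
            aux_logits = self.aux_head(features)
            return F.cross_entropy(aux_logits, labels)
        return features.new_zeros(())  # no effect, no cost at inference


# Usage: the attention block preserves the feature map's shape.
block = ContextAttention(channels=256)
x = torch.randn(2, 256, 14, 14)
y = block(x)  # (2, 256, 14, 14), reweighted by global context
```

The training-only pattern hinges on `nn.Module.training`: calling `model.eval()` before inference flips the flag, and the auxiliary branch is skipped entirely, which is one common way to realize the "active in training, inactive at inference" behavior the abstract attributes to CELM.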