Abstract

The rapid advancement of spatial multi-omics technologies has opened new avenues for dissecting tissue architecture at unprecedented resolution. However, inherent disparities across omics modalities, such as differences in biological hierarchy and resolution, pose significant challenges for integrative analysis. To address these challenges, we present soFusion, a representation learning method for spatial multi-omics data that enables automated identification of tissue compartments. soFusion employs a graph convolutional network (GCN) to extract latent embeddings from spatial omics profiles. To capture cross-modality relationships and modality-specific features simultaneously, we introduce a novel strategy for intra- and inter-omics feature learning. In addition, modality-specific decoders are designed to preserve the unique information embedded in each omics type. We evaluated soFusion on multiple datasets covering gene expression, protein expression, and epigenetic features. Across all benchmarks, soFusion consistently outperformed existing methods in delineating anatomical structures and identifying spatial domains, yielding domains with improved continuity and reduced noise. Collectively, soFusion offers an effective solution for spatial multi-omics integration and substantially enhances the robustness of spatial domain identification.
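For readers unfamiliar with the GCN encoder mentioned above, the following is a minimal sketch of a single generic graph-convolution step, using the standard symmetric normalization with self-loops. It is not soFusion's implementation: the function names (`normalize_adjacency`, `gcn_layer`), the four-spot chain graph, and the feature and embedding sizes are all assumptions chosen for illustration, and the weights are random rather than learned.

```python
import numpy as np

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a symmetric adjacency matrix A."""
    a_hat = adj + np.eye(adj.shape[0])   # add self-loops so each spot keeps its own signal
    deg = a_hat.sum(axis=1)              # degrees are >= 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_norm, x, w):
    """One propagation step: ReLU(A_norm @ X @ W)."""
    return np.maximum(a_norm @ x @ w, 0.0)

# Hypothetical toy input: 4 spatial spots connected in a chain (0-1-2-3),
# each with 5 omics features, embedded into 2 latent dimensions.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))   # spot-by-feature matrix (illustrative values)
w = rng.normal(size=(5, 2))   # untrained weights, for shape illustration only

a_norm = normalize_adjacency(adj)
z = gcn_layer(a_norm, x, w)
print(z.shape)  # (4, 2)
```

In a spatial setting, the adjacency matrix would typically be built from spot coordinates (for example, a k-nearest-neighbor graph), so that each spot's embedding mixes its own profile with those of its spatial neighbors.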