Robust and accurate source depth estimation remains a significant challenge in underwater acoustics. This work builds on the observation that the interference structures in the cross-spectral density matrix of a vertical line array, which are generated by mode interference, exhibit multi-scale local patterns and non-uniform global patterns that are sensitive to source depth. Motivated by this physical mechanism, a deep learning-based source depth estimation (DL-SDE) framework is proposed. It integrates a multi-scale convolution module, which captures multi-scale local interference patterns via cascaded kernels with expanding receptive fields, and a residual multi-head self-attention module, which models global non-uniform relationships in the interference field. Numerical simulations demonstrate that DL-SDE is significantly more robust to environmental mismatch than matched field processing (MFP), with stable performance at frequencies above 100 Hz and array depths covering at least 50% of the water column. Moreover, saliency visualization confirms that these physics-guided components learn representations consistent with the multi-scale interference patterns. On data from the SACLANT 1993 experiment, DL-SDE achieves an 11.63 m reduction in mean absolute error and a 71% increase in probability of credible localization relative to MFP.
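
For readers unfamiliar with the input quantity, the cross-spectral density matrix (CSDM) of an array can be estimated as the snapshot average of the outer products of the narrowband pressure vectors. The sketch below illustrates only this standard estimator with NumPy on synthetic data. The function names, the trace normalization, and the two-channel real/imaginary layout are assumptions made for illustration and are not taken from the DL-SDE preprocessing.

```python
import numpy as np


def cross_spectral_density_matrix(snapshots):
    """Estimate the CSDM from narrowband array snapshots.

    snapshots: complex array of shape (n_snapshots, n_sensors), one
    frequency-domain pressure vector per row (assumed layout).
    """
    x = np.asarray(snapshots, dtype=complex)
    # Snapshot average of the outer products x x^H.
    csdm = x.T @ x.conj() / x.shape[0]
    # Trace normalization (an assumption) removes the source-level scale.
    return csdm / np.trace(csdm).real


def csdm_to_channels(csdm):
    """Stack real and imaginary parts as two image-like channels.

    This layout is a hypothetical network input, not the paper's.
    """
    return np.stack([csdm.real, csdm.imag])


# Synthetic example: 20 snapshots on an 8-element array.
rng = np.random.default_rng(0)
snapshots = rng.standard_normal((20, 8)) + 1j * rng.standard_normal((20, 8))

csdm = cross_spectral_density_matrix(snapshots)
features = csdm_to_channels(csdm)
```

The resulting matrix is Hermitian, so its real part is symmetric and its imaginary part is antisymmetric. Any interference structure a network sees in this representation is therefore mirrored about the main diagonal.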