As the core carriers of human activities, buildings are not only fundamental components of urban spatial structure but also play critical roles in global resource management, urban planning decisions, disaster risk assessment, and the monitoring of sustainable development. They are therefore features of substantial value in the analysis and application of remote sensing imagery. To address the false extractions, missed extractions, and blurred building boundaries caused by UNet's insufficient use of features at different scales, an improved UNet, RFTransUNet, is proposed; it is built around a feature cross transformer (FTrans) block that combines a residual network with a vision transformer. Based on UNet, the network takes residual blocks as its backbone, applies the FTrans block in the skip connections to perform multiscale feature fusion, and adopts a feature pyramid network for deep supervision during training. The residual-block encoder and decoder better retain semantic information while extracting detailed image features, the FTrans block fuses shallow detail information with deep semantic information, and the feature pyramid network supplies reference labels to each layer of the network during training. Comparative experiments verifying the proposed method are conducted on two publicly available datasets and a self-built dataset. Compared with the other methods, the proposed method produces clearer and more accurate extraction results, with fewer false extractions and better boundary preservation. The intersection over union on the public satellite imagery dataset, the public aerial imagery dataset, and the self-built unmanned aerial vehicle imagery dataset reaches 71.7862%, 90.6190%, and 84.7210%, respectively.
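The multiscale fusion performed in the skip connections can be illustrated with a minimal sketch. The nearest-neighbour upsampling, channel concatenation, and 1x1 convolution below are a generic stand-in for the paper's FTrans block, not its actual implementation; all function names, shapes, and weights here are illustrative assumptions.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse_skip(shallow, deep, w):
    # Generic skip-connection fusion (a simplified stand-in for FTrans):
    # upsample deep semantic features to the shallow resolution,
    # concatenate along the channel axis, then mix channels with a
    # 1x1 convolution, i.e. a per-pixel linear map given by w.
    up = upsample2x(deep)                        # (Cd, H, W)
    cat = np.concatenate([shallow, up], axis=0)  # (Cs + Cd, H, W)
    c, h, wd = cat.shape
    out = w @ cat.reshape(c, h * wd)             # (Cout, H*W)
    return out.reshape(-1, h, wd)

rng = np.random.default_rng(0)
shallow = rng.standard_normal((16, 8, 8))  # high-resolution detail features
deep = rng.standard_normal((32, 4, 4))     # low-resolution semantic features
w = rng.standard_normal((16, 48)) / 48     # illustrative 1x1-conv weights
fused = fuse_skip(shallow, deep, w)
print(fused.shape)  # (16, 8, 8)
```

The design point this sketch captures is that the fused map keeps the shallow branch's spatial resolution while carrying channels derived from both detail and semantic features, which is what lets the decoder recover sharper building boundaries.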