Change detection (CD) is one of the most important methods for monitoring land surface changes. Recently, transformer-based models have been applied to CD. However, most of them focus on modeling the correlation within each image and cannot adequately model the difference between bi-temporal images. In this paper, we propose a difference transformer network (DiFormer) to address this issue. Specifically, we propose a Token Exchange-based Difference Evaluation (TEDE) module, which generates inconsistency between the changed region and its surrounding context to highlight the difference between the bi-temporal images. In addition, to obtain semantically rich exchangeable tokens, we design a Multi-scale Semantic Perception (MSP) module, which supports difference modeling. To evaluate DiFormer, we conduct qualitative and quantitative experiments on two public datasets, LEVIR-CD and S2Looking. Experimental results show that DiFormer outperforms several state-of-the-art models, achieving an F1 score of 92.15% on the LEVIR-CD dataset and 66.31% on the S2Looking dataset.
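
For reference, the F1 score reported above is the harmonic mean of precision and recall on the changed class. The following is a minimal sketch, assuming binary change masks flattened to 0/1 sequences (1 = changed) with counts pooled over all pixels. The function name `f1_score` is hypothetical, and the exact evaluation protocol used for DiFormer (for example, how scores are aggregated across images) is not specified here.

```python
def f1_score(pred, target):
    """F1 for the 'changed' class over flattened binary masks (1 = changed).

    Illustrative assumption: pred and target are equal-length sequences of 0/1.
    """
    tp = fp = fn = 0
    for p, t in zip(pred, target):
        if p == 1 and t == 1:
            tp += 1  # changed pixel correctly detected
        elif p == 1 and t == 0:
            fp += 1  # unchanged pixel flagged as changed
        elif p == 0 and t == 1:
            fn += 1  # changed pixel missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred = [1, 1, 0, 0, 1, 0]
    target = [1, 0, 0, 1, 1, 0]
    print(f"F1 = {f1_score(pred, target):.4f}")
```

Because unchanged pixels usually dominate bi-temporal image pairs, F1 on the changed class is more informative than overall pixel accuracy, which a model could inflate by predicting "no change" everywhere.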