In recent years, multi-view graph-based clustering methods have received considerable attention owing to their ability to integrate complementary features from multiple views when partitioning samples into clusters. However, most existing graph-based approaches are shallow models, which cannot extract latent information from complex multi-view data. Inspired by the success of self-attention, this study proposes a Transformer-based multi-view clustering method, named MVCformer, which learns a deep non-negative spectral embedding that serves as an indicator matrix for one-stage cluster assignment. In addition, a simple but effective optimization framework is designed, which combines a reconstruction loss, formulated from the viewpoint of similarity graph reconstruction, with an orthogonal loss that makes the learned non-negative embedding column-orthogonal. The proposed method is evaluated in extensive experiments on nine real-world multi-view datasets, and the results demonstrate its superiority over state-of-the-art methods.
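
To make the two-term objective concrete, one plausible form is sketched below. The notation is an assumption for illustration and is not taken from the abstract: $S^{(v)} \in \mathbb{R}^{n \times n}$ denotes the similarity graph of view $v$, $H \in \mathbb{R}^{n \times c}$ the learned non-negative embedding for $n$ samples and $c$ clusters, and $\lambda > 0$ a hypothetical trade-off weight. The exact losses used by MVCformer may differ.

```latex
% Illustrative sketch only; symbols S^{(v)}, H, \lambda are assumed, not from the source.
\begin{equation*}
  \mathcal{L}(H)
  = \underbrace{\sum_{v=1}^{V} \bigl\| S^{(v)} - H H^{\top} \bigr\|_F^{2}}_{\text{similarity graph reconstruction}}
  + \lambda \, \underbrace{\bigl\| H^{\top} H - I_c \bigr\|_F^{2}}_{\text{column orthogonality}},
  \qquad H \ge 0 .
\end{equation*}
```

Under this reading, the orthogonal term is what permits one-stage assignment: if $H \ge 0$ and its columns are exactly orthogonal, then every product $H_{ik} H_{il}$ with $k \ne l$ must vanish, so each row of $H$ has at most one non-zero entry and sample $i$ can be assigned to cluster $\arg\max_k H_{ik}$ without a separate k-means step.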