In the mining and de novo design of new enzymes, protein solubility is one of the key factors determining the efficiency of functional expression. Developing solubility prediction algorithms is therefore important for reducing experimental costs and improving the success rate of protein engineering. However, only a small number of studies have used protein structural information as input, which has severely limited the accuracy and generalization of existing models. Here, we developed a protein solubility prediction architecture named MTPSol, which uses pretrained models to extract protein features and to process multimodal protein input. To further improve the performance of the architecture, cross-modal twin attention and multiscale feature networks were developed to integrate the multimodal features. When evaluated on public benchmark data sets, MTPSol achieved competitive predictive performance. In an assessment conducted on our constructed and validated transaminase data set, MTPSol outperformed existing state-of-the-art models, further attesting to the architecture's generalization across different protein families. We believe that MTPSol not only offers a more efficient screening method for the discovery of natural enzymes but also holds significant potential for de novo protein design.