Abstract

Multimodal endoscopic imaging combines two or more imaging modes, for example optical coherence tomography (OCT), photoacoustic imaging (PAI), microscopy endoscopy, and fluorescence imaging (FI). The combination of OCT and PAI provides comprehensive tissue information at high resolution and at imaging depths of a few millimeters, which has the potential to delineate tumor boundaries and improve the early diagnosis rate. However, both modes have weak molecular specificity. FI can reveal changes in tissue composition and enables targeted recognition, but its imaging depth and resolution are limited. Super‐resolution imaging can improve the lateral resolution and enable observation of subcellular structures. Multimodal endoscopic imaging can therefore enhance diagnosis and treatment by capturing comprehensive tissue information, monitoring the efficacy of photodynamic therapy, and revealing targeted changes. It could also accelerate the development of new methods that combine high resolution, large imaging depth, and high specificity. Here, research progress in multimodal endoscopic imaging and its medical applications are reviewed. First, the principle and performance of each mode are discussed. Then, representative implementations, research progress, and medical applications of each subsystem and of multimodal endoscopic imaging as a whole are illustrated. Finally, the current challenges and potential future development of multimodal endoscopic imaging are briefly introduced.