Background/aims This study aimed to develop a multimodal artificial intelligence (AI) system that integrates fundus imaging and patient questionnaire data to achieve clinician-level accuracy in diagnosing retinal detachment (RD).

Methods Ultra-widefield fundus images and comprehensive patient questionnaires were collected from patients with RD and healthy controls at Tsukazaki Hospital. A multimodal model was developed using the Contrastive Language–Image Pretraining (CLIP) framework to classify RD cases, alongside separate image-only and questionnaire-only models for comparison. Per-image and per-subject analyses were conducted to assess model performance.

Results The multimodal model outperformed both single-modal models in the per-image and per-subject assessments. It achieved accuracy, recall and F1 scores of 0.899±0.054, 0.902±0.043 and 0.902±0.048, respectively, in the per-image analysis, and 0.893±0.071, 0.949±0.044 and 0.873±0.074 in the per-subject analysis. The AI model's overall diagnostic accuracy was slightly lower than that of human clinicians; however, it achieved a higher recall, indicating improved detection of true RD cases.

Conclusion Integrating fundus imaging with patient questionnaire data significantly improves AI-based RD diagnosis. Future research should focus on expanding the dataset and refining the questionnaire design to further improve model performance.
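The abstract does not specify how the image and questionnaire modalities were combined. The following is a minimal illustrative sketch, assuming a CLIP-style dual-encoder in which a fundus-image embedding and a questionnaire-text embedding are concatenated and passed to a small classification head; all module names, dimensions and the fusion strategy are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (illustrative only): a CLIP-style dual-encoder that fuses a fundus-image
# embedding with a questionnaire-text embedding for binary RD classification.
# Module names, dimensions and the fusion strategy are assumptions, not the study's code.
import torch
import torch.nn as nn

class MultimodalRDClassifier(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Stand-in image encoder; in practice this would be a pretrained CLIP image tower
        # applied to ultra-widefield fundus images.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        # Stand-in text encoder: questionnaire answers serialized to token IDs and
        # mean-pooled into one embedding (a CLIP text tower in practice).
        self.text_embedding = nn.EmbeddingBag(num_embeddings=10_000, embedding_dim=embed_dim)
        # Fusion head: concatenate the two modality embeddings and predict RD vs. healthy.
        self.classifier = nn.Sequential(
            nn.Linear(2 * embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, images: torch.Tensor, questionnaire_tokens: torch.Tensor) -> torch.Tensor:
        img_feat = self.image_encoder(images)                 # (B, embed_dim)
        txt_feat = self.text_embedding(questionnaire_tokens)  # (B, embed_dim), mean-pooled
        fused = torch.cat([img_feat, txt_feat], dim=1)        # (B, 2 * embed_dim)
        return self.classifier(fused)                         # (B, 2) logits: [healthy, RD]

# Toy usage: a batch of 4 fundus images and 4 tokenized questionnaires.
model = MultimodalRDClassifier()
images = torch.randn(4, 3, 224, 224)
tokens = torch.randint(0, 10_000, (4, 32))
logits = model(images, tokens)
print(logits.shape)  # torch.Size([4, 2])
```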