INTRODUCTION: Large language models (LLMs) are increasingly used in clinical decision-making, but their role in determining the appropriateness of endoscopic indications remains uncertain. METHODS: We conducted a vignette-based Italian survey comparing the appropriateness of indications for esophagogastroduodenoscopy (EGD) as judged by five AI models (ChatGPT-4.0, ChatGPT-4.5, Gemini, Claude AI, OpenEvidence) at two time points (April and September 2025) with the judgments of gastroenterologists, residents, and general practitioners, using ESGE/ASGE guidelines as the reference standard. RESULTS: A total of 135 physicians participated. AI performance varied over time: accuracies ranged from 50% to 90% in April and from 63% to 80% in September, with ChatGPT-4.5 and ChatGPT-4.0 outperforming physicians in September. DISCUSSION: Temporal and prompt-related variability highlights the need for multi-run, longitudinal evaluation before clinical adoption.