ABSTRACT

Aim
To conduct the first comprehensive benchmarking of dental students against 22 multimodal artificial intelligence (AI) models spanning six major model families (Anthropic, DeepSeek, Google Gemini, OpenAI, Meta Llama, Mistral/Qwen) in emergency permanent tooth avulsion management, quantifying performance differences across Bloom's Taxonomy higher-order cognitive domains (Apply, Analyse, Evaluate) against International Association of Dental Traumatology (IADT) guidelines in order to optimise evidence-based educational integration.

Methodology
This cross-sectional study compared 35 fifth-year dental students with 22 multimodal AI models grouped by producer (Anthropic, DeepSeek, Google Gemini, OpenAI, Meta, Mistral/Qwen) using a validated clinical vignette, accompanied by a clinical photograph, featuring a 14-year-old with three avulsed mature permanent teeth (#11, 12, 22). Performance was assessed through four questions (one Apply, two Analyse, one Evaluate) via an IADT-aligned rubric validated against paediatric dentistry standards. Responses were independently scored by two blinded evaluators, and scores were then compared using ANOVA with post hoc Tukey HSD, Kruskal-Wallis, and chi-square tests (SPSS v20, α = 0.05).

Results
Overall performance analysis revealed significant group differences (p = 0.012, partial η² = 0.27). Only students achieved high-acceptability A-level performance (17.1%, χ² = 39.27, p = 0.003), despite the top AI models (Claude-Sonnet-3.7-Reasoning and DeepSeek-R1, both 7.0/10) marginally exceeding the student mean score (6.5/10). Performance varied among AI systems: students significantly outperformed Meta Llama models (mean difference = 3.50, p = 0.014), which performed exclusively at the not-acceptable (D) level. Students dominated Bloom's Analyse level (visual diagnosis: 100% tooth identification vs. a 77.3% AI failure rate; age-specific management differentiation, p < 0.001), whereas AI excelled at Apply/emergency steps (medical management: 81.8% perfect scores) and Evaluate/prosthetic contraindication reasoning (Anthropic: 0.92 ± 0.20 vs. students: 0.41 ± 0.33, p = 0.005). Critical AI deficiencies included tooth misidentification (77% of models), failure to mention follow-up schedules for immature replanted teeth (100%), and inappropriate intervention recommendations (40.9% suggested unsuitable options such as resin-bonded bridges). For soft tissue management, only 23% of AI models addressed the need to suture gingival lacerations.

Conclusions
AI's fundamental limitations in visual diagnosis, misinterpretation of IADT guidelines, and inadequate clinical translation mandate its strict positioning as a supplemental protocol accelerator rather than an independent diagnostic tool. Educational integration requires: (1) careful model selection and supervision for any visual triage application; (2) mandatory verification against IADT guidelines; and (3) preservation of human clinical reasoning as the irreplaceable cornerstone of permanent tooth avulsion management.
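For readers who wish to see the shape of the statistical comparison outside SPSS, the following is a minimal Python sketch. All inputs are hypothetical stand-ins (illustrative group sizes, score distributions, and A-D grade counts, not the study's data), and scipy/statsmodels substitute for the SPSS v20 procedures named in the Methodology.

# Minimal sketch of the score comparison described above, assuming
# hypothetical rubric totals (0-10 scale) and hypothetical A-D grade
# counts; scipy and statsmodels stand in for the SPSS v20 procedures.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Hypothetical scores: 35 students plus two illustrative AI producer groups.
groups = {
    "students": rng.normal(6.5, 1.0, 35).clip(0, 10),
    "anthropic": rng.normal(7.0, 0.8, 4).clip(0, 10),
    "meta_llama": rng.normal(3.0, 0.9, 4).clip(0, 10),
}
scores = np.concatenate(list(groups.values()))
labels = np.concatenate([[name] * len(v) for name, v in groups.items()])

# One-way ANOVA across groups (alpha = 0.05).
f_stat, p_anova = stats.f_oneway(*groups.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")

# Post hoc Tukey HSD for pairwise mean differences between groups.
print(pairwise_tukeyhsd(scores, labels, alpha=0.05))

# Non-parametric check of the same group comparison.
h_stat, p_kw = stats.kruskal(*groups.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")

# Chi-square on acceptability grades (hypothetical A/B/C/D counts per row).
grade_table = np.array([[6, 14, 10, 5],    # students
                        [0, 2, 1, 1],      # anthropic group
                        [0, 0, 1, 3]])     # meta_llama group
chi2, p_chi, dof, _ = stats.chi2_contingency(grade_table)
print(f"Chi-square: chi2 = {chi2:.2f}, dof = {dof}, p = {p_chi:.4f}")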