Performance of large language models in preoperative and postoperative counselling for aesthetic facial procedures

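Readability; Empathy; Psychosocial; Medicine; Bonferroni correction; Psychological intervention; Psychology; Clinical psychology; Test (biology); Reading (process); English; Consistency; Objective test; Respondent; Patient satisfaction; Identification (biology); Facial expression; MEDLINE; Fluency; Medical physics; Reliability (semiconductor); Audiology; Rubric; Comprehension; Physical therapy; Orthognathic surgery; Consistency (knowledge base)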

Authors

Bruce Kepler Frutuoso Maia,Éverton Freitas de Morais,Thiago de Santana Santos,Luís Eduardo Charles Pagotto

Source

Journal: British Journal of Oral & Maxillofacial Surgery [Elsevier BV]
Date: 2026-01-07 Volume (issue): 64 (3): 216-222

Links

nih.gov, doi.org

Identifiers

DOI: 10.1016/j.bjoms.2026.01.002

Abstract

Large language models (LLMs) are increasingly used in healthcare, but their role in aesthetic surgical procedures remains unexplored. These interventions present unique challenges, marked by high patient expectations, emotionally charged decision-making, and subtle yet impactful outcomes on self-perception and psychosocial health. This cross-sectional in silico study evaluated the performance of ChatGPT-4 (OpenAI, 2025), DeepSeek V3 (DeepSeek AI/High-Flyer, 2025), and Gemini 2.5 Pro Experimental (Google, 2025) in preoperative and postoperative counselling for aesthetic facial surgery. Twenty-six standardised patient-oriented questions were submitted, and the anonymised responses of the chatbots were independently assessed by two calibrated oral and maxillofacial surgeons across four domains: accuracy, empathy, readability (Flesch-Kincaid Reading Ease (FKRE) and Grade Level (FKGL)), and referencing reliability (including the identification of fabricated or non-verifiable citations, a phenomenon referred to as "hallucination" in LLM outputs). Statistical tests included Kruskal-Wallis, Mann-Whitney U with Bonferroni correction, Spearman correlation, and chi-squared. DeepSeek achieved the highest accuracy (4.77 (0.51), p = 0.0078) and readability (FKRE 2.92 (0.27), p < 0.00001), while Gemini outperformed in empathy (4.08 (0.89), p < 0.001). GPT-4 produced the most hallucinated citations (36%) compared with Gemini (14%) and DeepSeek (8.8%) (p < 0.00001). A negative correlation between empathy and readability (r = -0.34, p = 0.002) suggested a trade-off between affective tone and accessibility. Overall, LLMs generated satisfactory counselling responses with distinct performance profiles, supporting their potential in patient-centred communication while reinforcing the need for human oversight.
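The two readability metrics named in the abstract have standard published formulas, sketched below as a minimal illustration. The abstract does not say which tool or counting rules the authors used, so the function names and the example counts here are hypothetical, and the word, sentence, and syllable counts are assumed to be supplied by the caller.

```python
def _check_counts(words: int, sentences: int, syllables: int) -> None:
    """Reject counts that would make the ratios undefined or meaningless."""
    if words <= 0 or sentences <= 0 or syllables <= 0:
        raise ValueError("words, sentences and syllables must all be positive")


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    _check_counts(words, sentences, syllables)
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula (approximate US school grade)."""
    _check_counts(words, sentences, syllables)
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    # Hypothetical counts for one chatbot response, not data from the study.
    words, sentences, syllables = 100, 5, 150
    print(round(flesch_reading_ease(words, sentences, syllables), 2))
    print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
```

Both formulas depend only on average sentence length and average syllables per word, so longer sentences or longer words lower the reading-ease score and raise the grade level. Syllable counting is left outside the sketch because different tools use different heuristics, which can shift the scores slightly.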
