Appropriateness of Thyroid Nodule Cancer Risk Assessment and Management Recommendations Provided by Large Language Models

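Thyroid cancer; Nodule (geology); Thyroid nodule; Risk assessment; Medicine; Risk management; Thyroid; Risk analysis (engineering); Medical physics; Intensive care medicine; Computer science; Internal medicine; Business; Biology; Computer security; Paleontology; Finance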

Author

Mohammad Alarifi

Abstract

This study evaluates the appropriateness and reliability of thyroid nodule cancer risk assessment recommendations provided by three large language models (LLMs), ChatGPT, Gemini, and Claude, against clinical guidelines from the American Thyroid Association (ATA) and the National Comprehensive Cancer Network (NCCN). A team comprising a medical imaging informatics specialist and two radiologists developed 24 clinically relevant questions based on the ATA and NCCN guidelines. The readability of the AI-generated responses was evaluated using the Readability Scoring System. A total of 322 radiologists in training or practice from the United States, recruited via Amazon Mechanical Turk, assessed the responses. The appropriateness of the recommendations was analyzed quantitatively in SPSS, and qualitative feedback was analyzed in Dedoose. Paired-samples t-tests showed no statistically significant differences in overall performance among the three models. Claude achieved the highest mean score (21.84), followed closely by ChatGPT (21.83) and Gemini (21.47). Inappropriate response rates did not differ significantly, though Gemini showed a trend toward higher rates. ChatGPT, however, achieved the highest accuracy in providing appropriate responses (92.5%), followed by Claude (92.1%) and Gemini (90.4%). Qualitative feedback highlighted ChatGPT's clarity and structure, Gemini's accessibility but shallowness, and Claude's organization with occasional loss of focus. LLMs such as ChatGPT, Gemini, and Claude show potential for supporting thyroid nodule cancer risk assessment but require clinical oversight to ensure alignment with guidelines. Claude and ChatGPT performed nearly identically overall; Claude's mean score was the highest, but only marginally. Further development is necessary to enhance their reliability for clinical use.
