Are clinical improvements in large language models a reality? Longitudinal comparisons of ChatGPT models and DeepSeek-R1 for psychiatric assessments and interventions

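Psychological intervention, Vignetting, Context (archaeology), Psychology, Psychiatry, Medicine, Clinical psychology, Social psychology, Biology, Paleontology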

Authors

Alexander Smith, Michael Liebrenz, Dinesh Bhugra, Juan Graña, Roman Schleifer, Anna Buadze

Source

Journal: International Journal of Social Psychiatry [SAGE Publishing]
Date: 2025-07-31

Links

nih.gov, doi.org

Identifier

DOI: 10.1177/00207640251358071

Abstract

Background: Potential clinical applications for emerging large-language models (LLMs; e.g. ChatGPT) are well-documented, and newer systems (e.g. DeepSeek) have attracted increasing attention. Yet, important questions endure about their reliability and cultural responsiveness in psychiatric settings.
Methods: This study explored the diagnostic accuracy, therapeutic appropriateness and cultural sensitivity of ChatGPT-4o, ChatGPT-4.5, and DeepSeek-R1 (all March 2025 versions). DeepSeek-R1 was evaluated for one of the first times in this context, and this also marks one of the first longitudinal inquiries into LLMs in psychiatry. Three psychiatric cases from earlier literature about sleep-related problems and cooccurring issues were utilised, allowing for cross-comparisons with a 2023 ChatGPT version, alongside culturally-specific vignette adaptations. Thus, overall, outputs for six scenarios were derived and were subsequently qualitatively reviewed by four psychiatrists for their strengths and limitations.
Results: ChatGPT-4o, ChatGPT-4.5, and DeepSeek-R1 showed modest improvements from the 2023 ChatGPT model but still exhibited significant limitations. Communication was empathetic and non-pharmacological advice typically adhered to evidence-based practices. Primary diagnoses were broadly accurate but often omitted somatic factors and comorbidities. Nevertheless, consistent with past findings, clinical reasoning worsened as case complexity increased; this was especially apparent for suicidality safeguards and risk stratification. Pharmacological recommendations frequently diverged from established guidelines, whilst cultural adaptations remained largely superficial. Finally, output variance was noted in several cases, and the LLMs occasionally failed to clarify their inability to prescribe medication.
Conclusion: Despite incremental advancements, ChatGPT-4o, ChatGPT-4.5 and DeepSeek-R1 were affected by major shortcomings, particularly in risk evaluation, evidence-based practice adherence, and cultural awareness. Presently, we conclude that these tools cannot substitute mental health professionals but may confer adjunctive benefits. Notably, DeepSeek-R1 did not fall behind its counterparts, warranting further inquiries in jurisdictions permitting its use. Equally, greater emphasis on transparency and prompt engineering would also be necessary for safe and equitable LLM deployment in psychiatry.
