Performance of Large Language Models in Patient Complaint Resolution: Web-Based Cross-Sectional Survey

投诉医疗保健患者满意度医学横断面研究家庭医学护理部政治学病理法学

作者

Lorraine Pei Xian Yong,Joshua Yi Min Tung,Zi Yao Lee,Win Sen Kuan,Mui Teng Chua

出处

期刊：Journal of Medical Internet Research [JMIR Publications]
日期：2024-07-06 卷期号：26: e56413-e56413 被引量：5

链接

jmir.org jmir.org nih.govdoi.org

标识

DOI：10.2196/56413

摘要

Background Patient complaints are a perennial challenge faced by health care institutions globally, requiring extensive time and effort from health care workers. Despite these efforts, patient dissatisfaction remains high. Recent studies on the use of large language models (LLMs) such as the GPT models developed by OpenAI in the health care sector have shown great promise, with the ability to provide more detailed and empathetic responses as compared to physicians. LLMs could potentially be used in responding to patient complaints to improve patient satisfaction and complaint response time. Objective This study aims to evaluate the performance of LLMs in addressing patient complaints received by a tertiary health care institution, with the goal of enhancing patient satisfaction. Methods Anonymized patient complaint emails and associated responses from the patient relations department were obtained. ChatGPT-4.0 (OpenAI, Inc) was provided with the same complaint email and tasked to generate a response. The complaints and the respective responses were uploaded onto a web-based questionnaire. Respondents were asked to rate both responses on a 10-point Likert scale for 4 items: appropriateness, completeness, empathy, and satisfaction. Participants were also asked to choose a preferred response at the end of each scenario. Results There was a total of 188 respondents, of which 115 (61.2%) were health care workers. A majority of the respondents, including both health care and non–health care workers, preferred replies from ChatGPT (n=164, 87.2% to n=183, 97.3%). GPT-4.0 responses were rated higher in all 4 assessed items with all median scores of 8 (IQR 7-9) compared to human responses (appropriateness 5, IQR 3-7; empathy 4, IQR 3-6; quality 5, IQR 3-6; satisfaction 5, IQR 3-6; P<.001) and had higher average word counts as compared to human responses (238 vs 76 words). Regression analyses showed that a higher word count was a statistically significant predictor of higher score in all 4 items, with every 1-word increment resulting in an increase in scores of between 0.015 and 0.019 (all P<.001). However, on subgroup analysis by authorship, this only held true for responses written by patient relations department staff and not those generated by ChatGPT which received consistently high scores irrespective of response length. Conclusions This study provides significant evidence supporting the effectiveness of LLMs in resolution of patient complaints. ChatGPT demonstrated superiority in terms of response appropriateness, empathy, quality, and overall satisfaction when compared against actual human responses to patient complaints. Future research can be done to measure the degree of improvement that artificial intelligence generated responses can bring in terms of time savings, cost-effectiveness, patient satisfaction, and stress reduction for the health care system.

求助该文献

最长约 10秒，即可获得该文献文件

Performance of Large Language Models in Patient Complaint Resolution: Web-Based Cross-Sectional Survey

今日热心研友