Authors
Pengfei Li, Xuejuan Zhang, Erjia Zhu, Shijun Yu, Bin Sheng, Yih-Chung Tham, Tien Yin Wong, Hongwei Ji
Abstract
At the crossroads of digital health and education, large language models (LLMs) emerge as tools with great potential.1 Trained on expansive textual data sets, these state-of-the-art artificial intelligence models can generate multidisciplinary content, answer intricate queries, and accelerate information delivery.1 Particularly in the field of cardio-oncology, which combines cardiac and oncological expertise, LLMs have the potential to provide valuable insights to specialists such as cardiologists and oncologists.2 This is useful when standard guidelines are not immediately available or when a vast amount of interdisciplinary information must be synthesized. However, the performance of LLMs in this context remains largely unknown. This study aims to benchmark these state-of-the-art artificial intelligence models on the interdisciplinary queries inherent in cardio-oncology, where integrative insights from cardiology and oncology are crucial.

The data that support the findings of this study are available from the last author upon reasonable request.

Our study, conducted between October 2, 2023, and October 12, 2023, compiled 25 questions according to the 2022 European Society of Cardiology guideline on cardio-oncology3 (Table). Each question was individually and independently posed to 5 LLMs: ChatGPT-3.5, ChatGPT-4.0, Bard, Llama 2, and Claude 2, generating a total of 25 responses per chatbot. All generated responses were formatted as plain text and stripped of any identifying details (eg, remarks such as "I'm not a doctor" from ChatGPT). Responses were randomly shuffled within their respective question sets so that reviewers remained blinded to which LLM produced each response. Two experienced attending-level physicians independently assessed the responses in 5 separate rounds, each conducted on a distinct day, with an overnight washout period to minimize memory bias (Table). This study did not involve human subjects; institutional review board approval and informed consent were waived.

The mean±SD word count was 386±91 for ChatGPT-3.5, 386±96 for ChatGPT-4.0, 340±78 for Google Bard, 360±96 for Meta Llama 2, and 203±27 for Anthropic Claude 2 (P<0.001). The preliminary results indicated that ChatGPT-4.0 provided 17 of 25 (68%) appropriate responses, followed by Bard, Claude 2, and ChatGPT-3.5 with 13 of 25 (52%) each, and Llama 2 with 12 of 25 (48%; P=0.653). A notable area of concern was the treatment and prevention domain, in which all 5 LLM chatbots earned either borderline or poor
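For readers who wish to reproduce the blinding procedure, a minimal Python sketch of stripping identifying remarks and shuffling responses within each question set follows. The study's actual code is not published; the data structure, the disclaimer pattern, and the function name below are illustrative assumptions, and the sample question and response texts are placeholders.

```python
import random
import re

# Hypothetical response store: question -> {model name: response text}.
# The question and response texts here are placeholders, not study data.
responses = {
    "How should baseline cardiovascular risk be assessed before anthracycline therapy?": {
        "ChatGPT-3.5": "I'm not a doctor, but baseline assessment usually includes ...",
        "ChatGPT-4.0": "Baseline assessment should include echocardiography and ...",
        "Bard": "Key steps include a clinical history, ECG, and ...",
        "Llama 2": "Before anthracycline therapy, clinicians typically ...",
        "Claude 2": "Risk stratification generally combines ...",
    },
}

# Assumed pattern for model-identifying disclaimers (eg, ChatGPT's
# "I'm not a doctor" preamble); the study may have edited these by hand.
DISCLAIMER = re.compile(r"^\s*I'm not a doctor[^.]*\.\s*", re.IGNORECASE)

def prepare_blinded_set(per_question, rng):
    """Strip identifying remarks, then shuffle responses within one question."""
    items = [{"model": m, "text": DISCLAIMER.sub("", t)}
             for m, t in per_question.items()]
    rng.shuffle(items)  # reviewers see responses in random order
    return items

rng = random.Random(42)  # fixed seed so the shuffle is reproducible
blinded = {q: prepare_blinded_set(v, rng) for q, v in responses.items()}
for question, items in blinded.items():
    print(question)
    for i, item in enumerate(items, 1):
        # Model label deliberately withheld from the printed review sheet.
        print(f"  Response {i}: {item['text'][:50]}...")
```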
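The letter reports P values without naming the statistical tests used. One plausible reconstruction, assuming a one-way ANOVA for the word counts and a chi-square test on the 5×2 table of appropriate versus non-appropriate responses, is sketched below; the word-count arrays are simulated to match the reported mean±SD and are not the study data.

```python
import numpy as np
from scipy import stats

# Simulated word counts per model, drawn to match the reported mean±SD
# (386±91, 386±96, 340±78, 360±96, 203±27); NOT the study's actual data.
rng = np.random.default_rng(0)
word_counts = {
    "ChatGPT-3.5": rng.normal(386, 91, 25),
    "ChatGPT-4.0": rng.normal(386, 96, 25),
    "Bard": rng.normal(340, 78, 25),
    "Llama 2": rng.normal(360, 96, 25),
    "Claude 2": rng.normal(203, 27, 25),
}
# One-way ANOVA across the 5 groups (one plausible choice for the
# reported P<0.001; the letter does not state which test was applied).
f_stat, p_len = stats.f_oneway(*word_counts.values())
print(f"Word count: F={f_stat:.2f}, P={p_len:.2g}")

# Appropriate-response counts out of 25 questions, as reported.
appropriate = {"ChatGPT-4.0": 17, "Bard": 13, "Claude 2": 13,
               "ChatGPT-3.5": 13, "Llama 2": 12}
# 5x2 contingency table: appropriate vs not appropriate per model.
table = np.array([[n, 25 - n] for n in appropriate.values()])
chi2, p_prop, dof, expected = stats.chi2_contingency(table)
print(f"Appropriateness: chi2={chi2:.2f}, df={dof}, P={p_prop:.3f}")
```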