Real‐world deployment and evaluation of PEri‐operative AI CHatbot (PEACH): a large language model chatbot for peri‐operative medicine
Authors
Yuhe Ke, Liyuan Jin, Kabilan Elangovan, Bee Suan Ong, Chin Yang Oh, Jacqueline Sim, Kenny Loh, Chai Rick Soh, Jonathan Cheng, Aaron Kwang Yang Lee, Daniel Shu Wei Ting, Nan Liu, Hairil Rizal Abdullah
Summary
Introduction
Large language models are emerging as powerful tools in healthcare, particularly for complex, domain‐specific tasks. This study describes the development and evaluation of PEri‐operative AI CHatbot (PEACH). It was developed by embedding 35 institutional peri‐operative protocols into a secure large language model environment, with iterative prompt engineering and internal testing to ensure clinical relevance and accuracy.
Methods
The system was tested in a silent deployment using real‐world data. Accuracy, safety and usability were assessed. Accuracy was evaluated by comparing the responses from PEACH against institutional guidelines and expert consensus. Deviations and hallucinations were categorised according to potential harm, and user feedback was evaluated using Davis' Technology Acceptance Model. After the initial silent deployment, PEACH was updated with minor amendments to one of the protocols.
Results
In total, 240 real‐world clinical iterations were evaluated. First‐generation accuracy was 97.5% (78/80), with an overall accuracy of 96.7% (232/240) across three iterations. In the updated PEACH, accuracy improved to 97.9% (235/240), which differed significantly from the null hypothesis of 95% accuracy (p = 0.018). Hallucinations and deviations were minimal (1/240 and 2/240, respectively). Usability was high, with clinicians noting that PEACH expedited decisions in 95% of cases. The κ statistic for inter‐rater reliability for PEACH was 0.772 and 0.893 between three iterations, compared with 0.610 and 0.784 for experienced peri‐operative physicians.
Discussion
PEACH is an accurate, adaptable tool that enhances consistency and efficiency in peri‐operative decision‐making. Future research should explore scalability across specialties and its impact on clinical outcomes.
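The comparison of the updated accuracy (235/240) against the 95% benchmark can be illustrated with a short calculation. The summary does not name the test that was used, so the sketch below assumes a one‐sided exact binomial test; the function name `binom_upper_tail` is hypothetical and exists only for this illustration.

```python
from math import comb


def binom_upper_tail(successes: int, n: int, p0: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(n, p0).

    This is the p-value of a one-sided exact binomial test of
    H0: accuracy = p0 against H1: accuracy > p0 (an assumed test,
    not one confirmed by the summary).
    """
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


# Counts reported in the Results: 235 correct responses out of 240.
correct, total, null_accuracy = 235, 240, 0.95

observed_accuracy = correct / total
p_value = binom_upper_tail(correct, total, null_accuracy)

print(f"observed accuracy = {observed_accuracy:.3f}")
print(f"one-sided exact binomial p = {p_value:.3f}")
```

Under this assumption the p‐value comes to roughly 0.018, which is consistent with the reported figure, although that agreement alone does not confirm which test the authors applied.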