Rapid identification of chronic kidney disease in electronic health record database using computable phenotype combining a common data model

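Identification (biology); Kidney disease; Medicine; Phenotype; Disease; Electronic health record; Database; Computer science; Bioinformatics; Data mining; Internal medicine; Biology; Genetics; Gene; Botany; Health care; Economics; Economic growth
Authors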
Huaiyu Wang,Juan Du,Yu Yang,Hongbo Lin,Bin Bao,Guohui Ding,Chao Yang,Guilan Kong,Luxia Zhang
Source
Journal: Chinese Medical Journal [Lippincott Williams & Wilkins]
Volume (Issue): 136 (7): 874-876 Citations: 1
Identifiers
DOI:10.1097/cm9.0000000000002168
Abstract

To the Editor: Chronic kidney disease (CKD) is a global public health burden. The global prevalence of CKD exceeded 10%, whereas awareness was only around 10%.[1] In the era of big data, improving the identification of CKD using informatics tools is important. The computable phenotype has proven to be an efficient tool for identifying patients from electronic health record (EHR) data. It is an automated algorithm that identifies the target population through objective criteria expressed as logic statements. Effective implementation of a computable phenotype depends on valid mapping of raw data to a standard set of data and definitions. Previous studies developed computable phenotypes for CKD identification in English by using the Logical Observation Identifiers Names and Codes (LOINC) and the International Classification of Diseases (ICD) codes.[2,3] Because these codes are not universally used and because of the language barrier, implementing these computable phenotypes in non-English settings and/or in the absence of an identical coding system is difficult.

A common data model (CDM) has been reported as a solution for data standardization and the localization of computable phenotypes.[4] The core of a CDM is the extraction of key elements, their transformation into a standard terminology, and their loading into a standard schema, that is, extraction, transformation, and loading (ETL). Various CDMs with different original aims, such as the Observational Medical Outcomes Partnership CDM, the Sentinel CDM, and the Patient-Centered Outcomes Research Network CDM, have been widely used and have successfully facilitated the standardization of EHR data. Sentinel previously posted coding trend analyses on kidney disease, but only ICD-9 and ICD-10 codes were included. A CDM for CKD characterization was still lacking.

The confirmation of CKD takes at least 3 months.
This requirement delays timely diagnosis and increases missed diagnoses of CKD in clinical practice, especially for patients seeking health care at different institutes.[5] An EHR database collects healthcare data continuously across institutes and updates them in real time. Monitoring and identifying patients with CKD by using an informatics tool based on such a database is promising. Collectively, it is reasonable to speculate that a computable phenotype combined with a CDM might facilitate CKD-related data extraction and CKD identification using EHR data.

Yinzhou is a district of Ningbo, Zhejiang Province, China, with a population of 1.6 million. The Regional Health Information System (RHIS) in Yinzhou collected the EHRs of residents and updated the database in real time. Within this database, a unique identity code (PERSONKEY) was generated from personal ID, sex, date of birth, and name and was used to recognize the same person, link health profiles across sub-databases, and generate complete EHRs. The EHRs of 976,409 adults with medical records were extracted as the raw data for the following analyses [Supplementary Figure 1, https://links.lww.com/CM9/B73]. This study was approved by the ethics committee of Peking University First Hospital.

The CDM for CKD characterization was designed in accordance with the principles described in The Book of OHDSI: Observational Health Data Sciences and Informatics. In accordance with the Kidney Disease: Improving Global Outcomes (KDIGO) clinical guidelines for CKD (2012), the key elements for CKD identification were defined as age, sex, kidney function, and urine abnormality.[6] Hence, the data domains of the CDM for CKD identification were designed as demographics, laboratory tests, and diagnosis. The standard terminology of the data domains was defined in accordance with the KDIGO-CKD clinical guidelines and ICD-10 codes in English and in Chinese.
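The record linkage by PERSONKEY described above can be illustrated with a small sketch. The letter does not say how the code was derived from personal ID, sex, date of birth, and name, so the SHA-256 hash, the normalisation steps, and the names `make_personkey` and `link_records` below are all assumptions made for illustration only.

```python
import hashlib
import unicodedata
from collections import defaultdict


def make_personkey(personal_id, sex, date_of_birth, name):
    """Hypothetical PERSONKEY: a SHA-256 digest of the four normalised fields.

    The hashing scheme is an assumption; the source only lists the input fields.
    """
    parts = [personal_id, sex, date_of_birth, name]
    # Normalise width, surrounding whitespace, and case so that trivially
    # different spellings of the same value produce the same key.
    normalised = [unicodedata.normalize("NFKC", p).strip().upper() for p in parts]
    return hashlib.sha256("|".join(normalised).encode("utf-8")).hexdigest()


def link_records(records):
    """Group records (dicts) from different sub-databases under one key."""
    linked = defaultdict(list)
    for rec in records:
        key = make_personkey(
            rec["personal_id"], rec["sex"], rec["date_of_birth"], rec["name"]
        )
        linked[key].append(rec)
    return dict(linked)


# Illustrative records only; none of these values come from the study.
records = [
    {"personal_id": "A123", "sex": "F", "date_of_birth": "1960-01-02",
     "name": "Li Hua", "source": "laboratory"},
    {"personal_id": "a123 ", "sex": "f", "date_of_birth": "1960-01-02",
     "name": "li hua", "source": "diagnosis"},
    {"personal_id": "B456", "sex": "M", "date_of_birth": "1955-07-09",
     "name": "Wang Wei", "source": "laboratory"},
]
linked = link_records(records)
```

In this toy input the first two records collapse into one person, which is the behaviour needed to assemble a complete EHR from separate sub-databases.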
Forms containing demographics (age, sex), laboratory tests (kidney function, albuminuria, proteinuria, hematuria), and diagnosis (ICD-10 codes and texts) in the EHR database were integrated by PERSONKEY. Altogether, 10,981,723 medical records of 976,409 individuals in the EHR database were prepared for the extraction of original vocabularies [Supplementary Figure 1, https://links.lww.com/CM9/B73]. The mapping rules between the original vocabularies and the standard terminology were established through manual annotation and format conversion. Two nephrologists independently conducted the annotation, and one informaticist performed the mapping [Figure 1].

Figure 1: Process of the development of the CDM for CKD characterization and the computable phenotype for CKD identification. CDM: Common data model; CKD: Chronic kidney disease; eGFR: Estimated glomerular filtration rate; EHR: Electronic health record; ICD: International Classification of Diseases.

The algorithm of the computable phenotype for CKD identification was designed in accordance with the KDIGO clinical guidelines for CKD[6] [Figure 1]. On the basis of the standard terminology of the CDM, patients showing at least one of the following manifestations lasting for >3 months were defined as having CKD:
(1) reduced kidney function: estimated glomerular filtration rate (eGFR) <60 mL/min/1.73 m²;
(2) albuminuria: urine albumin-to-creatinine ratio ≥30 mg/g or urine albumin concentration ≥20 mg/L;
(3) proteinuria: urine protein-to-creatinine ratio ≥150 mg/g, or 24 h proteinuria ≥150 mg/24 h, or urinalysis protein ≥1+;
(4) hematuria without non-CKD-related causes, including urologic neoplasms, urinary tract infection, and injury; the criteria for hematuria were urine red blood cells ≥3 cells/HPF (or >28 cells/μL) or urine occult blood ≥2+;
(5) CKD-related diagnosis, including primary, secondary, or congenital kidney disease, renal vascular disease, maintenance dialysis, and recipient/donor of kidney transplantation [Supplementary Table 1, https://links.lww.com/CM9/B73].

Patients who received re-tests over a period of 3 months and were confirmed to have none of the abovementioned manifestations were defined as normal cases. Patients who presented these manifestations for ≤3 months or did not receive any re-test were defined as cases to be addressed and were to be processed in the next iteration of CKD identification [Figure 1].

On the basis of the number of individuals with EHRs and the diversity of EHR infrastructures and data sources, seven of the 42 healthcare institutes in Yinzhou were selected to implement the computable phenotype based on the CDM. In total, three tertiary general hospitals, two specialty hospitals (a maternity and children's hospital and an orthopedic hospital), one secondary general hospital, and one community health center were selected.

The performance of the computable phenotype was validated through manual review. Cases identified as with or without CKD were randomly selected, and their original records of demographics, diagnosis, and laboratory tests were manually reviewed by two nephrologists. For those without CKD, all diagnoses and CKD-related laboratory tests in the database were extracted and manually reviewed. For those with CKD, all diagnoses and laboratory tests from the date of presentation of CKD to the endpoint of the database were extracted and manually reviewed. A panel discussion was held when the two reviewers disagreed. Review by nephrologists was defined as the gold standard for CKD identification.

The data processing and computation in the RHIS were based on the Hadoop framework.
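The three-way outcome of the phenotype described above (CKD, normal, to be addressed) can be sketched in a few lines. This is a simplified illustration, not the authors' implementation, which this letter says was written as SQL statements. It covers only three of the laboratory criteria, approximates "3 months" as 90 days, treats a criterion as persistent when its first and last abnormal results are more than 90 days apart, and omits the hematuria exclusions and the diagnosis-based criterion. The term names `egfr`, `uacr`, and `upcr` and the function `classify` are hypothetical.

```python
from datetime import date

PERSISTENCE_DAYS = 90  # assumption: "3 months" approximated as 90 days

# Thresholds quoted in the letter; the dictionary keys are hypothetical names.
ABNORMAL = {
    "egfr": lambda v: v < 60,    # eGFR, mL/min/1.73 m^2
    "uacr": lambda v: v >= 30,   # urine albumin-to-creatinine ratio, mg/g
    "upcr": lambda v: v >= 150,  # urine protein-to-creatinine ratio, mg/g
}


def classify(observations):
    """Classify one person from (test_name, date, value) tuples."""
    abnormal_dates = {}
    all_dates = []
    for test, when, value in observations:
        if test not in ABNORMAL:
            continue  # tests outside this simplified sketch are ignored
        all_dates.append(when)
        if ABNORMAL[test](value):
            abnormal_dates.setdefault(test, []).append(when)

    # CKD: the same manifestation is observed over a span of more than 3 months.
    for dates in abnormal_dates.values():
        if (max(dates) - min(dates)).days > PERSISTENCE_DAYS:
            return "CKD"

    # Normal: re-tested over more than 3 months with no abnormal result at all.
    if all_dates and not abnormal_dates:
        if (max(all_dates) - min(all_dates)).days > PERSISTENCE_DAYS:
            return "normal"

    # Everything else waits for the next iteration of identification.
    return "to_be_addressed"


persistent = classify([("egfr", date(2020, 1, 1), 50),
                       ("egfr", date(2020, 5, 1), 55)])
healthy = classify([("egfr", date(2020, 1, 1), 95),
                    ("egfr", date(2020, 6, 1), 92)])
single = classify([("uacr", date(2020, 1, 1), 45)])
```

Keeping the undecided cases in their own category, rather than forcing them into CKD or normal, mirrors the iterative design described in the text: they are simply re-evaluated once more follow-up data arrive.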
The computing engine was Spark, and the data warehouse was Hive, which provided support for structured query language (SQL) (The Apache Software Foundation, Wakefield, MA, USA). The ETL process of the CDM and the implementation of the computable phenotype were conducted using SQL statements.

The demographic and clinical characteristics of CKD-identified patients were analyzed. The stages of CKD-identified patients were evaluated in terms of eGFR levels and presented as G1–G5. Continuous and categorical variables were presented as mean ± standard deviation and frequency, respectively. The performance of the computable phenotype was evaluated in terms of sensitivity, specificity, and accuracy and analyzed using MedCalc 15.8 (MedCalc Software Ltd., Ostend, Belgium).

The standard terminology for CKD characterization is shown in Figure 1. The bilingual terminology is presented in Supplementary Table 2, https://links.lww.com/CM9/B73. A total of 617 original vocabularies for laboratory tests were found and standardized by processing 10,981,723 medical records of 976,409 individuals from 42 medical institutes. The formats of dates, categorical data, and test units were converted. By manual annotation, 111 types of diagnosis (corresponding to 171 types of ICD-10 codes in the English and Chinese versions), including primary, secondary, and congenital kidney disease, renal vascular disease, and uremia-related diagnosis, were reorganized as CKD-related diagnoses [Supplementary Table 1, https://links.lww.com/CM9/B73].

By scanning 21,474,008 records of laboratory tests and diagnoses of 557,719 individuals in seven medical institutes, 64,036 (11.5%) patients with CKD were identified by the computable phenotype. In China, patients commonly seek health care across different institutes. Thus, the EHRs of more than half of the residents in the whole database were extracted from the seven representative institutes. Among the patients identified with CKD, 55,682 (87.0%) received serum creatinine tests.
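Staging by eGFR level, as used for the G1–G5 breakdown of patients with serum creatinine tests, can be sketched as follows. The letter does not list the cut-offs it applied; the values below are the KDIGO 2012 GFR categories with G3a and G3b merged into a single G3, which is an assumption about how the five stages were formed. The function name `egfr_stage` is hypothetical.

```python
def egfr_stage(egfr):
    """Map an eGFR value (mL/min/1.73 m^2) to a stage label G1-G5.

    Cut-offs follow the KDIGO 2012 GFR categories; G3a (45-59) and
    G3b (30-44) are merged into "G3" here as an assumption.
    """
    if egfr >= 90:
        return "G1"
    if egfr >= 60:
        return "G2"
    if egfr >= 30:
        return "G3"
    if egfr >= 15:
        return "G4"
    return "G5"


# Illustrative values only; these are not study data.
stages = [egfr_stage(v) for v in (102.0, 75.5, 44.0, 20.0, 9.0)]
```

A patient staged G1 or G2 has preserved eGFR, so under the criteria in this letter such a patient is counted as having CKD only because of another persistent manifestation such as albuminuria, hematuria, or a CKD-related diagnosis.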
The majority of patients were in early stages (G1: 33,315 cases [59.8%]; G2: 12,980 cases [23.3%]). Patients in G1 were the youngest (53.7 ± 14.0 years), whereas patients in G4 were the oldest (82.3 ± 14.6 years). The highest proportions of hematuria and of albuminuria/proteinuria were observed in G1 (17,187 cases [51.6%]) and G5 (417 cases [51.3%]), respectively. The frequency of patients labeled with a CKD-related ICD-10 code increased from G1 (16,795 cases [50.4%]) to G5 (737 cases [90.7%]) [Supplementary Table 3, https://links.lww.com/CM9/B73].

In total, the EHRs of 50 CKD-identified cases and 50 cases without CKD were randomly sampled and reviewed by two nephrologists. All 50 CKD-identified cases were confirmed as having CKD, whereas three of the cases identified as without CKD were deemed misclassified because they did not meet the criterion of re-testing over 3 months. The sensitivity, specificity, and accuracy of the computable phenotype for CKD identification were 94.3%, 100.0%, and 97.0%, respectively [Supplementary Table 4, https://links.lww.com/CM9/B73].

Compared with previous models, the present computable phenotype particularly considered the utilization of existing non-uniform data and its capacity for localization across databases with different settings. Nadkarni et al[3] developed a computable phenotype to identify patients with CKD in a population with diabetes and/or hypertension based on the eMERGE network. Their algorithm mainly relied on ICD-9 codes. Hence, the performance of their computable phenotype was influenced by the rate of missing diagnosis records and/or by awareness of the disease. Norton et al[2] developed the National Kidney Disease Education Program (NKDEP) e-phenotype for CKD identification using laboratory tests, which were extracted through LOINC. The NKDEP e-phenotype effectively avoided the influence of the diagnosis rate, but its dependence on LOINC limited its localization in databases without LOINC.
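The validation figures reported above can be reproduced from the manual review counts. The confusion matrix used here is inferred from the text rather than taken from Supplementary Table 4: all 50 phenotype-positive cases were confirmed (50 true positives, 0 false positives), and 3 of the 50 phenotype-negative cases were judged misclassified (3 false negatives, 47 true negatives). The function name `diagnostic_metrics` is hypothetical.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # true positives among all with disease
    specificity = tn / (tn + fp)            # true negatives among all without disease
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return sensitivity, specificity, accuracy


# Counts inferred from the manual review described in the text.
sens, spec, acc = diagnostic_metrics(tp=50, fp=0, fn=3, tn=47)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} accuracy={acc:.1%}")
# prints: sensitivity=94.3% specificity=100.0% accuracy=97.0%
```

Under this reading the three misclassified cases lower only the sensitivity (50/53), which matches the reported 94.3%, 100.0%, and 97.0%.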
The algorithm of the present computable phenotype combined CKD-related diagnostic records and laboratory tests to improve data utilization and the identification rate. The terminology of the CDM favored standard descriptions over a coding system, so as to preserve the potential for further expansion to foreign databases that lack an identical coding system. In the present implementation, the EHR data from healthcare institutes of different levels were scanned successfully, and the prevalence of CKD and the characteristics of the CKD-identified patients were consistent with a previous nationally representative study.[7] These findings demonstrated the effectiveness of embedding a CDM into the computable phenotype.

The present study established a reproducible paradigm for the design and construction of a CDM and computable phenotype in other fields and databases. First, the criteria for disease identification may be slightly expanded beyond the standard definition of the disease to balance data utilization and the identification rate. Second, embedding a CDM into the computable phenotype can improve the efficiency of its implementation across different databases. Third, a CDM containing non-monotonic terminology will increase the potential for localization. Finally, the correspondence between the English and Chinese terminologies can serve as an interface linking data in Chinese with existing resources and techniques in English. This strategy may also be feasible for promoting data extraction and information exchange in other languages.

The present study is the first to establish a computable phenotype for CKD identification based on a CDM with a bilingual terminology for CKD characterization.
This study developed an efficient tool for CKD identification based on a real-world EHR database and provides a potential interface, the CDM, for generalizing the computable phenotype across English and Chinese database settings.

Funding
This study was supported by grants from the National Natural Science Foundation of China (Nos. 82100741, 82003529, 91846101, 81771938, 81900665, 82090021), Beijing Municipal Science and Technology Commission (Grant No. 7212201), the University of Michigan Health System-Peking University Health Science Center Joint Institute for Translational and Clinical Research (Nos. BMU2020JI011, BMU2019JI005, BMU2018JI012), Beijing Nova Programme Interdisciplinary Cooperation Project (No. Z191100001119008), National Key R&D Program of the Ministry of Science and Technology of China (No. 2019YFC2005000), the National Key Research and Development Program of China (No. 2018AAA0102100), PKU-Baidu Fund (Nos. 2020BD005, 2019BD017), and CAMS Innovation Fund for Medical Sciences (No. 2019-I2M-5-046).

Conflicts of interest
None.
