Gene Disease Classification from Biomedical Text via Ensemble Machine Learning
Computer science
Ensemble learning
Artificial intelligence
Machine learning
Authors
Rabea F. Ghazi, Dhafar Hamed
Identifier
DOI:10.1109/dese60595.2023.10469108
Abstract
Detecting associations between genes and diseases is a vital task in bioinformatics and genomics, with significant implications for understanding the molecular underpinnings of disease. The rapid growth of the biomedical literature has made manual curation of these relationships burdensome and time-consuming. To address this difficulty, the present study introduced a robust ensemble machine-learning methodology for automating the classification of gene-disease relationships from biomedical text. The proposed model leveraged ensemble learning by combining several base classifiers (Decision Trees, Random Forest, AdaBoost, Bagging, CatBoost, Extra Trees, and XGBoost) with two feature extraction methods, term frequency (TF) and term frequency-inverse document frequency (TF-IDF). This architecture aimed to improve the accuracy and reliability of gene-disease association predictions by drawing on a wide range of features extracted from biomedical literature, including abstracts. Various ensemble configurations were evaluated using standard metrics: precision, recall, F1-score, AUC, and accuracy. The findings supported the efficacy of the ensemble methodology in improving both accuracy and robustness compared with individual classifiers. The highest accuracy, 0.979, was achieved with XGBoost and TF features.
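To make the two feature extraction methods and the evaluation metrics named in the abstract concrete, the following is a minimal standard-library sketch, not the authors' implementation. The toy sentences, the whitespace tokenizer, the function names, and the example labels are all illustrative assumptions. The abstract does not state which IDF weighting was used, so the smoothed variant ln((1 + N) / (1 + df)) + 1 is assumed here. AUC and the classifiers themselves are omitted.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def term_frequency(docs):
    """Return one {term: raw count} mapping per document (TF features)."""
    return [Counter(tokenize(doc)) for doc in docs]


def tf_idf(docs):
    """Return one {term: tf * idf} mapping per document (TF-IDF features).

    Assumes the smoothed IDF ln((1 + N) / (1 + df)) + 1; other variants exist.
    """
    tf = term_frequency(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for counts in tf for term in counts)
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    return [{term: count * idf[term] for term, count in counts.items()}
            for counts in tf]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented toy sentences standing in for biomedical abstracts.
docs = [
    "BRCA1 mutation increases breast cancer risk",
    "the gene TP53 is discussed in the review",
    "the CFTR mutation causes cystic fibrosis",
]
tf_features = term_frequency(docs)
tfidf_features = tf_idf(docs)

# Hypothetical gold labels and predictions (1 = gene-disease association).
scores = binary_metrics([1, 0, 1, 1], [1, 0, 0, 1])
```

In this sketch both "brca1" and "mutation" have a raw count of 1 in the first sentence, but TF-IDF weights "brca1" higher because it occurs in only one document. In a full pipeline, either feature mapping would be vectorized and passed to each base classifier in turn.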