MFMGP: an integrated machine learning fusion model for genomic prediction

Biology, Computational biology, Fusion, Artificial intelligence, Machine learning, Computer science, Philosophy, Linguistics
Authors
Chaopu Zhang,Qiqi Liang,Yuye Yu,S. Jin,Jinmei Huang,Zhongping Xu,Erbao Liu,Wensheng Wang,Fan Zhang,Fangzhou Liu,Yingyao Shi,Fenge Li,Zhikang Li,Shuangxia Jin,Min Li
Source
Journal: Plant Biotechnology Journal [Wiley]
Identifier
DOI:10.1111/pbi.14532
Abstract

Genome-wide selection (GS) is a contemporary breeding methodology that exploits a comprehensive array of molecular markers across the entire genome. However, challenges such as the lack of informative molecular markers and the selection of appropriate and efficient GS model(s) have confined most GS-based breeding efforts to laboratory simulations (Wang et al., 2023). Compared with conventional prediction models, machine learning (ML) algorithms offer new ways to address challenges such as big data analysis and high-performance parallel computing. At the current stage, however, GS using ML still has limitations, notably in model selection.

Here, we present MFMGP, a fusion model based on a variety of ML training methods. Its normalization fusion method with exponential decay weights assigns a weight to the prediction results of each model and applies exponential decay to these weights, so that more recent and/or more relevant model predictions receive higher weights. The weights are then normalized, and the final fusion prediction is obtained as the weighted average of the models' prediction results (Figure 1a). The MFMGP software for interactive GS analyses is available at http://www.biohuaxing.com/#/MFMGP.

To verify the prediction accuracy of MFMGP, we compared it with seven commonly used GS models: the classical GS model (GBLUP), four ML-based models (LightGBM, SVR, XGBoost and HGBoost) and two deep learning (DL)-based models (DNNGP and DeepCCR). In rice, we used a natural population of 3024 Asian cultivated rice accessions (3KRG) to construct the training population (Table S1). The GS accuracy of MFMGP was compared using the phenotype datasets of 2110 rice accessions for 13 yield-related and morphological traits with over 1.0 M SNPs (Figure 1b,c; Table S2).
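The exponential-decay weighting described above can be illustrated with a minimal sketch. This is not the MFMGP implementation: the text does not specify how models are ordered or what decay rate is used, so the ranking by validation error, the `decay` parameter and the function name `exp_decay_fusion` are all assumptions made for illustration.

```python
import numpy as np


def exp_decay_fusion(preds, val_errors, decay=0.5):
    """Fuse per-model predictions with normalized exponential-decay weights.

    preds      : array-like, shape (n_models, n_samples), one row per model
    val_errors : array-like, shape (n_models,), validation error per model
                 (lower is better); used here as a stand-in for "relevance"
    decay      : hypothetical decay rate; 0 gives a plain average
    """
    preds = np.asarray(preds, dtype=float)
    val_errors = np.asarray(val_errors, dtype=float)

    # Rank models by validation error: rank 0 is the best model.
    ranks = np.argsort(np.argsort(val_errors))

    # Exponential decay: each step down the ranking shrinks the weight.
    raw_weights = np.exp(-decay * ranks)

    # Normalize so the weights sum to 1, then take the weighted average.
    weights = raw_weights / raw_weights.sum()
    fused = weights @ preds
    return fused, weights


# Toy example: three models predicting two individuals.
example_preds = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
example_errors = [0.3, 0.1, 0.2]  # the second model validates best
example_fused, example_weights = exp_decay_fusion(example_preds, example_errors)
print("weights:", example_weights)
print("fused prediction:", example_fused)
```

With `decay=0` all models contribute equally; larger values concentrate the weight on the best-ranked models, which is one plausible reading of giving "more relevant" predictions higher weights.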
The results of the 10-fold cross-validation (CV) indicated that MFMGP exhibited the highest prediction accuracy for all 13 tested traits, with an average accuracy of 0.53, significantly (P < 0.01) higher than that of the GBLUP model (average value = 0.36). The prediction accuracy of MFMGP was also significantly higher than the averages of the four ML models (average value = 0.45) and the two DL methods (average value = 0.34) (Tables S2 and S3). On average, the prediction accuracy of MFMGP was improved by 52.9% over GBLUP, 18.4% over all the other ML models, 4.2% over the best model from the four integrated ML methods and 73.3% over the DL models. Additionally, MFMGP had the smallest root mean square error (RMSE) for all 13 traits, with average RMSE reductions of 11.1% relative to GBLUP, 5.8% relative to ML and 24.3% relative to DL (Tables S2 and S4). With a sample size of 2110, the computation time of MFMGP on a CPU (server configuration: Intel(R) Xeon(R) CPU E7-8860 v3 @ 2.20 GHz) was slightly longer than that of the four tested ML models, but significantly shorter than that of GBLUP and of the DL methods (run on GPU) (Table S5).

We then used six traits from the 2000 Iranian bread wheat dataset to compare the prediction accuracy of the eight models using 33 709 SNPs (Figure 1d; Table S2). The average prediction accuracy of MFMGP across all six traits was 0.65, compared with GBLUP (0.32), DeepCCR (0.59), DNNGP (0.57), HGBoost (0.63), LightGBM (0.63), SVR (0.28) and XGBoost (0.62). The prediction accuracy of MFMGP was improved on average by 2.9% over the best model from the four integrated ML methods.

Using 1 122 352 SNPs and four traits from 1245 cotton accessions, MFMGP showed the highest prediction accuracy and lowest RMSE values among all methods (Figure 1e; Table S2).
On average, MFMGP improved prediction accuracy by 12.1% and reduced RMSE by 21.9% for the four traits when compared with the other seven methods, and improved prediction accuracy by 3.5% when compared with the four integrated ML methods.

Using 32 599 markers and four traits of 6210 maize samples, MFMGP showed an average prediction accuracy of 0.85, again the highest among the eight methods used, except for DTT, for which its prediction accuracy was similar to that of SVR (Figure 1f; Table S2).

To explore the predictive ability of MFMGP in animals, we used the IMF content phenotype and 39 614 markers of 1490 pig samples to compare the predictions of the eight methods (Figure 1g; Table S2). MFMGP performed best among all the methods, with average improvements in prediction accuracy of 24.5% over GBLUP, 57.6% over the ML models, 16.2% over the best model from the four integrated ML methods and 18.5% over the DL models.

To investigate the impact of trait heritability, we compared the low-heritability trait RBSSD (H² = 0.38) with the high-heritability traits GL (H² = 0.94) and GW (H² = 0.94) using MFMGP. We used the RBSSD phenotypic data from 2017 as the training population (n = 1277) to predict phenotypes in two independent environments, obtaining prediction accuracies of 0.36 in 2016 (n = 606) and 0.34 in 2019 (n = 676). In contrast, when we used the GL and GW data from 2017 to predict their phenotypic values in 2015 and 2016 (n = 760), the prediction accuracies of GL and GW reached very high average values of 0.91 and 0.92, respectively (Figure 1h). All four density plots showed that the angle between the y = x line and the fitted regression line was very small in the repeated experiments across different environments (Figure S1).

To verify the influence of subspecific differences on GS accuracy, we randomly selected two subgroups with the same number of accessions (n = 500) from Xian and Geng.
We used MFMGP to analyse two representative traits (GW and HD), and found that the prediction accuracy in Geng was higher than that in Xian for GW, but the opposite was true for HD. Additionally, we used the Xian subgroup as the training population to predict the Geng subgroup, and the Geng subgroup as the training population to predict the Xian subgroup. The results showed that the prediction accuracy of one subgroup for the other was extremely low (Figure S2A). The same caution should be exercised when GS is applied to breeding for disease resistance. As Figure S2B clearly demonstrates, the highly virulent race (V) had a much higher prediction accuracy than the weakly virulent races C4 and C5.

To verify the impact of population size on GS, we randomly sampled accessions at nine different population sizes. The GS analysis showed that the prediction accuracy of the trait improved gradually as the population size increased (Figure 1i).

In summary, we developed an ML fusion model for predicting the phenotypes of breeding populations for complex traits using GS. Compared with other methods, MFMGP showed the following advantages. (1) Improved prediction accuracy: MFMGP was able to integrate the strengths of many classical models and reduce the biases associated with single classical models. (2) Reduced overfitting: MFMGP was able to mitigate the overfitting of training data commonly encountered by single models. (3) Enhanced generalization ability: MFMGP could better capture the complex patterns and diversity in the data. (4) Robustness to errors: by synthesizing the predictions of multiple models, MFMGP could effectively reduce the prediction errors that single models make because of anomalies or specific circumstances. (5) Exploitation of model complementarity.
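The accuracy and RMSE comparisons reported above rest on a small set of metrics, sketched below. The text does not define "prediction accuracy"; the sketch assumes the common GS convention of the Pearson correlation between observed and predicted phenotypes, and the helper names (`prediction_accuracy`, `rmse`, `relative_improvement`, `kfold_indices`) are hypothetical, not part of MFMGP.

```python
import numpy as np


def prediction_accuracy(y_true, y_pred):
    """Pearson correlation between observed and predicted phenotypes
    (assumed definition of accuracy; the source does not state it)."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted phenotypes."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline


def kfold_indices(n_samples, k=10, seed=0):
    """Randomly split sample indices into k disjoint validation folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)


# Toy usage: fold sizes for 10-fold CV on 2110 samples, and the gain
# implied by two average accuracies quoted in the text (0.53 vs 0.36).
folds = kfold_indices(2110, k=10)
print([len(fold) for fold in folds])
print(round(relative_improvement(0.53, 0.36), 1))
```

A gain computed from two averaged accuracies generally differs from the average of per-trait gains, so this helper is not expected to reproduce the reported percentages exactly.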
Currently, most GS experiments focus on predicting the performance of single traits in specific populations and specific environments, neglecting the fact that most plant and animal breeding programmes aim to improve multiple target traits across target environments (particularly in plants). The most significant factors affecting predictive accuracy are heritability and sample size. Heritability is the key parameter of the genotype–phenotype association: the higher a trait's heritability, the more accurate a GS model will be, and conversely, low heritability leads to lower prediction accuracy. An insufficient sample size reduces the representativeness of the training population because of increased sampling error, resulting in biased estimates of genetic parameters and reduced prediction accuracy. Thus, it is necessary to collect more phenotypes from training populations of appropriate sizes across multiple target environments, so that trait genetic effects and their interactions with environments can be adequately estimated and integrated into the MFMGP model. As functional and population genomic research in plants and animals progresses rapidly, the greatest challenge is how to integrate accurate functional information on many genes and their allelic effects on target traits into the MFMGP model for GS applications in plant and animal breeding, and eventually to realize breeding by design.

This work was supported by the National Natural Science Foundation of China (U21A20214), the Natural Science Foundation of Anhui Province (2308085QC91), the National Natural Science Foundation of China (32301783 and 32101768), the Innovation Program of the Chinese Academy of Agricultural Sciences (CAAS-CSIAF-202303) and the Nanfan special project, CAAS (YYLH2309, YBXM2322, YYLH2401).

The authors declare no conflicts of interest.

Z.L. and S.J. designed the experiments. J.H., S.J., W.W., F.Z., E.L. and Y.S. provided the phenotype data and performed the statistical analysis. C.Z., Q.L., Y.Y., F.L., Z.X. and F.L. performed the bioinformatic analyses. C.Z., M.L. and Z.L. wrote the manuscript.

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Tables S1–S5: Supplementary Tables. Figures S1–S3: Supplementary Figures. Please note: The publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.