Method
20 June 2018
Open Access
Transparent process

Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets

Author Information
Ricard Argelaguet 1,‡ (orcid.org/0000-0003-3199-3722)
Britta Velten 2,‡ (orcid.org/0000-0002-8397-3515)
Damien Arnol 1 (orcid.org/0000-0003-2462-534X)
Sascha Dietrich 3 (orcid.org/0000-0002-0648-1832)
Thorsten Zenz 3,4,5 (orcid.org/0000-0001-7890-9845)
John C Marioni 1,6,7 (orcid.org/0000-0001-9092-0852)
Florian Buettner *,1,8 (orcid.org/0000-0001-5587-6761)
Wolfgang Huber *,2 (orcid.org/0000-0002-0474-2218)
Oliver Stegle *,1,2 (orcid.org/0000-0002-8818-7193)

1 European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, UK
2 European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
3 Heidelberg University Hospital, Heidelberg, Germany
4 German Cancer Research Center (dkfz) and National Center for Tumor Diseases (NCT), Heidelberg, Germany
5 Germany & Hematology, University Hospital Zurich and University of Zurich, Zurich, Switzerland
6 Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK
7 Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
8 Helmholtz Zentrum München–German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany

‡ These authors contributed equally to this work
* Corresponding author. Tel: +49 89 23742560; E-mail: [email protected]
* Corresponding author. Tel: +49 6221 387 8823; E-mail: [email protected]
* Corresponding author. Tel: +49 6221 3878190; E-mail: [email protected]

Molecular Systems Biology (2018) 14: e8124
https://doi.org/10.15252/msb.20178124
Abstract

Multi-omics studies promise the improved characterization of biological processes across molecular layers. However, methods for the unsupervised integration of the resulting heterogeneous data sets are lacking. We present Multi-Omics Factor Analysis (MOFA), a computational method for discovering the principal sources of variation in multi-omics data sets. MOFA infers a set of (hidden) factors that capture biological and technical sources of variability. It disentangles axes of heterogeneity that are shared across multiple modalities and those specific to individual data modalities. The learnt factors enable a variety of downstream analyses, including identification of sample subgroups, data imputation and the detection of outlier samples. We applied MOFA to a cohort of 200 patient samples of chronic lymphocytic leukaemia, profiled for somatic mutations, RNA expression, DNA methylation and ex vivo drug responses. MOFA identified major dimensions of disease heterogeneity, including immunoglobulin heavy-chain variable region status, trisomy of chromosome 12 and previously underappreciated drivers, such as response to oxidative stress. In a second application, we used MOFA to analyse single-cell multi-omics data, identifying coordinated transcriptional and epigenetic changes along cell differentiation.

Synopsis

Multi-Omics Factor Analysis (MOFA) is a computational framework for unsupervised discovery of the principal axes of biological and technical variation when multiple omics assays are applied to the same samples.

MOFA is a broadly applicable approach for multi-omics data integration.
The inferred latent factors represent the underlying principal axes of heterogeneity across the samples. Factors can be shared by multiple data modalities or can be data-type specific. The model flexibly handles missing values and different data types. In an application to chronic lymphocytic leukaemia, MOFA discovers a low-dimensional space spanned by known clinical markers and underappreciated axes of variation such as oxidative stress. In an application to multi-omics profiles from single cells, MOFA recovers differentiation trajectories and identifies coordinated variation between the transcriptome and the epigenome.

Introduction

Technological advances increasingly enable multiple biological layers to be probed in parallel, ranging from genome, epigenome, transcriptome, proteome and metabolome to phenome profiling (Hasin et al, 2017). Integrative analyses that use information across these data modalities promise to deliver more comprehensive insights into the biological systems under study. Motivated by this, multi-omics profiling is increasingly applied across biological domains, including cancer biology (Gerstung et al, 2015; Iorio et al, 2016; Mertins et al, 2016; Cancer Genome Atlas Research Network, 2017), regulatory genomics (Chen et al, 2016), microbiology (Kim et al, 2016) and host-pathogen biology (Soderholm et al, 2016). Recent technological advances have also enabled multi-omics analyses at the single-cell level (Macaulay et al, 2015; Angermueller et al, 2016; Guo et al, 2017; Clark et al, 2018; Colomé-Tatché & Theis, 2018). A common aim of such applications is to characterize heterogeneity between samples, as manifested in one or several of the data modalities (Ritchie et al, 2015). Multi-omics profiling is particularly appealing if the relevant axes of variation are not known a priori, and hence may be missed by studies that consider a single data modality or by targeted approaches.
A basic strategy for the integration of omics data is testing for marginal associations between different data modalities. A prominent example is molecular quantitative trait locus mapping, where large numbers of association tests are performed between individual genetic variants and gene expression levels (GTEx Consortium, 2015) or epigenetic marks (Chen et al, 2016). While eminently useful for variant annotation, such association studies are inherently local and do not provide a coherent global map of the molecular differences between samples. A second strategy is the use of kernel- or graph-based methods to combine different data types into a common similarity network between samples (Lanckriet et al, 2004; Wang et al, 2014); however, it is difficult to pinpoint the molecular determinants of the resulting graph structure. Related to this, there exist generalizations of other clustering methods to reconstruct discrete groups of samples based on multiple data modalities (Shen et al, 2009; Mo et al, 2013).

A key challenge that is not sufficiently addressed by these approaches is interpretability. In particular, it would be desirable to reconstruct the underlying factors that drive the observed variation across samples. These could be continuous gradients, discrete clusters or combinations thereof. Such factors would help in establishing or explaining associations with external data such as phenotypes or clinical covariates. Although factor models that aim to address this have previously been proposed (e.g. Meng et al, 2014, 2016; Tenenhaus et al, 2014; preprint: Singh et al, 2018), these methods either lack sparsity, which can reduce interpretability, or require a substantial number of parameters to be determined, either by computationally demanding cross-validation or post hoc.
Further challenges faced by existing methods are computational scalability to larger data sets, handling of missing values and non-Gaussian data modalities, such as binary readouts or count-based traits.

Results

We present Multi-Omics Factor Analysis (MOFA), a statistical method for integrating multiple modalities of omics data in an unsupervised fashion. Intuitively, MOFA can be viewed as a versatile and statistically rigorous generalization of principal component analysis (PCA) to multi-omics data. Given several data matrices with measurements of multiple omics data types on the same or on partially overlapping sets of samples, MOFA infers an interpretable low-dimensional data representation in terms of (hidden) factors (Fig 1A). These learnt factors capture major sources of variation across data modalities, thus facilitating the identification of continuous molecular gradients or discrete subgroups of samples. The inferred factor loadings can be sparse, thereby facilitating the linkage between the factors and the most relevant molecular features. Importantly, MOFA disentangles to what extent each factor is unique to a single data modality or is manifested in multiple modalities (Fig 1B), thereby revealing shared axes of variation between the different omics layers. Once trained, the model output can be used for a range of downstream analyses, including visualization, clustering and classification of samples in the low-dimensional space(s) spanned by the factors, as well as the automated annotation of factors using (gene set) enrichment analysis, the identification of outlier samples and the imputation of missing values (Fig 1B).

Figure 1. Multi-Omics Factor Analysis: model overview and downstream analyses
A. Model overview: MOFA takes M data matrices as input (Y1,…, YM), one or more from each data modality, with co-occurrent samples but features that are not necessarily related and that can differ in numbers.
MOFA decomposes these matrices into a matrix of factors (Z) for each sample and M weight matrices, one for each data modality (W1,…, WM). White cells in the weight matrices correspond to zeros, i.e. inactive features, whereas the cross symbol in the data matrices denotes missing values.
B. The fitted MOFA model can be queried for different downstream analyses, including (i) variance decomposition, assessing the proportion of variance explained by each factor in each data modality, (ii) semi-automated factor annotation based on the inspection of loadings and gene set enrichment analysis, (iii) visualization of the samples in the factor space and (iv) imputation of missing values, including missing assays.

Technically, MOFA builds upon the statistical framework of group Factor Analysis (Virtanen et al, 2012; Khan et al, 2014; Klami et al, 2015; Bunte et al, 2016; Zhao et al, 2016; Leppäaho & Kaski, 2017), which we have adapted to the requirements of multi-omics studies (Materials and Methods): (i) fast inference based on a variational approximation, (ii) inference of sparse solutions facilitating interpretation, (iii) efficient handling of missing values and (iv) flexible combination of different likelihood models for each data modality, which enables integrating diverse data types such as binary-, count- and continuous-valued data. The relationship of MOFA to previous approaches (Shen et al, 2009; Virtanen et al, 2012; Mo et al, 2013; Klami et al, 2015; Remes et al, 2015; Bunte et al, 2016; Hore et al, 2016; Zhao et al, 2016; Leppäaho & Kaski, 2017) is discussed in Materials and Methods and Appendix Table S3. MOFA is implemented as well-documented open-source software and comes with tutorials and example workflows for different application domains (Materials and Methods). Taken together, these functionalities provide a powerful and versatile tool for disentangling sources of variation in multi-omics studies.
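
To make the decomposition described above concrete, the following minimal sketch draws M data matrices from a shared factor matrix Z and view-specific sparse weight matrices W1,…, WM, with a fraction of entries set to missing. It is an illustration only and not the MOFA software: it assumes Gaussian noise for every view, and the function name `simulate_views`, its arguments and the `activity` pattern are hypothetical choices made for this example.

```python
import random

def simulate_views(n_samples, n_features, n_factors, activity,
                   sparsity=0.5, noise_sd=0.1, missing=0.05, seed=0):
    """Draw views Y_m = Z W_m^T + noise that share one factor matrix Z.

    activity[m][k] is 1 if factor k is active in view m and 0 otherwise
    (a hypothetical argument that mimics shared and view-specific factors).
    """
    rng = random.Random(seed)
    # Z: one row of K factor values per sample, shared by all views.
    Z = [[rng.gauss(0.0, 1.0) for _ in range(n_factors)]
         for _ in range(n_samples)]
    weights, views = [], []
    for m, d in enumerate(n_features):
        # W_m: D_m x K loadings. Inactive factors get all-zero columns, and
        # a random share of the remaining loadings is also set to zero.
        W = [[rng.gauss(0.0, 1.0)
              if activity[m][k] and rng.random() > sparsity else 0.0
              for k in range(n_factors)]
             for _ in range(d)]
        Y = []
        for n in range(n_samples):
            row = []
            for j in range(d):
                signal = sum(Z[n][k] * W[j][k] for k in range(n_factors))
                value = signal + rng.gauss(0.0, noise_sd)
                # None marks a missing measurement.
                row.append(None if rng.random() < missing else value)
            Y.append(row)
        weights.append(W)
        views.append(Y)
    return Z, weights, views

# Factor 0 is shared by both views; factor 1 is specific to view 0.
activity = [[1, 1], [1, 0]]
Z, weights, views = simulate_views(n_samples=20, n_features=[30, 15],
                                   n_factors=2, activity=activity)
```

In this toy setting, the second column of the weight matrix of view 1 is exactly zero, which corresponds to the white cells (inactive features) in the weight matrices of Fig 1A.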
Model validation and comparison on simulated data

First, to validate MOFA, we simulated data from its generative model, varying the number of views, the likelihood models, the number of latent factors and other parameters (Materials and Methods, Appendix Table S1). We found that MOFA was able to accurately reconstruct the latent dimension, except in settings with large numbers of factors or high proportions of missing values (Appendix Fig S1). We also found that models that account for non-Gaussian observations improved the fit when simulating binary or count data (Appendix Figs S2 and S3).

We also compared MOFA to two previously reported latent variable models for multi-omics integration: GFA (Leppäaho & Kaski, 2017) and iCluster (Mo et al, 2013). Over a range of simulations, we observed that GFA and iCluster tended to infer redundant factors (Appendix Fig S4) and were less accurate in recovering patterns of shared factor activity across views (Appendix Fig S5). MOFA is also computationally more efficient than these existing methods (Fig EV1). For example, the training on the CLL data, which we consider next, required 25 min using MOFA versus 34 h with GFA and 5–6 days with iCluster.

Figure EV1. Scalability of MOFA, GFA and iCluster
Time required for model training for GFA (red), MOFA (blue) and iCluster (green) as a function of number of factors K, number of features D, number of samples N and number of views M. Baseline parameters were M = 3, K = 10, D = 1,000, N = 100 and 5% missing values. Shown is the average time across 10 trials; error bars denote the standard deviation. iCluster is only shown for the lowest M as all other settings require on average more than 200 min for training.
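
One simple way to look for redundant factors in a fitted model is to check whether any two inferred factors are strongly correlated across samples. The sketch below does this with a plain Pearson correlation. It is an assumption about what "redundant" could mean in practice, not the criterion used for Appendix Fig S4; the threshold of 0.9 and the matrix `Z_hat` are made up for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def redundant_pairs(Z, threshold=0.9):
    """Z is an N x K factor matrix; return factor pairs with |r| >= threshold."""
    columns = list(zip(*Z))  # one tuple of N values per factor
    return [(j, k) for j, k in combinations(range(len(columns)), 2)
            if abs(pearson(columns[j], columns[k])) >= threshold]

# Made-up factor values: factor 1 is almost a rescaled copy of factor 0.
Z_hat = [[1.0, 2.1, 0.3],
         [2.0, 3.9, -0.5],
         [3.0, 6.2, 0.4],
         [4.0, 7.8, -0.2]]
pairs = redundant_pairs(Z_hat)
```

Here only the pair (0, 1) exceeds the threshold, so a model returning these factors would be carrying one dimension twice.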
Application to chronic lymphocytic leukaemia

We applied MOFA to a study of chronic lymphocytic leukaemia (CLL), which combined ex vivo drug response measurements with somatic mutation status, transcriptome profiling and DNA methylation assays (Dietrich et al, 2018; Fig 2A). Notably, nearly 40% of the 200 samples were profiled with some but not all omics types; such a missing value scenario is not uncommon in large cohort studies, and MOFA is designed to cope with it (Materials and Methods; Appendix Fig S1). MOFA was configured to combine different likelihood models in order to accommodate the combination of continuous and discrete data types in this study.

Figure 2. Application of MOFA to a study of chronic lymphocytic leukaemia
A. Study overview and data types. Data modalities are shown in different rows (D = number of features) and samples (N) in columns, with missing samples shown using grey bars.
B, C. (B) Proportion of total variance explained (R2) by individual factors for each assay and (C) cumulative proportion of total variance explained.
D. Absolute loadings of the top features of Factors 1 and 2 in the Mutations data.
E. Visualization of samples using Factors 1 and 2. The colours denote the IGHV status of the tumours; symbol shape and colour tone indicate chromosome 12 trisomy status.
F. Number of enriched Reactome gene sets per factor based on the gene expression data (FDR < 1%). The colours denote categories of related pathways defined as in Appendix Table S2.

MOFA identified 10 factors (minimum explained variance 2% in at least one data type; Materials and Methods). These were robust to algorithm initialization as well as subsampling of the data (Appendix Figs S6 and S7). The factors were largely orthogonal, capturing independent sources of variation (Appendix Fig S6).
Among these, Factors 1 and 2 were active in most assays, indicating broad roles in multiple molecular layers (Fig 2B). In contrast, other factors such as Factor 3 or Factor 5 were specific to two data modalities, and Factor 4 was active in a single data modality only. Cumulatively, the 10 factors explained 41% of variation in the drug response data, 38% in the mRNA data, 24% in the DNA methylation data and 24% in the mutation data (Fig 2C). We also trained MOFA when excluding individual data modalities to probe their redundancy, finding that factors that were active in multiple data modalities could still be recovered, while the identification of others was dependent on a specific data type (Appendix Fig S8). In comparison with GFA (Leppäaho & Kaski, 2017) and iCluster (Mo et al, 2013), MOFA was more consistent in identifying factors across multiple model instances (Appendix Fig S9).

MOFA identifies important clinical markers in CLL and reveals an underappreciated axis of variation attributed to oxidative stress

As part of the downstream pipeline, MOFA provides different strategies to use the loadings of the features on each factor to identify their aetiology (Fig 1B). For example, based on the top weights in the mutation data, Factor 1 was aligned with the somatic mutation status of the immunoglobulin heavy-chain variable region gene (IGHV), while Factor 2 aligned with trisomy of chromosome 12 (Fig 2D and E). Thus, MOFA correctly identified two major axes of molecular disease heterogeneity and aligned them with two of the most important clinical markers in CLL (Zenz et al, 2010; Fabbri & Dalla-Favera, 2016; Fig 2D and E).

IGHV status, the marker associated with Factor 1, is a surrogate of the differentiation state of the tumour's cell of origin and the level of activation of the B-cell receptor.
While in clinical practice this axis of variation is generally considered binary (Fabbri & Dalla-Favera, 2016), our results indicate a more complex substructure (Fig 3A, Appendix Fig S10). At the current resolution, this factor was consistent with three-subgroup models such as those proposed by Oakes et al (2016) and Queiros et al (2015) (Appendix Fig S11), although there is suggestive evidence for an underlying continuum. MOFA connected this factor to multiple molecular layers (Appendix Figs S12 and S13), including changes in the expression of genes previously linked to IGHV status (Vasconcelos et al, 2005; Maloum et al, 2009; Trojani et al, 2012; Morabito et al, 2015; Plesingerova et al, 2017; Fig 3B and C) and with drugs that target kinases in or downstream of the B-cell receptor pathway (Fig 3D and E).

Figure 3. Characterization of the inferred factor associated with the differentiation state of the cell of origin
A. Beeswarm plot with Factor 1 values for each sample with colours corresponding to three groups found by 3-means clustering with low factor values (LZ), intermediate factor values (IZ) and high factor values (HZ).
B. Absolute loadings for the genes with the largest absolute weights in the mRNA data. Plus or minus symbols on the right indicate the sign of the loading. Genes highlighted in orange were previously described as prognostic markers in CLL and associated with IGHV status (Vasconcelos et al, 2005; Maloum et al, 2009; Trojani et al, 2012; Morabito et al, 2015; Plesingerova et al, 2017).
C. Heatmap of gene expression values for genes with the largest weights as in (B).
D. Absolute loadings of the drugs with the largest weights, annotated by target category.
E. Drug response curves for two of the drugs with top weights, stratified by the clusters as in (A).
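
The grouping of samples into low, intermediate and high factor values by 3-means clustering can be sketched with Lloyd's algorithm on scalar values. This is a minimal illustration under stated assumptions, not the analysis code behind Fig 3: the quantile-based initialisation is a choice made here to keep the result deterministic, and the factor values in `factor1` are invented.

```python
def kmeans_1d(values, k=3, n_iter=100):
    """Lloyd's algorithm on scalars; returns (centres, labels).

    Centres start at evenly spaced quantiles, so they stay in increasing
    order and label 0 always denotes the group with the lowest values.
    """
    ordered = sorted(values)
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each value joins its nearest centre.
        labels = [min(range(k), key=lambda c: abs(v - centres[c]))
                  for v in values]
        # Update step: each centre moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centres[c])
        if updated == centres:
            break
        centres = updated
    return centres, labels

# Invented Factor 1 values for nine samples.
factor1 = [-2.1, -1.9, -2.3, -0.2, 0.1, 0.0, 1.8, 2.2, 2.0]
centres, labels = kmeans_1d(factor1)
groups = [("LZ", "IZ", "HZ")[lab] for lab in labels]
```

A clustering of this kind imposes three discrete groups even when the factor varies continuously, which is worth keeping in mind given the suggestive evidence for an underlying continuum.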
Despite their clinical importance, the IGHV and the trisomy 12 factors accounted for < 20% of the variance explained by MOFA, suggesting the existence of other sources of heterogeneity. One example is Factor 5, which was active in the mRNA and drug response data. Analysis of the weights in the mRNA data revealed that this factor tagged a set of genes enriched for oxidative stress and senescence pathways (Figs 2F and EV2A), with the top weights corresponding to heat-shock proteins (HSPs; Fig EV2B and C), genes that are essential for protein folding and are up-regulated upon stress conditions (Srivastava, 2002; Åkerfelt et al, 2010). Although genes in HSP pathways are up-regulated in some cancers and have known roles in tumour cell survival (Trachootham et al, 2009), thus far this gene family has received little attention in the context of CLL. Consistent with this annotation based on the mRNA data, we observed that the drugs with the strongest weights on Factor 5 were associated with response to oxidative stress, such as drugs targeting reactive oxygen species (ROS), the DNA damage response and apoptosis (Fig EV2D and E).

Figure EV2. Characterization of Factor 5 (oxidative stress response factor) in the CLL data
A. Beeswarm plot of Factor 5. Colours denote the expression of TNF, an inflammatory stress marker.
B. Gene set enrichment analysis for the top Reactome pathways in the mRNA data (t-test, Materials and Methods).
C. Heatmap of gene expression values for the six genes with largest loading. Samples are ordered by their factor values.
D. Scaled loadings for the top drugs with the largest loading, annotated by target category.
E. Heatmap of drug response values for the top three drugs with largest loading.
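
The proportion of variance that a single factor explains in one data modality can be illustrated with a coefficient of determination computed on the observed entries only. The sketch below assumes that R2 for factor k is one minus the residual sum of squares after subtracting that factor's contribution, divided by the total sum of squares of the feature-centred data; this is an assumption about the computation rather than a statement of the exact procedure in Materials and Methods. The function name `r2_per_factor` and the toy matrices are hypothetical.

```python
def r2_per_factor(Y, Z, W):
    """Fraction of variance explained by each factor in one view.

    Y: N x D data with None for missing entries,
    Z: N x K factor values, W: D x K loadings.
    """
    n, d, n_factors = len(Y), len(Y[0]), len(Z[0])
    # Centre each feature on the mean of its observed values.
    means = []
    for j in range(d):
        observed = [Y[i][j] for i in range(n) if Y[i][j] is not None]
        means.append(sum(observed) / len(observed))
    total = sum((Y[i][j] - means[j]) ** 2
                for i in range(n) for j in range(d) if Y[i][j] is not None)
    r2 = []
    for k in range(n_factors):
        # Residual after removing only the contribution of factor k.
        residual = sum((Y[i][j] - means[j] - Z[i][k] * W[j][k]) ** 2
                       for i in range(n) for j in range(d)
                       if Y[i][j] is not None)
        r2.append(1.0 - residual / total)
    return r2

# Toy view: factor 0 reproduces the data, factor 1 has zero loadings.
Z_toy = [[-1.0, 0.5], [0.0, -1.0], [1.0, 0.5]]
W_toy = [[2.0, 0.0], [1.0, 0.0]]
Y_toy = [[-2.0, -1.0], [0.0, None], [2.0, 1.0]]
r2 = r2_per_factor(Y_toy, Z_toy, W_toy)
```

In the toy view, factor 0 accounts for all of the variance and factor 1 for none, mirroring a factor that is active in a data modality versus one that is inactive in it.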
Factor 4 captured 9% of variation in the mRNA data, and gene set enrichment analysis on the mRNA loadings suggested aetiologies related to immune response pathways and T-cell receptor signalling (Fig 2F), likely due to differences in cell type composition between samples: while the samples consist mainly of B cells, Factor 4 revealed a possible contamination with other cell types such as T cells and monocytes (Appendix Fig S14). Factor 3 explained 11% of variation in the drug response data, capturing differences in the samples' general level of drug sensitivity (Geeleher et al, 2016; Appendix Fig S15).

MOFA identifies outlier samples and accurately imputes missing values

Next, we explored the relationship between inferred factors and clinical annotations, which can be missing, mis-annotated or inaccurate, since they are frequently based on single markers or imperfect surrogates (Westra et al, 2011). Since IGHV status is the major biomarker impacting on clinical care, we assessed the consistency between the inferred continuous Factor 1 and this binary marker. For 176 out of 200 patients, the MOFA factor was in agreement with the clinical IGHV status, and MOFA further allowed for classifying 12 patients that lacked clinically measured IGHV status (Fig EV3A and B). Interestingly, MOFA assigned 12 patients to a different group than suggested by their clinical IGHV label. Upon inspection of the underlying molecular data, nine of these cases showed intermediate molecular signatures, suggesting that they are borderline cases that are not well captured by the binary classification; the remaining three cases were clearly discordant (Fig EV3C and D). Additional independent drug response assays as well as whole exome sequencing data confirmed that these cases are outliers within their IGHV group (Fig EV3E and F).

Figure EV3.
Prediction of IGHV status based on Factor 1 in the CLL data and validation on outlier cases on independent assays
A. Beeswarm plot of Factor 1 with colours denoting agreement between predicted and clinical labels as in (B).
B. Pie chart showing total numbers for agreement of imputed labels with clinical label.
C. Sample-to-sample correlation matrix based on drug response data.
D. Sample-to-sample correlation matrix based on methylation data.
E. Drug response to ONO-4509 (not included in the training data): Boxplots for the viability values in response to ONO-4509. The three outlier samples are shown in the middle; on the left and right, the viabilities of the other M-CLL and U-CLL samples are shown, respectively. The panels show different drug concentrations tested. Boxes represent the first and third quartiles of the values for M-CLL and U-CLL samples; for individual patients, the single value is shown.
F. Whole exome sequencing data on IGHV genes (not included in the training data): the number of mutations found on IGHV genes using whole exome sequencing is shown on the y-axis, separately for U-CLL and M-CLL samples. The three outlier samples are labelled.

As incomplete data is a common problem in studies that combine multiple high-throughput assays, we assessed the ability of MOFA to fill in missing values within assays as well as when entire data modalities are missing for some of the samples. For both imputation tasks, MOFA yielded more accurate predictions than other established imputation strategies, including imputation by feature-wise mean, SoftImpute (Mazumder et al, 2010) and a k-nearest neighbour method (Troyanskaya et al, 2001; Fig EV4, Appendix Fig S16), and MOFA was more robust than GFA, especially in the case of missing assays (Appendix Fig S17).

Figure EV4. Imputation of missing values in the drug response assay of the CLL data
A, B.
Considered were MOFA, SoftImpute, imputation by feature-wise mean (Mean) and k-nearest neighbour (kNN). Shown are averages of the mean squared error (MSE) across 15 imputation experiments for increasing fractions of missing data, considering (A) values missing at random and (B) entire assays missing for samples at random. Error bars denote plus or minus two standard errors.

Latent factors inferred by MOFA are predictive of clinical outcomes

Finally, we explored the utility of the latent factors inferred by MOFA as predictors in models of clinical outcomes. Three of the 10 factors identified by MOFA were significantly associated with time to next treatment (Cox regression, Materials and Methods, FDR < 1%, Fig 4A and B): Factor 1, related to the B-cell of origin, and Factors 7 and 8, both associated with chemo-immunotherapy treatment prior to sample collection (P < 0.01, t-test). In particular, Factor 7 captures del17p and TP53 mutations as well as differences in methylation patterns of oncogenes (Garg et al, 2014; Fluhr et al, 2016; Appendix Fig S18), while Factor 8 is associated with WNT signalling (Appendix Fig S19).

Figure 4. Relationship between clinical data and latent factors
Association of MOFA factors to time to next treatment using a univariate Cox regression with N = 174 samples (96 of