Figures

Model overview: MOFA takes M data matrices as input (Y1,…, YM), one or more from each data modality, with co‐occurrent samples but features that are not necessarily related and that can differ in number. MOFA decomposes these matrices into a matrix of factors (Z) for each sample and M weight matrices, one for each data modality (W1,…, WM). White cells in the weight matrices correspond to zeros, i.e. inactive features, whereas the cross symbol in the data matrices denotes missing values.
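
In matrix notation, the decomposition described in this caption can be sketched as follows. The dimension symbols N (samples), D_m (features in modality m) and K (factors), as well as the residual term, are notation assumed here for illustration and are not given in the caption itself.

```latex
\mathbf{Y}_m = \mathbf{Z}\,\mathbf{W}_m^{\top} + \boldsymbol{\varepsilon}_m,
\qquad m = 1, \dots, M,
\qquad
\mathbf{Y}_m \in \mathbb{R}^{N \times D_m},\;
\mathbf{Z} \in \mathbb{R}^{N \times K},\;
\mathbf{W}_m \in \mathbb{R}^{D_m \times K}
```

The factor matrix Z is shared across all modalities, whereas each weight matrix W_m is specific to one modality; a zero entry in W_m means that the corresponding feature is inactive for that factor.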

The fitted MOFA model can be queried for different downstream analyses, including (i) variance decomposition, assessing the proportion of variance explained by each factor in each data modality, (ii) semi‐automated factor annotation based on the inspection of loadings and gene set enrichment analysis, (iii) visualization of the samples in the factor space and (iv) imputation of missing values, including missing assays.
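
The variance decomposition in (i) can be illustrated with a minimal sketch in plain Python. It assumes a coefficient-of-determination style measure computed over observed entries only, with missing values encoded as `None`; the function name `variance_explained` and the toy matrices are hypothetical and are not part of the MOFA software.

```python
def variance_explained(Y, Z, W, k=None):
    """Fraction of variance in one data matrix explained by factor k
    (or by all factors jointly if k is None).

    Y: N x D list of lists, with None marking a missing value
    Z: N x K factor matrix, W: D x K weight matrix
    Assumes every feature has at least one observed value.
    """
    n_features = len(Y[0])
    # Feature-wise means over the observed entries only.
    means = []
    for d in range(n_features):
        observed = [row[d] for row in Y if row[d] is not None]
        means.append(sum(observed) / len(observed))

    factors = range(len(Z[0])) if k is None else [k]
    ss_res = 0.0
    ss_tot = 0.0
    for n, row in enumerate(Y):
        for d, y in enumerate(row):
            if y is None:
                continue  # missing values do not contribute
            centred = y - means[d]
            predicted = sum(Z[n][j] * W[d][j] for j in factors)
            ss_res += (centred - predicted) ** 2
            ss_tot += centred ** 2
    return 1.0 - ss_res / ss_tot


# Toy example: 4 samples, 2 features, 2 factors, one missing entry.
Z = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
W = [[2.0, 0.0], [0.0, 3.0]]
Y = [[2.0, None], [-2.0, 0.0], [0.0, 3.0], [0.0, -3.0]]

r2_factor1 = variance_explained(Y, Z, W, k=0)
r2_factor2 = variance_explained(Y, Z, W, k=1)
r2_total = variance_explained(Y, Z, W)
```

In this toy case the two factors act on different features, so their individual contributions add up to the total; with correlated factors that need not hold.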

Time required for model training for GFA (red), MOFA (blue) and iCluster (green) as a function of number of factors K, number of features D, number of samples N and number of views M. Baseline parameters were M = 3, K = 10, D = 1,000 and N = 100 and 5% missing values. Shown is the average time across 10 trials; error bars denote the standard deviation. iCluster is only shown for the lowest M as all other settings require on average more than 200 min for training.

Figure 3. Characterization of the inferred factor associated with the differentiation state of the cell of origin

Beeswarm plot with Factor 1 values for each sample with colours corresponding to three groups found by 3‐means clustering with low factor values (LZ), intermediate factor values (IZ) and high factor values (HZ).
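
As an illustration of how samples could be split into the three groups, the following sketch runs k-means (Lloyd's algorithm) with k = 3 on one-dimensional factor values. The initialisation, the function name `kmeans_1d` and the example values are assumptions made for this sketch; the caption does not state which implementation or settings were used.

```python
def kmeans_1d(values, k=3, n_iter=100):
    """Lloyd's algorithm on scalar values; returns (labels, centres)."""
    ordered = sorted(values)
    # Deterministic start at evenly spaced order statistics
    # (an illustrative choice, not taken from the article).
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]

    def assign(cs):
        # Each value goes to its nearest centre.
        return [min(range(k), key=lambda c: abs(v - cs[c])) for v in values]

    for _ in range(n_iter):
        labels = assign(centres)
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # Keep the old centre if a cluster happens to be empty.
            updated.append(sum(members) / len(members) if members else centres[c])
        if updated == centres:
            break  # converged
        centres = updated
    return assign(centres), centres


# Hypothetical Factor 1 values for nine samples.
factor1 = [-2.1, -1.9, -2.0, 0.1, 0.0, -0.1, 1.9, 2.0, 2.2]
labels, centres = kmeans_1d(factor1)

# Name the clusters by the rank of their centre: low, intermediate, high.
rank = {c: r for r, c in enumerate(sorted(range(3), key=lambda c: centres[c]))}
groups = [("LZ", "IZ", "HZ")[rank[lab]] for lab in labels]
```

Naming the clusters by the rank of their centre makes the LZ/IZ/HZ assignment independent of the arbitrary cluster indices returned by the algorithm.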

Absolute loadings for the genes with the largest absolute weights in the mRNA data. Plus or minus symbols on the right indicate the sign of the loading. Genes highlighted in orange were previously described as prognostic markers in CLL and associated with IGHV status (Vasconcelos et al, 2005; Maloum et al, 2009; Trojani et al, 2012; Morabito et al, 2015; Plesingerova et al, 2017).

Heatmap of gene expression values for genes with the largest weights as in (B).

Absolute loadings of the drugs with the largest weights, annotated by target category.

Drug response curves for two of the drugs with top weights, stratified by the clusters as in (A).

Figure EV3. Prediction of IGHV status based on Factor 1 in the CLL data and validation on outlier cases on independent assays

Beeswarm plot of Factor 1 with colours denoting agreement between predicted and clinical labels as in (B).

Pie chart showing total numbers for agreement of imputed labels with clinical label.

Sample‐to‐sample correlation matrix based on drug response data.

Sample‐to‐sample correlation matrix based on methylation data.

Drug response to ONO‐4509 (not included in the training data): Boxplots for the viability values in response to ONO‐4509. The three outlier samples are shown in the middle; on the left and right, the viabilities of the other M‐CLL and U‐CLL samples are shown, respectively. The panels show different drug concentrations tested. Boxes represent the first and third quartiles of the values for M‐CLL and U‐CLL samples; for the individual outlier patients, the single value is shown.

Whole exome sequencing data on IGHV genes (not included in the training data): the number of mutations found on IGHV genes using whole exome sequencing is shown on the y‐axis, separately for U‐CLL and M‐CLL samples. The three outlier samples are labelled.

Association of MOFA factors to time to next treatment using a univariate Cox regression with N = 174 samples (96 of which are uncensored cases) and P‐values based on the Wald statistic. Error bars denote 95% confidence intervals. Numbers on the right denote P‐values for each predictor.

Kaplan–Meier plots measuring time to next treatment for the individual MOFA factors. The cut‐points on each factor were chosen using maximally selected rank statistics (Hothorn & Lausen, 2003), and P‐values were calculated using a log‐rank test on the resulting groups.
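
For readers unfamiliar with the curves in these plots, the sketch below computes the Kaplan–Meier product-limit estimate for a single group in plain Python. It covers only the estimator; the cut-point selection by maximally selected rank statistics and the log-rank test are not shown. The function name `kaplan_meier` and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  follow-up time per patient
    events: True if the event (here, next treatment) was observed,
            False if the patient was censored at that time
    Returns a list of (time, survival probability) at each event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e)
        n_leaving = sum(1 for ti in times if ti == t)
        if n_events:
            # Multiply by the conditional probability of passing time t.
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        # Both events and censored cases leave the risk set after t.
        at_risk -= n_leaving
    return curve


# Toy data: five patients, two of them censored.
times = [1, 2, 2, 3, 4]
events = [True, True, False, True, False]
curve = kaplan_meier(times, events)
```

Censored patients lower the number at risk without producing a step in the curve, which is why the estimate drops only at observed event times.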

Prediction accuracy of time to treatment for N = 174 patients using multivariate Cox regression trained using the 10 factors derived using MOFA, as well as using the first 10 components obtained from PCA applied to the corresponding single data modalities and the full data set (assessed on hold‐out data). Shown are average values of Harrell's C‐index from fivefold cross‐validation. Error bars denote standard error of the mean.
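
Harrell's C‐index measures how often a model ranks patients correctly: among all comparable pairs, it is the fraction in which the patient with the earlier observed event also has the higher predicted risk. The following plain-Python sketch assumes the common convention that ties in predicted risk count as one half and that pairs with tied times are skipped; the function name `harrell_c` and the toy data are hypothetical.

```python
def harrell_c(times, events, risk):
    """Concordance between predicted risk and observed event times.

    A pair (i, j) is comparable if patient i had an observed event
    strictly before the follow-up time of patient j.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0  # earlier event, higher risk
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions
    return concordant / comparable


# Toy hold-out set: four patients, the third one censored.
times = [2, 4, 6, 8]
events = [True, True, False, True]
risk = [0.9, 0.2, 0.5, 0.1]

c_index = harrell_c(times, events, risk)
c_perfect = harrell_c(times, events, [4, 3, 2, 1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. In a cross‐validation setting such as the one described in the caption, the index would be computed on each hold‐out fold and then averaged.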