# *padma* package: Quick-start guide In padma: Individualized Multi-Omic Pathway Deviation Scores Using Multiple Factor Analysis

knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )  knitr::include_graphics("hex_padma_v2.png")  padma is a package that uses multiple factor analysis to calculate individualized pathway-centric scores of deviation with respect to the sampled population based on multi-omic assays (e.g., RNA-seq, copy number alterations, methylation, etc). In particular, padma characterizes individuals with aberrant multi-omic profiles for a given pathway of interest and quantifies this deviation with respect to the sampled population using a multi-omic consensus representation. Graphical and numerical outputs are provided to identify highly aberrant individuals for a particular pathway of interest, as well as the gene and omics drivers of aberrant multi-omic profiles. The method implemented in this package is described in greater detail in the following pre-print: Rau, A., Manansala, R., Flister, M. J., Rui, H., Jaffrézic, F., Laloë, D.*, and Auer, P. L.* (2019) Individualized multi-omic pathway deviation scores using multiple factor analysis bioRxiv, https://doi.org/10.1101/827022. *These authors contributed equally to this work. Below, we provide a quick-start guide using a subset of the TCGA LUAD multi-omic data (included with the padma package) to illustrate the functionalities and output of the method. # Quick start (tl;dr) As with any R package, after installation the padma package is loaded as follows: library(padma)  Multi-omic data must be formatted as either a MultiAssayExperiment object (preferred) or as a named list of data.frames of matched multi-omic data from a set of individuals, which is internally converted into a MultiAssayExperiment object. A small example dataset in list format is provided in LUAD_subset for multi-omic data (clinical, RNA-seq, CNA, methylation, miRNA-seq) on 13 genes from the D4-GDI signaling pathway in individuals with lung adenocarcinoma from TCGA. In practice, multi-omic data do not need to be subsetted prior to running padma; this reduced dataset is simply used as a minimal illustrative example. A standard padma analysis for the D4-GDI signalling pathway takes as input these pre-formatted MultiAssayExperiment data (called \code{mae} below) and the name of the desired pathway. Note that data should be normalized, batch-corrected, and if appropriate, log-transformed prior to using padma. Built-in pathway names and gene lists from the MSigDB curated pathway set are provided in the msigdb data provided with padma. run_padma <- padma(mae, pathway_name = "c2_cp_BIOCARTA_D4GDI_PATHWAY")  Individualized pathway deviation scores can then be accessed via the assay accessor function, and per-gene contributions to these scores can be accessed via the gene_deviation_scores() accessor. In addition, a variety of plots can be obtained, including a factor map or partial factor map of individuals, and a plot of the percentage contribution to each component by each considered omics. assay(run_padma) factorMap(run_padma) factorMap(run_padma, partial_id = "TCGA-78-7536") omicsContrib(run_padma)  Additional numerical outputs included in the run_padma S4 object of class padmaResults include the per-gene pathway deviation scores and full results from the MFA analysis. See the following section for a more detailed description of these outputs. # Description of padma A portion of the following text is taken from Rau et al. (2019). ## Overview of Multiple Factor Analysis Multiple factor analysis (MFA) represents an extension of principal component analysis for the case where multiple quantitative data tables are to be simultaneously analyzed. In particular, the MFA approach weights each table individually to ensure that tables with more features or those on a different scale do not dominate the analysis; all features within a given table are given the same weight. These weights are chosen such that the first eigenvalue of a PCA performed on each weighted table is equal to 1, ensuring that all tables play an equal role in the global multi-table analysis. According to the desired focus of the analysis, data can be structured either with molecular assays (e.g., RNA-seq, methylation, miRNA-seq, copy number alterations) as tables (and genes as features within omics), or with genes as tables (and molecular assays as features within genes). The MFA weights balance the contributions of each omic or of each gene, respectively. In padma, we focus on the latter strategy in order to allow different omics to contribute to a varying degree depending on the chosen pathway. More precisely, consider a pathway or gene set composed of$p$genes (Figure 1A), each of which is measured using up to$k$molecular assays (e.g., RNA-seq, methylation, miRNA-seq, copy number alterations), contained in the set of gene-specific matrices$X_1,\ldots, X_p$that have the same$n$matched individuals (rows) and$j_1,\ldots, j_p$potentially unmatched variables (columns) in each, where$j_g \in \lbrace 1, \ldots, k\rbrace$for each gene$g = 1,\ldots, p$. Because only the observations and not the variables are matched across data tables, genes may be represented by potentially different subset of omics data (e.g., only expression data for one gene, and expression and methylation data for another). In the first step, these data tables are generally standardized (i.e., centered and scaled). Next, an individual PCA is performed using singular value decomposition for each gene table$X_g$, and its largest singular value$\lambda_g^1$is calculated (Figure 1B). Then, all features in each gene table$X_g$are weighted by$1/\lambda_g^1$, and a global PCA is performed using a singular value decomposition on the concatenated set of weighted standardized tables,$X^\ast = \left[ \frac{X_1}{\lambda_1^1}, \ldots, \frac{X_p}{\lambda_p^1}\right]$(Figure 1C). This yields a matrix of components (i.e., latent variables) in the observation and variable space. The MFA thus provides a consensus across-gene representation of the individuals for a given pathway, and the global PCA performed on the weighted gene tables decomposes the consensus variance into orthogonal variables (i.e., principal components) that are ordered by the proportion of variance explained by each. The coordinates of each individual on these components, also referred to as factor scores, can be used to produce factor maps to represent individuals in this consensus space such that smaller distances reflect greater similarities among individuals. In addition, partial factor scores, which represent the position of individuals in the consensus for a given gene, can also be represented in the consensus factor map; the average of partial factor scores across all dimensions and genes for a given individual corresponds to the factor score (Figure 1D). ## Calculation of individualized pathway deviation scores In the consensus space obtained from the MFA, the origin represents the average" pathway behavior across genes, omics, and individuals; individuals that are projected to increasingly distant points in the factor map represent those with increasingly aberrant values, with respect to this average, for one or more of the omics measures for one or more genes in the pathway. To quantify these aberrant individuals, we propose an individualized pathway deviation score$d_i$based on the multidimensional Euclidean distance of the MFA component loadings for each individual to the origin: \begin{equation} d_i^2 = \sum_{\ell = 1}^L f_{i,\ell}^2, \end{equation } where$f_{i,\ell}$corresponds to the MFA factor score of individual$i$in component$\ell$, and$L$corresponds to the rank of$X^\ast$. Note that this corresponds to the weighted Euclidean distance of the scaled multi-omic data (for the genes in a given pathway) of each individual to the origin. These individualized pathway deviation scores are thus nonnegative, where smaller values represent individuals for whom the average multi-omic pathway variation is close to the average, while larger scores represent individuals with increasingly aberrant multi-omic pathway variation with respect to the average. An individual with a large pathway deviation score is thus characterized by one or more genes, with one or more omic measures, that explain a large proportion of the global correlated information across the full pathway. ## Calculation of individualized per-gene deviation scores In order to quantify the role played by each gene for each individual, we decompose the individualized pathway deviation scores into gene-level contributions. Recall that the average of partial factor scores across all MFA dimensions corresponds to each individual’s factor score. We define the gene-level deviation for a given individual as follows: \begin{equation} d_{i,g}=\frac{\sum_{\ell = 1}^L f_{i,\ell}\left(f_{i,\ell,g}-f_{i,\ell}\right)} {\sum_{\ell = 1}^L f_{i,\ell}^2}, \end{equation} where as before$f_{i,\ell}$corresponds to the MFA factor score of individual$i$in component$\ell$,$L$corresponds to the rank of$X^\ast$, and$f_{i,\ell,g}$corresponds to the MFA partial factor score of individual$i$in gene$g$in component$\ell$. Note that by construction, the contributions of all pathway genes to the overall deviation score sum to 0. In particular, per-gene contributions can take on both negative and positive values according to the extent to which the gene influences the deviation of the overall pathway score from the origin (i.e., the global center of gravity across individuals); large positive values correspond to tables with a large influence on the overall deviation of an individual, while large negative values correspond to genes that tend to be most similar to the global average. # Description of built-in datasets ## LUAD_subset: D4-GDI signaling pathway in the TCGA-LUAD multi-omic data In this vignette, we focus on the example of multi-omic (RNA-seq, methylation, copy number alterations, and miRNA-seq) data for the D4-GDI signaling pathway from individuals with lung adenocarcinoma (LUAD) from The Cancer Genome Atlas (TCGA) database. The multi-omic TCGA data were downloaded and processed as previously described, including batch correction for the plate effect. The total number of individuals considered here is$n=144$. See Rau et al. (2019) for additional details about data processing and mapping of miRNAs to genes. The D4-GDP dissociation inhibitor (GDI) signaling pathway is made up of 13 genes; it was chosen for follow-up here as it was identified as having individualized pathway deviation scores that are significantly positively correlated with smaller progression-free intervals. RNA-seq, methylation, and CNA measures are available for all 13 genes, with the exception of CYCS and PARP1, for which no methylation probes were measured the promoter region. In addition, miRNA-seq data were included for one predicted target pair: hsa-mir-421$\rightarrow$CASP3. The available multi-omic data for the D4-GDI signaling pathway in TCGA LUAD patients is included within the padma package as the LUAD_subset object, which is a named list of data.frames each containing the relevant data (clinical parameters, RNA-seq, methylation, miRNA-seq, CNA) for the$p=13$genes of the D4-GDI pathway in the$n=144$individuals. names(LUAD_subset) lapply(LUAD_subset, class) lapply(LUAD_subset, dim)  There are some important data formatting issues that are worth mentioning here: • The clinical data (LUAD_subset$clinical) have individuals as rows and clinical variables as columns; all other datasets contain assay data with biological entities (e.g., genes, miRNAs) as rows and individuals as columns, with appropriate row and column names. Sample (column) names for assay data correspond to the patient barcodes that are provided in the bcr_patient_barcode column of LUAD_subset$clinical. For LUAD_subset$mirna, the first two columns represent the miRNA ID (miRNA_lc) and corresponding targeted gene symbol (gene).

• A gene_map data.frame can be optionally included to map biological entities (e.g. miRNAs) to gene symbols as needed. By default, a set of 10,754 predicted miRNA gene targets with "Functional MTI" support type from miRTarBase (version 7.0) are used for the gene_map.

• All omics data have already been appropriately pre-processed before subsetting to this pathway (e.g., RNA-seq and miRNA-seq data have been normalized and log-transformed, all datasets were batch-corrected for the plate effect).

• Although padma does require that all datasets have the same number of individuals, and that all datasets are sorted so that individuals are in the same order, each dataset does not need to represent every considered gene. Note that in these data, a single gene (CASP3) is predicted to be targeted by a miRNA (hsa-mir-421).

• Finally, these toy data have been subsetted to include only values relevant to the D4-GDI signaling pathway solely in the interest of space. In practice, users do not need to subset their data prior to input in padma or match miRNAs to genes, as both of these steps are performed automatically for a ser-provided pathway name or set of genes.

## msigdb: the MSigDB canonical pathway gene sets

In our work, we considered the pathways included in the MSigDB canonical
pathways curated gene set catalog, which includes genes whose products are involved in metabolic and signaling pathways reported in curated public databases. We specifically used the 1322 “C2 curated gene sets” catalog from 601MSigDB v5.2 available at http://bioinf.wehi.edu.au/software/MSigDB/as described in the limma Bioconductor package. For convenience, the gene sets corresponding to each pathway have been included in the msigdb data frame:

head(msigdb)


In practice, to use padma a user can either directly provide a vector of gene symbols, or the geneset name as defined in this data frame (e.g., c2_cp_BIOCARTA_41BB_PATHWAY).

## mirtarbase: predicted gene targets of miRNAs

padma integrates multi-omic data by mapping omics measures to genes in a given pathway. Although this assignment of values to genes is relatively straightforward for RNA-seq, CNA, and methylation data, a definitive mapping of miRNA-to-gene relationships does not exist, as miRNAs can each potentially target multiple genes. Many methods and databases based on text-mining or bioinformatics-driven approaches exist to predict miRNA-target pairs. Here, we make use of the curated miR-target interaction (MTI) predictions in miRTarBase (version 7.0),
using only exact matches for miRNA IDs and target gene symbols and predictions with the “Functional MTI” support type. As a reference, we have included these predictions in the built-in mirtarbase data frame:

head(mirtarbase)


In practice, unless an alternative data frame is provided for the gene_map argument of padma, the mirtarbase data frame is used to match miRNA IDs to gene symbols.

To run padma for the D4-GDI pathway, we start by formatting the LUAD_subset data as a MultiAssayExperiment object.

library(MultiAssayExperiment)

omics_data <-
list(rnaseq = LUAD_subset$rnaseq, methyl = LUAD_subset$methyl,
mirna = LUAD_subset$mirna, cna = LUAD_subset$cna)
pheno_data <-
data.frame(LUAD_subset$clinical, row.names = LUAD_subset$clinical$bcr_patient_barcode) mae <- suppressMessages( MultiAssayExperiment::MultiAssayExperiment( experiments = omics_data, colData = pheno_data))  Next, the following command is used: D4GDI <- msigdb[grep("D4GDI", msigdb$geneset), "geneset"]
padma(mae, pathway_name = D4GDI, verbose = FALSE)


Alternatively, padma could be run directly on the list of data.frames contained in LUAD_subset (ensuring that the clinical data are included as colData):

clinical_data <- data.frame(LUAD_subset$clinical) rownames(clinical_data) <- clinical_data$bcr_patient_barcode

colata = clinical_data,
pathway_name = D4GDI,
verbose = FALSE)


Note that we have used the misgdb built-in database to find the full name of the D4-GDI pathway, "c2_cp_BIOCARTA_D4GDI_PATHWAY". Alternatively, a vector of gene symbols for the pathway of interest could instead be provided to the pathway_name argument:

D4GDI_genes <- unlist(strsplit(
msigdb[grep("D4GDI", msigdb$geneset), "symbol"], ", ")) D4GDI_genes run_padma_again <- padma(mae, pathway_name = D4GDI_genes, verbose = FALSE)  As noted above, it is required to provide either a recognized pathway name (see msigdb$geneset) or a vector of gene symbols. In most cases, users would provide the full multi-omics data in the omics_data argument of padma, rather than a pre-subsetted dataset as is the case for the built-in LUAD_subset data.

In the following, we detail the various numerical and graphical outputs that an be accessed after running padma; note that the results are an S4 object of class "padmaResults", an extension of the RangedSummarizedExperiment class.

## Numerical outputs

The numerical outputs of padma include the following (which can all be accessed using the matching accessor functions):

There are two primary numerical outputs from padma:

• pathway deviation scores: accessed via assay(...),
a data.frame providing the individualized pathway deviation scores for each individual. The group column indicates whether an individual was included as part of the reference population ("Base") or as a supplementary individual projected onto the reference ("Supp"), as indicated by the user in the base_ids and supp_ids arguments of padma.
• pathway gene-specific deviation scores: accessed via pathway_gene_deviation(...), a matrix of dimension $n$ (nuumber of individuals) x $p$ (number of genes in the pathway) containing the individualized per-gene deviation scores

There are also several outputs related to the MFA itself, and obtained via the FactorMineR R package implementation of the MFA:

• MFA_results: accessed via the MFA_results(...), a named list containing detailed MFA output. When padma is run with full_results = TRUE, these include eigenvalues (eig), the percent contributions of each individual, gene, and omic to the MFA components (ind_contrib_MFA, gene_contrib_MFA, omics_contrib_MFA), the formatted gene tables used as input for the MFA (gene_tables), the pairwise Lg coefficients between genes (gene_Lf_MFA), and the full FactoMineR MFA output (total_MFA), which is an object of class "MFA". See the FactoMineR reference manual for a full description of this full MFA output. When padma is run with full_results = FALSE, these include eigenvalues (eig), and a summary of the cumulative contributions of each individual, gene, and omic to the first ten and full set of MFA components (ind_contrib_MFA_summary, gene_contrib_MFA_summary, omics_contrib_MFA_summary). In both cases, eig is a matrix of dimension $c$ (number of MFA components) x 3 providing the respective eigenvalues, percentage of variance explained, and cumulative percentage of variance explained by each MFA component.

Finally, there are additional outputs related to the pathway itself that can be accessed via appropriate accessor functions:

• pathway_name: accessed via pathway_name(...), the user-provided name of pathway analyzed by padma. If a vector of gene symbols was provided rather than a built-in name, "custom" is returned.
• ngenes: accessed via ngenes(...), number of genes with data available in the pathway
• imputed_genes: accessed via imputed_genes(...), names of entities for which values were automatically imputed by padma
• removed_genes: accessed via removed_genes(...), names of entities which were automatically removed by padma due to small variance (< 10e-5) or all missing values across samples.

## Graphical outputs

padma includes three primary plotting functions, all of which can be produced using ggplot2 (default) or base graphics (using argument ggplot = FALSE). Note that when ggplot2 is used, ggrepel is additionally used to better position sample labels.

• factorMap: plot the MFA components for each individual for a selected pair of components. This would typically be done for the first few components, and serves as a visualization of the structuring of indviduals for the pathway; for example, below the factor map suggests that individual TCGA-78-7536 is likely to have a larger overall pathway deviation score given that it is positioned quite far from the origin (for these two axes).
factorMap(run_padma, dim_x = 1, dim_y = 2)


The base version of this plot may be produced as follows.

factorMap(run_padma, dim_x = 1, dim_y = 2, ggplot = FALSE)

• factorMap (partial mode): plot the MFA partial components for one specific individual for a pair of components. This function would be used to zero in on one or several indivduals (for example, TCGA-78-7536) to visualize which genes appear to be pulling the overall score away from the origin. Here we note the seeming large influence of CASP1 and CASP3. Note that the individual's overall pathway score, which is represented as the large black point, is the same as that from the previous plot.
factorMap(run_padma,
partial_id = "TCGA-78-7536",
dim_x = 1, dim_y = 2)


As before, ggplot2 plotting can be optionally turned off, if desired.

factorMap(run_padma,
partial_id = "TCGA-78-7536",
dim_x = 1, dim_y = 2, ggplot = FALSE)

• omicsContrib: visualize the contribution of each omics to the variation of a subset of MFA components, and summarize this contribution as a weighted overall average. In this plot, transparency of the bars reflects the percentage variance explained for each component. Note that if ggplot2 is used, the cowplot package is used to combine graphics.
omicsContrib(run_padma, max_dim = 10)


Once again, ggplot2 plotting can be turned off if desired.

omicsContrib(run_padma, max_dim = 10, ggplot = FALSE)


### Projecting supplementary individuals onto a reference consensus

Unless otherwise specified, padma will make use of all individuals to calculate pathway deviation scores. However, in some cases users may wish to perform the MFA on a set of reference individuals and subsequently project a set a supplementary individuals onto the reference consensus (e.g., in a case where healthy individuals are to be used as a reference, and pathway deviation scores of diseased individuals are to be calculated relative to them).

In such a case, the base_ids and supp_ids arguments of padma can be used to provide the sampel names for individuals in each group, respectively. For instance, if the first ten individuals are in the reference group, and the user wishes to project individuals 15 to 20 as supplementary, the following code could be used:

run_padma_supp <-
padma(mae, pathway_name = D4GDI, verbose = FALSE,
base_ids = sampleMap(mae)$primary[1:10], supp_ids = sampleMap(mae)$primary[15:20])


Note that there must not be any overlap in the indices provided for base_ids and supp_ids.

### Missing data imputation

When missing values are included in the omics_data argument, by default padma uses mean imputation to replace these values prior to calculating individualized pathway scores. Alternatively, these values can be imputed via a preliminary MFA as implemented in the [missMDA}(https://cran.r-project.org/web/packages/missMDA/index.html) package, which can in some cases be slightly more time consuming. To do, the impute argument must be activated:

run_padma_impute <-
impute = TRUE, verbose = FALSE)


### Providing miRNA target predictions

As noted above, if miRNA-seq data (or other data in which biological entities are not genes) are included in the omics_data argument, a mapping to genes must be provided to padma in the gene_map argument. By default, the mapping in mirtarbase taken from miRTarBase is used within padma. If another mapping is desired by the user, this should be provided as a 2-column data frame (entity name and corresponding gene) in the gene_map argument.

### Requesting concise results from padma

Because padma stores the full set of MFA numerical results above (including factor components and partial factor components for all individuals), the full output can become quite large if run for very large pathways and/or looped over a large number of pathways. By default, padma returns this full set of results, but users can request a space-saving concise output via the full_results argument:

run_padma_concise <-
padma(mae, pathway_name = D4GDI, full_results = FALSE)


# FAQ

1. Can padma be used with non-continuous data, like binary per-gene somatic mutation calls?

At the current time, padma is only formatted to correctly handle continuous data. However, the MFA approach (as implemented in the FactoMineR package, which is called by padma) is technically able to handle continuous, categorical, and binary data, so this is merely a question of extending the existing padma code to handle more general situations.

2. Can padma integrate omics data that cannot be summarized to the gene-level? (e.g., SNP genotyping or metabolic data)

As it is currently written, padma is intended to provide inference at the pathway-level, where members of each pathway correspond to genes. In theory the MFA could be performed using each omic as a table (rather than each gene), in which case no gene-level summarization would be required. padma is not currently equipped to handle this more general setting.

# To-do list and ideas for work in progress

The following items are on my to-do list for improvements:

• Incorporation of known hierarchical structure among genes in a given pathway