Gene Set Enrichment Meta-Analysis with GSEMA package
In GSEMA: Gene Set Enrichment Meta-Analysis

knitr::opts_chunk$set(echo = TRUE)

Introduction

GSEMA is a package designed to perform gene set enrichment meta-analysis. The gene set enrichment meta-analysis allows for obtaining the differentially regulated gene sets or pathways that are shared across various studies.
Specifically, GSEMA applies a meta-analysis based on the combination of effect sizes obtained from single sample enrichment (SSE) analysis applied to individual the different studies. The different effect sizes of each study for each of the gene sets are calculated using a difference of means based on the enrichment scores obtained from SSE analysis. GSEMA offers various methods to obtain the statistics for the difference of means.

Subsequently, GSEMA allows for applying the various steps typical of a meta-analysis, in a similar way to the gene expression meta-analysis (@Toro_2020), although different adjustments to the calculated statistics are performed in order to correct the possible existence of false positive bias.

This vignette provides a tutorial-style introduction to all the steps necessary to properly conduct gene set enrichment meta-analysis using the GSEMA package.

Previous steps: Meta-analysis object

GSEMA uses a specific object as input, which is a list of nested lists where each nested list corresponds to a study. This object can be created directly by the users or they can use createObjectMApath() function to create it.

For the examples that are going to be shown, synthetic data will be used. We load the sample data into our R session.

library(GSEMA)
data("simulatedData")

study1Ex, study2Ex: two expression matrices.
study1Pheno, study2Pheno: two phenodata objects.
GeneSets: a list of gene sets with each element are the genes that belong to a pathway.
objectMApathSim: the meta-analysis object created from the different expression matrices and phenodatas.

Meta-analysis object creation (objectMApath)

As previously stated, the meta-analysis input in GSEMA is a list of nested lists. Each nested list contains two elements:

A gene set matrix with gene sets in rows and samples in columns *A vector of 0 and 1 indicating the group of each sample. 0 represents reference group (usually controls) and 1 represents experimental group (usually cases).

This object can be created directly by the user or we can make use of createObjectMApath() function, which creates the objectMApath after indicating:

the reference group (refGroups) and the experimental group (expGroups) in the phenodata.
the gene sets (geneSets) that will be consider.
the single sample enrichment method to applied to calculate the gene sets matrices

createObjectMApath() function allows to create the object needed to perform meta-analysis. In this case, it is necessary to indicate as input of the function the variables that contain the experimental and reference groups:

listEX: a list of dataframes or matrix (genes in rows and samples in columns). A list of ExpressionSets can be used too:

#List of expression matrices
listMatrices <- list(study1Ex, study2Ex)

listPheno: a list of phenodatas (samples in rows and covariables in columns). If the object listEX is a list of ExpressionSets this element can be null.

listPhenodata <- list(study1Pheno, study2Pheno)

namePheno: a list or vector of the different column names or column positions from the pheno used for perfoming the comparison between groups. Each element of namePheno correspont to its equivalent element in the listPheno. (default a vector of 1, all the first columns of each elements of listPheno are selected)
expGroups: a list or vector of the group names or positions from namePheno variable used as experimental group (cases) to perform the comparison (default a vector of 1, all the first groups are selected).
refGroups: a list or vector of the group names or positions from namePheno variable used as reference group (controls) to perform the comparison (default a vector of 2, all the second groups are selected).

It is important to note that if any element does not belong to the experimental or the reference group, that sample is not taken into account in the creation of meta-analysis object.

Here, we have included an example to show how exactly the function is used:

Since this function can be a bit complicated if there are many datasets, we recommend creating a vector to keep the column names of the phenodatas that contains the variable that identifies the groups to compare (namePheno argument). Moreover, we should create two others lits to indicate how to identify experimental (cases) and reference (controls) groups in these variables (expGroups and refGroups arguments).

If we look at the example phenodatas we can observed that the groups variable is "condition". Experimental group is named as "Case" and reference group as "Healthy".

study1Pheno$Condition

geneSets: a list of gene sets with each element are the genes that belong to a pathway:

head(names(GeneSets))
GeneSets[[1]]

pathMethod: a character string indicating the method to calculate the gene sets matrices. The available methods are:
- "GSVA": Gene Set Variation method (@Hanzelmann_2013)
- "Zscore": Z-score method (@Lee_2008)
- "ssGSEA": Single Sample Gene Set Enrichment Analysis method (@Barbie_2009)
- "Singscore": Single sample scoring of molecular phenotypes (@Foroutan_2018)
minSize: Minimum size of the resulting gene sets after gene identifier mapping. By default, the minimum size is 7.
kdcf: Only neccesary for the GSVA method. Character vector of length 1 denoting the kernel to use during the non -parametric estimation of the cumulative distribution function of expression levels across samples. By default, kcdf="Gaussian" which is suitable when input expression values are continuous, such as microarray. When input expression values are integer counts, such as those derived from RNA-seq experiments, then this argument should be set to kcdf="Poisson".
normalize: boolean specifying if the gen set matrices should be normalized. Default value "TRUE".
n.cores: Number of cores to use in the parallelization of the datsets. By default, n.cores=1.
internal.n.cores: Number of cores to use in the parallelization of the single sample enrichment methods. By default internal.n.cores= 1.

In parallelization, several aspects must be considered. "n.cores" refers to the parallelization of studies or datasets. Therefore, if we have 3 studies, the maximum number for "n.cores" will be 3. internal.n.cores refers to the parallelization of single sample enrichment methods. This is especially recommended for the ssGSEA method. For Singscore and GSVA, it may also be advisable. The process is parallelized based on the samples in each study. Therefore, the larger the number of samples, the slower the process will be. The number of cores that the computer will use is the multiplication of both parameters n.cores * internal.n.cores = total cores.

We all this information we can create the vector for namePheno argument and the two list for expGroups and refGroups:

listMatrices <- list(study1Ex, study2Ex)
listPhenodata <- list(study1Pheno, study2Pheno)
phenoGroups <- c("Condition","Condition", "Condition")
phenoCases <- list("Case", "Case", "Case")
phenoControls <- list("Healthy", "Healthy", "Healthy")
objectMApathSim <- createObjectMApath(
    listEX = listMatrices,
    listPheno = listPhenodata, namePheno = phenoGroups,
    expGroups = phenoCases, refGroups = phenoControls,
    geneSets = GeneSets,
    pathMethod = "Zscore",
    n.cores = 1,
    internal.n.cores = 1, minSize = 7)

The result obtained is the proper object to perform meta-analysis (objectMApath).

Performing Meta-analysis

GSEMA package has implemented gene set enrichment meta-analysis techniques based on the combination of effect sizes in the metaAnalysisDEpath() function. Although this function can be applied directly, it is advisable to consider some previous steps.

Filtering paths with low expression

Before performing the meta-analysis, it is important to filter out those pathways with low expression in the datasets. This is because the results of the meta-analysis could be biased by the presence of pathways with low expression, which may lead to an increase in the number of false positives. GSEMA provides a function called *filterPaths() that allows to filter out those pathways with low expression. The inputs of this function are:

objectMApath: the meta-analysis object of GSEMA package.
threshold:A number that indicates the threshold to eliminate a gene set. For a eliminate a gene set is necessary that the median for both groups are less than the threshold. If threshold = "sd" the threshold will be the standard deviation of the gene set. The default value is 0.65.

objectMApathSim <- filteringPaths(objectMApathSim, threshold = 0.65)

Calculation of the effects sizes

GSEMA has implemented three different estimator for calculating the effect sizes in each study:

The "limma" method used the limma package (@limmaR_2020) to calculate the effect size and the variance of the effect size. The effect size is calculated from the moderated Student's t computed by limma. From it, the estimator of Hedges'g and its corresponding variance are obtained. In this way, some of the false positives obtained by the "SMD" method are reduced.
The "SMD" (Standardized mean different) method calculates the effect size using the Hedges'g estimator @Hedges_1981.
The "MD" (raw mean different) calculates the effects size as the difference between the means of the two groups (@Borenstein_2021).

Performing meta-analysis: metaAnalysisDEpath()

The metaAnalysisDEpath() function allows to perform a meta-analysis in only one step, needing only the meta-analysis object created previously.

This function has as input:

objectMApath: The meta-analysis object of GSEMA package.
effectS: A list of two elements. The first element is a dataframe with gene sets in rows and studies in columns. Each component of the dataframe is the effect of a gene set in a study. The second element of the list is also a dataframe with the same structure, but in this case each component of the dataframe represent the variance of the effect of a gene set in a study. This argument should be only used in the case that objectMApath argument is null.
measure: A character string that indicates the type of effect size to be calculated. The options are "limma", "SMD" and "MD". The default value is "limma". See details for more information.
WithinVarCorrect A logical value that indicates if the within variance correction should be applied. The default value is TRUE. See details for more information (cite)
typeMethod: a character that indicates the method to be performed:
- "FEM": Fixed Effects model.
- "REM": Random Effects model.
missAllow: a number between 0 and 1 that indicates the maximum proportion of missing values allows in a sample. If the sample has more proportion of missing values, the sample will be eliminated. In the other case, the missing values will be imputed by using the K-NN algorithm included in impute package (@Impute_2019). In case the objectMA has been previously imputed, this element is not necessary.
numData: a number between 0 and 1 that indicates the minimum number of datasets in which a pathway must be contained to be included.

In the following example, we have applied a Random Effect model to the GSEMA object ("objectMApathSim") we have been working with so far. In addition we have applied the "limma" method to calculate the effect size. Finally, we have allowed a 0.3 proportion of missing values in a sample and a gene set must have been contained in all studies:

results <- metaAnalysisESpath(objectMApath = objectMApathSim,
    measure = "limma", typeMethod = "REM", missAllow = 0.3, 
    numData = length(objectMApathSim))

The output of this function is a dataframe with the results of the meta-analysis where rows are the genes and columns are the different variables provided by the meta-analysis:

results[1,]
results[nrow(results),]

Meta analysis results

The results are provided in a dataframe with the variables:

Com.ES: combined effect of the gene set.
ES.var: variance of the combined effect of the gene set.
Qval: total variance of the gene set.
tau2: between-study variance of the gene set.
zval: combined effect value for a standard normal. I can be use in order to find out if the gene set is overexpressed (positive value) or underexpressed (negative value).
Pval: P-value of the meta-analysis for the gene set.
FDR: P-value adjusted of the meta-analysis for the gene set.
numDataset: Number of the datasets in which the gene set is included.

Visualization of the results: heatmap

Finally, we can represent in a heatmap the significant gene sets in order to observe how they are regulated in each of the studies. In heatmapPaths() function we have to include both the object that has been used in the meta-analysis, the result of it and the applied method. In addition, this package offers three different scaling approaches (scaling) in order to compare properly the gene set enrichment scores of the studies in the heatmap:

"zscor": It calculates a z-score value for each gene, that is, the mean gene expression from each gene is subtracted from each gene expression value and then it is divided by the standard deviation.
"swr": Scaling relative to reference dataset approach @Lazar_2013.
"rscale": It uses the rescale function of the scales package to scale the gene expresion @scalesR_2020.
"none": no scaling approach is applied.

Moreover, in regulation argument, we can choose if we want to represent the over regulated or under regulated gene sets:

"up": only up-regulated gene sets are represented.
"down: only down-regulated gene sets are represented
"all": up-regulated and down-regulated gene sets are represented.

We can choose the number of significant gene sets (numSig) that we want to be shown on the graph and the adjusted p-value from which a gene set is considered as significant (fdrSig). In addition, the gene sets that are not presented in one sample are represented in gray.

Here we present an example of the heatmap which have been obtained from the result of applying a random effects model to the object "objectMApathSim" and making use of a "zscor" scaling approach.

res <- heatmapPaths(objectMA=objectMApathSim, results,
            scaling = "zscor", regulation = "all",breaks=c(-2,2),
            fdrSig = 0.05, comES_Sig = 1, numSig=20, fontsize = 5)

Additional information

Calculating individual Effects size

The calculateESpath() function returns the effects size in each of the studies. Moreover, it calculates the variance of each of the effects.

#Effects <- calculateESpath(objectMApath = objectMApathSim, measure = "limma")
#Effects[[1]]["Simulated_Pathway",]
#Effects[[2]]["Simulated_Pathway",]

Session info

sessionInfo()

References

Any scripts or data that you put into this service are public.

GSEMA documentation built on Oct. 14, 2024, 5:09 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

GSEMA
Gene Set Enrichment Meta-Analysis

Gene Set Enrichment Meta-Analysis with GSEMA package
In GSEMA: Gene Set Enrichment Meta-Analysis

Introduction

Previous steps: Meta-analysis object

Meta-analysis object creation (objectMApath)

Performing Meta-analysis

Filtering paths with low expression

Calculation of the effects sizes

Performing meta-analysis: metaAnalysisDEpath()

Meta analysis results

Visualization of the results: heatmap

Additional information

Calculating individual Effects size

Session info

References

Try the GSEMA package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

GSEMA Gene Set Enrichment Meta-Analysis

Gene Set Enrichment Meta-Analysis with GSEMA package In GSEMA: Gene Set Enrichment Meta-Analysis

Introduction

Previous steps: Meta-analysis object

Meta-analysis object creation (objectMApath)

Performing Meta-analysis

Filtering paths with low expression

Calculation of the effects sizes

Performing meta-analysis: metaAnalysisDEpath()

Meta analysis results

Visualization of the results: heatmap

Additional information

Calculating individual Effects size

Session info

References

Try the GSEMA package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

GSEMA
Gene Set Enrichment Meta-Analysis

Gene Set Enrichment Meta-Analysis with GSEMA package
In GSEMA: Gene Set Enrichment Meta-Analysis