tsea_mGSEA: Target Set Enrichment Analysis (TSEA) with mGSEA Algorithm
In signatureSearch: Environment for Gene Expression Searching Combined with Functional Enrichment Analysis


The tsea_mGSEA function performs a Modified Gene Set Enrichment Analysis (mGSEA) that supports test sets (e.g. genes or protein IDs) with duplications. The duplication support is achieved by a weighting method for duplicated items, where the weighting is proportional to the frequency of the items in the test set.

tsea_mGSEA(
  drugs,
  type = "GO",
  ont = "MF",
  nPerm = 1000,
  exponent = 1,
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  minGSSize = 5,
  maxGSSize = 500,
  verbose = FALSE,
  dt_anno = "all",
  readable = FALSE
)

`drugs`	character vector containing drug identifiers used for functional enrichment testing. This can be the top ranking drugs from a GESS result. Internally, drug test sets are translated to the corresponding target protein test sets based on the drug-target annotations provided under the `dt_anno` argument.
`type`	one of 'GO', 'KEGG' or 'Reactome'
`ont`	character(1). If type is 'GO', assign `ont` (ontology) one of 'BP','MF', 'CC' or 'ALL'. If type is 'KEGG' or 'Reactome', `ont` is ignored.
`nPerm`	integer defining the number of permutation iterations for calculating p-values
`exponent`	integer value used as the exponent in the GSEA algorithm. It defines the weight of the items in the item set S.
`pAdjustMethod`	p-value adjustment method, one of 'holm', 'hochberg', 'hommel', 'bonferroni', 'BH', 'BY', 'fdr'
`pvalueCutoff`	double, p-value cutoff
`minGSSize`	integer, minimum size of each gene set in annotation system
`maxGSSize`	integer, maximum size of each gene set in annotation system
`verbose`	TRUE or FALSE, print message or not
`dt_anno`	drug-target annotation source. Currently, one of 'DrugBank', 'CLUE', 'STITCH' or 'all'. If `dt_anno` is 'all', the targets from the DrugBank, CLUE and STITCH databases will be combined. Setting `dt_anno` to 'all' is usually recommended since it provides the most complete drug-target annotations. Choosing a single annotation source results in sparser drug-target annotations (particularly CLUE), and thus less complete enrichment results.
`readable`	TRUE or FALSE. Applies when type is 'KEGG' or 'Reactome' and indicates whether to convert gene Entrez IDs to gene symbols in the 'itemID' column of the result table.

The original GSEA method proposed by Subramanian et al., 2005 uses predefined gene sets S defined by functional annotation systems such as GO and KEGG. The goal is to determine whether the genes in S are randomly distributed throughout a ranked test gene list L (e.g. all genes ranked by log2 fold changes) or enriched at the top or bottom of the test list. This is expressed by an Enrichment Score (ES) reflecting the degree to which a set S is overrepresented at the extremes of L.

For TSEA, the query is a target protein set in which duplicated entries need to be maintained. To perform GSEA with duplication support, here referred to as mGSEA, the target set is transformed into a score-ranked target list L_tar covering all targets provided by the corresponding annotation system. Each target in the query target set is assigned a weight equal to its frequency divided by the number of targets in the target set. Targets present in the annotation system but absent from the target set receive a score of 0. Thus, every target in the annotation system is assigned a score, and the targets are then sorted by decreasing score to obtain L_tar.
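The construction of L_tar can be sketched in a few lines of base R. This is only an illustration, not the package's internal code: the target names, the `universe` vector and all variable names are hypothetical, and the sketch assumes that "the number of targets in the target set" counts duplicated entries.

```r
# Hypothetical query target set; duplicates arise when several drugs share a target.
query_targets <- c("HDAC1", "HDAC1", "HDAC2", "EGFR")
# Hypothetical universe of targets covered by the annotation system.
universe <- c("EGFR", "HDAC1", "HDAC2", "TP53", "BRCA1")

# Weight = frequency of a target divided by the size of the query set
# (assumption: duplicates are included in the size).
freq <- table(query_targets)
weights <- as.numeric(freq) / length(query_targets)
names(weights) <- names(freq)

# Targets absent from the query keep a score of 0.
scores <- setNames(numeric(length(universe)), universe)
shared <- intersect(names(weights), universe)
scores[shared] <- weights[shared]

# Sort by decreasing score to obtain the ranked list L_tar.
L_tar <- sort(scores, decreasing = TRUE)
print(L_tar)
```

In this toy case HDAC1 occurs twice among four query entries, so it receives the largest weight and ranks first, while TP53 and BRCA1 stay at 0.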

For TSEA, the original GSEA method cannot be used directly because L_tar contains a large proportion of zeros. If the scores of all genes in set S are zero, N_R (the sum of the scores of the genes in set S) is zero and cannot be used as the denominator. In this case, ES is set to -1. If only some genes in set S have zero scores, N_R is set to a larger number to decrease the weight of the genes in S that have non-zero scores.

The reason for this modification is that if only one gene in gene set S has a non-zero score and this gene ranks high in L_tar, the weight of this gene will be 1 resulting in an ES(S) close to 1. Thus, the original GSEA method will score the gene set S as significantly enriched. However, this is undesirable because in this example only one gene is shared among the target set and the gene set S. Therefore, giving small weights (lowest non-zero score in L_tar) to genes in S that have zero scores could decrease the weight of the genes in S that have non-zero scores, thereby decreasing the false positive rate. To favor truly enriched functional categories (gene set S) at the top of L_tar, only gene sets with positive ES are selected.
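The effect of the padded denominator can be illustrated with a small running-sum function. This is a hedged sketch of one reading of the description above, not the implementation used by tsea_mGSEA: the function name `mgsea_es`, its `pad` argument, and the choice to add the lowest non-zero score of L_tar to N_R once per zero-score gene in S are all assumptions made for illustration.

```r
# Illustrative enrichment score with an optionally padded denominator N_R.
# Hypothetical helper; not part of signatureSearch.
mgsea_es <- function(L_tar, S, exponent = 1, pad = TRUE) {
  hits <- names(L_tar) %in% S
  hit_scores <- abs(L_tar[hits])^exponent
  # All genes in S have zero scores: N_R would be 0, so ES is set to -1.
  if (sum(hit_scores) == 0) return(-1)
  N_R <- sum(hit_scores)
  if (pad) {
    # Assumption: each zero-score gene in S adds the lowest non-zero score in L_tar.
    min_nonzero <- min(abs(L_tar[L_tar != 0]))^exponent
    N_R <- N_R + sum(hit_scores == 0) * min_nonzero
  }
  # Step up by score / N_R at genes in S, step down uniformly at all other genes.
  step <- ifelse(hits, abs(L_tar)^exponent / N_R, -1 / sum(!hits))
  running <- cumsum(step)
  # ES is the maximum deviation of the running sum from zero.
  running[which.max(abs(running))]
}

# Toy ranked list and a gene set sharing a single non-zero target with it.
L_tar <- c(HDAC1 = 0.5, EGFR = 0.25, HDAC2 = 0.25, TP53 = 0, BRCA1 = 0)
S <- c("HDAC1", "TP53", "BRCA1")

es_unpadded <- mgsea_es(L_tar, S, pad = FALSE)
es_padded <- mgsea_es(L_tar, S)
es_all_zero <- mgsea_es(L_tar, c("TP53", "BRCA1"))
```

With a single shared top-ranked target, the unpadded score is driven entirely by that one gene, whereas padding N_R lowers it. A gene set whose members all have zero scores gets an ES of -1 and is dropped by the positive-ES filter.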

A feaResult object. Its result table contains the enriched functional categories (e.g. GO terms or KEGG pathways) ranked by the corresponding enrichment statistic.

The TSEA results (including tsea_mGSEA) stored in the feaResult object can be returned with the result method in tabular format, here a tibble. The columns of this tibble are described below.

enrichmentScore: ES from the GSEA algorithm (Subramanian et al., 2005). The score is calculated by walking down the gene list L, increasing a running-sum statistic when a gene in S is encountered and decreasing it when a gene not in S is encountered. The magnitude of the increment depends on the gene scores. The ES is the maximum deviation from zero encountered in the random walk. It corresponds to a weighted Kolmogorov-Smirnov-like statistic.
NES: Normalized enrichment score. The positive and negative enrichment scores are normalized separately by permuting the composition of the gene list L nPerm times, and dividing the enrichment score by the mean of the permutation ES with the same sign.
pvalue: The nominal p-value of the ES is calculated using a permutation test. Specifically, the composition of the gene list L is permuted and the ES of the gene set is recomputed for the permuted data, generating a null distribution for the ES. The p-value of the observed ES is then calculated relative to this null distribution.
leadingEdge: Genes in the gene set S (functional category) that appear in the ranked list L at, or before, the point where the running sum reaches its maximum deviation from zero. It can be interpreted as the core of a gene set that accounts for the enrichment signal.
ledge_rank: Ranks of genes in 'leadingEdge' in gene list L.

Additional columns are described under the 'result' slot of the feaResult object.
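How the NES and pvalue columns relate to a permutation null distribution can be sketched as follows. The observed ES and the null values here are simulated placeholders rather than output of tsea_mGSEA, and the pseudo-count of 1 in the p-value is an assumption borrowed from common permutation-test practice, not a confirmed detail of the package.

```r
# Hypothetical observed ES and a simulated null distribution standing in for
# the ES values obtained from nPerm permutations of the gene list L.
set.seed(1)
es_obs <- 0.6
es_null <- runif(1000, min = -0.5, max = 0.7)

# Normalize by the mean of the null ES values that share the sign of the observed ES.
same_sign <- es_null[sign(es_null) == sign(es_obs)]
nes <- es_obs / abs(mean(same_sign))

# Nominal p-value: fraction of same-sign null ES values at least as extreme as
# the observed ES (assumption: pseudo-count of 1 to avoid a p-value of zero).
pvalue <- (sum(abs(same_sign) >= abs(es_obs)) + 1) / (length(same_sign) + 1)
```

Nominal p-values computed this way for many gene sets would then be corrected with the method chosen via `pAdjustMethod`.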

GSEA algorithm: Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Mesirov, J. P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545-15550. URL: https://doi.org/10.1073/pnas.0506580102

feaResult, fea

library(signatureSearch)
data(drugs10)
## The calls below are left commented out, as in the original examples;
## uncomment them to run the enrichment analyses.
## GO annotation system
#res1 <- tsea_mGSEA(drugs=drugs10, type="GO", ont="MF", exponent=1,
#                   nPerm=1000, pvalueCutoff=1, minGSSize=5)
#result(res1)
## KEGG annotation system
#res2 <- tsea_mGSEA(drugs=drugs10, type="KEGG", exponent=1,
#                   nPerm=100, pvalueCutoff=1, minGSSize=5)
#result(res2)
## Reactome annotation system
#res3 <- tsea_mGSEA(drugs=drugs10, type="Reactome", pvalueCutoff=1)
#result(res3)