knitr::opts_chunk$set( collapse=TRUE, comment="#>", message=FALSE, warning=FALSE )
time <- Sys.time()
This vignette describes the general workflow for running
r Githubpkg("montilab/K2Taxonomer")
on gene expression data [@reed_2020].
While this vignette describes an analysis of bulk gene expression data, many of
these steps are shared with that of single-cell expression analyses. A vignette
for running r Githubpkg("montilab/K2Taxonomer")
on single-cell expression data
can be found
here.
## K2Taxonomer package library(K2Taxonomer) ## For example expression data library(Biobase) ## For drawing dendrograms library(ggdendro)
ExpressionSet
objectThe main input of r Githubpkg("montilab/K2Taxonomer")
is an ExpressionSet
object with approximately normally distributed expression values. Here we read
in an example data set. See ?sample.ExpressionSet for more information about
these data.
data(sample.ExpressionSet)
r Githubpkg("montilab/K2Taxonomer")
basic runK2
objectThe K2preproc()
initializes the K2
object and runs pre-processing steps.
Here, you can specify all arguments used throughout the analysis. Otherwise,
you can specify these arguments within the specific functions for which they
are implemented. See help pages for more information.
Here we will start by using the defaults.
## Run pre-processing K2res <- K2preproc(sample.ExpressionSet)
Note: This data set is microarray expression data. If using normalized and
log-transformed count data then it's generally recommended that you set
logCounts=TRUE . This will more appropriately perform differential gene
expression analyses with the r Biocpkg("limma")
R package [@limma].
## NOT RUN K2res <- K2preproc(sample.ExpressionSet, logCounts=TRUE)
K2
object structrer Githubpkg("montilab/K2Taxonomer")
functions create and manipulate S4, K2
objects, defined within the package. K2
objects have 5 slots.
K2eSet()
.K2data()
.r Githubpkg("montilab/K2Taxonomer")
algorithm, extracted by K2results
.K2info()
.r Githubpkg("montilab/K2Taxonomer")
algorithm, including differential
analysis and enrichment results, extracted by K2results()
K2genesets
.K2gene2Pathway
.K2geneURL()
.K2genesetURL()
.K2preproc()
will fill in three of these slots:, eSet, meta, and
dataMatrix.
## Extract ExpressionSet K2eSet(K2res) ## Extract dataMatrix dim(K2data(K2res))
meta will be populated by default a set of parameters entered into
K2preproc()
. Note that for each step, new parameters may be entered,
specific to that function.
## Extract meta data names(K2meta(K2res))
r Githubpkg("montilab/K2Taxonomer")
algorithmThe r Githubpkg("montilab/K2Taxonomer")
is run by K2tax()
. At each
recursion of the algorithm, the observations in dataMatrix are partitioned
into two sub-groups based on a compilation of repeated K=2 clustering on
bootstrapped sampling of features. For each partition in the recursion, a
stability metric is used to estimate robustness, which takes on values between
0 and 1, where values close to 1 represent the instance in which the same
clustering occured in every or nearly every perturbation of a large set of
observations. As the number of observations decreasing down the taxonomy the
largest possible stability estimate decreases, such that the largest possible
stability estimate of triplets and duplets, is 0.67 and 0.50, respectively.
The parameter, stabThresh, controls the minimum value of the stability
metric to continue partitioning the observations. By default, stabThresh
is set to 0, which will run the algorithm until until all observations fall
into singlets. If we set stabThresh=0.5 the algorithm can not separate
duplets, as well as larger sets that demonstrate poor stability when a
partition is attempted. This can also be set during initialization with
K2preproc()
.
Choosing an appropriate threshold is dependent on the size of the original data set. For large data sets, choosing small values will greatly increase runtime, and values between 0.8 and 0.7, are generally recommended.
## Run K2Taxonomer aglorithm K2res <- K2tax(K2res, stabThresh=0.5)
r Githubpkg("montilab/K2Taxonomer")
results structureThe results slot of the K2
object is a named list of the results of
running the r Githubpkg("montilab/K2Taxonomer")
algorithm. Each item in the
list contains the IDs of the two sets of observations denoted by the
partitions, as well as additional annotation for that partition, including
the stability estimates.
Each item is assigned a letter ID. If there are more than 26 items, than an additional letter is tacked on, such that the 27th item is named, AA, and the 28th item is named BB.
## Labels of each partition names(K2results(K2res))
## Get observations defined in each partition K2results(K2res)$A$obs
Stability metrics measure the extant to which pairs of observations in the data setpartitioned together for a given partition of the data.
## Get bootstrap probability of this partition K2results(K2res)$A$bootP ## Node stability (Value between 0 and 1. 1 indicating high stability) K2results(K2res)$A$stability$node ## Stability of each subseqent subgroup. K2results(K2res)$A$stability$clusters ## Distance matrix indicating pairwise cosine similarity of each observation partitionAstability <- as.matrix(K2results(K2res)$A$stability$samples) ## Show heatmap hcols <- c("grey", "black")[colnames(partitionAstability) %in% K2results(K2res)$A$obs[[1]] + 1] heatmap(partitionAstability, distfun=function(x) as.dist(1-x), hclustfun=function(x) hclust(x, method="mcquitty"), col=hcl.colors(12, "Blue-Red"), scale="none", cexRow=0.7, cexCol=0.7, symm=TRUE, ColSideColors=hcols, RowSideColors=hcols, keep.dendro=FALSE)
r Githubpkg("montilab/K2Taxonomer")
resultsAfter running the r Githubpkg("montilab/K2Taxonomer")
algorithm, a
dendrogram
object can be built using K2dendro()
.
## Get dendrogram from K2Taxonomer dendro <- K2dendro(K2res) ## Get dendrogram data ggdendrogram(dendro)
r Githubpkg("montilab/K2Taxonomer")
resultsA principal motivation behind r Githubpkg("montilab/K2Taxonomer")
is that
insight can be garnered and tracjed for various levels in a hierarchical
sub-grouping of obervations, such that analytical characterization of
different sub-groups throughout the taxonomy serve to both validate and
inform about specific sub-groups.
To this end, r Githubpkg("montilab/K2Taxonomer")
runs additional analyses to
annotate each partition. These analyses include: differential expression
analysis using the r Biocpkg("limma")
package and gene set enrichment testing
for both hyperenrichment and differntial analysis of singe-sample enrichment
scores estimated using the r Biocpkg("GSVA")
package [@limma][@gsva].
Differential analysis in r Githubpkg("montilab/K2Taxonomer")
is performed
using r Biocpkg("limma")
. Differential analysis is performed for each
partition of the data, comparing the gene expression between each pair
of subgroups across all partitions.
## Run DGE K2res <- runDGEmods(K2res) ## Get differential results for one partition head(K2results(K2res)$A$dge) ## Concatenate all results and generate table DGEtable <- getDGETable(K2res) head(DGEtable)
See ?getDGETable for a description of the columns in this data frame.
In this example, we are running r Githubpkg("montilab/K2Taxonomer")
without
information specifying a set of observations to treat as a control group, for
which to make comparisons in regards to the directionality of differential
expression. In the absence of this designation,
r Githubpkg("montilab/K2Taxonomer")
assigns genes to the partition-specific
subgroup with the larger mean. Accordingly, the of coeficient (the "coef" in
the output of getDGETable()
) is always positive, indicating the difference of
mean of the gene expression of the assigned subgroup from the other subgroup.
Moreover, the "direction" of this gene will always be assigned to up.
Following the full description of this analysis, we describe methods for
running r Githubpkg("montilab/K2Taxonomer")
, specifying a cohort variable
in the data set. When including this variable, the user may assign one of the
levels of this variable as the control group. With the inclusion of this
designation, r Githubpkg("montilab/K2Taxonomer")
will assign directionality
to the differences in gene expression.
r Githubpkg("montilab/K2Taxonomer")
performed both hyperenrichment and
differential single-sample enrichment tests based on an input of gene sets
using the genesets argument, which can be specified in runGSEmods()
,
as demonstrated here, or during initialization with K2preproc()
. Here, the
gene sets are made up.
Gene set hyperenrichment is performed on the top genes for each subgroup, as
defined by differential analysis with the runGSEmods()
. The arguments
qthresh and cthresh specify thresholds for defining top genes, based on
FDR Q-value and log fold-change, respectively. Here, qthresh=0.1, such
that genes with FDR Q-value < 0.1 will be assigned as the top genes for a
given partition.
## Create dummy set of gene sets genes <- unique(DGEtable$gene) genesetsMadeUp <- list( GS1=genes[1:50], GS2=genes[51:100], GS3=genes[101:150]) ## Run gene set hyperenrichment K2res <- runGSEmods(K2res, genesets=genesetsMadeUp, qthresh=0.1)
Single-sample enrichment is performed in two steps. First, the
r Biocpkg("GSVA")
package is used to generate observation-level enrichment
scores, followed by differential analysis using the r Biocpkg("limma")
package. This is performed by the functions runGSVAmods()
and
runDSSEmods()
, respectively. Here, we can specify which enrichment algorithm
to use when running the r Biocpkg("GSVA")
functionality and the number of CPU
cores to use with the ssGSEAalg and ssGSEAcores arguments,
respectively. These can also be set in the K2preproc()
function. Note, that
the arguments shown are the defaults.
## Create expression matrix with ssGSEA(GSVA) estimates K2res <- runGSVAmods(K2res, ssGSEAalg="gsva", ssGSEAcores=1, verbose=FALSE) ## Extract ernichment score Expression Set object K2gSet(K2res)
## Run differential analysis on enrichment score Expression Set K2res <- runDSSEmods(K2res) ## Extract table of all hyper and sample-level enrichment tests K2enrRes <- getEnrichmentTable(K2res) head(K2enrRes)
See ?getEnrichmentTable for a description of the columns in this data frame.
In addition to annotating subgroups based on gene and pathway activity,
r Githubpkg("montilab/K2Taxonomer")
includes functionality to perform
statistical testing of these subgroups based on known "phenotypic" information.
For example this data set includes three phenotypic variables, including two
binary variables, "sex" and "type", and a continuos variable,
"score".
print(str(pData(sample.ExpressionSet)))
Note: Unlike differential gene or pathway analysis, the phenotypic variable tests are run one-versus-all, comparing the members of an individual subgroup to all other observations.
For categorical variables, over-representation of a specific label in a given subgroup is evaluated using Fisher's Exact Test. For binary variables, like in this case, this may be run as a one-sided test, in such case the over-representation of only the second factor level in a given subgroup is evaluated. For example, as it is currently coded, a one-sided test of the "type" variable would only assess whether a subgroup had significantly greater observations labeled as "Case". Alternatively, the test can be run as two-sided, if subgroups of over-representation of "Control" observations is also of interest.
Note: For categorical variables with more than two levels, only two-sided tests are possible.
For continuous, statistical differences can be evalutated user either the parametric Students' T-test or the non-parametric Wilcoxon rank-sum test. As with categorical variables, these tests can be either one-sided (higher mean (T-test) or rank (Wilcox) in subgroup, or two-sided (higher or lower mean in node).
We specify the type of test to run as a named vector using the infoClass argument, in which the values of this vector indicates the type of test to run and the names indicate the variable for which to run the test. Value options are ...
An example of this vector is shown below, specifying the following tests:
infoClassVector <- c( sex="factor", type="factor1", score="numeric1")
After creating the named vector, phenotypic variable testing is performed using
the runTestsMods()
function.
## Run phenotype variable tests K2res <- runTestsMods(K2res, infoClass=infoClassVector) ## Get differential results for one partition head(K2results(K2res)$A$modTests) ## Concatenate all results and generate table modTestsTable <- getTestsModTable(K2res) head(modTestsTable)
See ?getTestsModTable for a description of the columns in this data frame.
r Githubpkg("montilab/K2Taxonomer")
can automatically generate an interactive
dashboard for which to parse through the differential analysis results. This
creates it's own sub-directory with a compilable .Rmd file.
For more information go here.
## NOT RUN K2dashboard(K2res)
runK2Taxonomer
wrapperAlternatively, K2Taxonomer can be run in one step with the runK2Taxonomer()
function. This function takes all of the same arguments as the K2preproc()
function, and runs all of the steps above.
K2res <- runK2Taxonomer( eSet=sample.ExpressionSet, genesets=genesetsMadeUp, infoClass=infoClassVector, stabThresh=0.5)
If there is replicates in the data or if the user is only interested in
capturing subgroup relationships between known groups without partitioning the
observations themselves, one can specify the cohorts variable of interest
in the ExpressionSet
object.
## Create set of cohorts from data sample.ExpressionSet$group <- paste0(sample.ExpressionSet$sex, sample.ExpressionSet$type) ## Run pre-processing K2res <- K2preproc(sample.ExpressionSet, cohorts="group") ## Cluster the data K2res <- K2tax(K2res) ## Create dendrogram dendro <- K2dendro(K2res) ggdendrogram(dendro)
When running with cohorts, r Githubpkg("montilab/K2Taxonomer")
includes the
option of treating one of the levels within the cohorts variable as a
control by specifying this value in the vehicle argument. In doing so, the
values of each of the other cohorts will be evaluated in relation to this value.
## Run pre-processing K2res <- K2preproc(sample.ExpressionSet, cohorts="group", vehicle="FemaleControl") ## Cluster the data K2res <- K2tax(K2res) ## Create dendrogram dendro <- K2dendro(K2res) ggdendrogram(dendro)
By specifying a control value, r Githubpkg("montilab/K2Taxonomer")
is able to
assess directionality to differential analyses in relation to the control
group, assigning direction of test statistics and gene assignments based on the
subgroup in which the difference of the mean is more significantly different
from the control group.
## Run DGE K2res <- runDGEmods(K2res) ## Get differential results for one partition head(K2results(K2res)$A$dge) ## Concatenate all results and generate table DGEtable <- getDGETable(K2res) head(DGEtable)
Accordingly, r Githubpkg("montilab/K2Taxonomer")
can now establish sets of
genes that are both up- and down-regulated within a specific subgroup. We can
now perform hyperenrichment with these additional gene sets.
## Run gene set hyperenrichment K2res <- runGSEmods(K2res, genesets=genesetsMadeUp, qthresh=0.1)
Finally, differential analysis of single-sample gene set enrichment scores is performed in the same fashion as with differential analysis of gene expression, assessing directionality of differences in enrichment scores based on comparisons to the assigned control group.
## Create expression matrix with ssGSEA(GSVA) estimates K2res <- runGSVAmods(K2res, ssGSEAalg="gsva", ssGSEAcores=1, verbose=FALSE) ## Run differential analysis on enrichment score Expression Set K2res <- runDSSEmods(K2res) ## Extract table of all hyper and sample-level enrichment tests K2enrRes <- getEnrichmentTable(K2res) head(K2enrRes)
Here, we see that, although not statistically significant ("fdr_limma"=0.25), the enrichment score of "GS2" deviated further from the control group in subgroup, "node"="B"; "edge"="2", compared to subgroup, "node"="B"; "edge"="1".
Missing values for hyperenrichment analysis indicated that no genes were assigned to these specific subgroups and direction combinations. Missing values for differential enrichment score analysis indicate that, while genes were assigned to these specific subgroups and direction combinations, the enrichment score of the gene set deviated further from control in the alternative subgroup for that partition.
For each of the four statistical testing steps: differential gene expression, gene set hyperenrichment, differential single-sample gene set enrichment, and phenotypic variable testing, multiple hypothesis correction is performed across the aggregate set of p-values across all partitions using the Benjamini-Hochberg FDR correction [@bh]. This correction is performed independently for each of the four testing steps.
vehicle != NULL with featMetric="F" or recalcDataMatrix=TRUE
Associations between the control cohort and remaining cohorts are generated prior to recursive partitioning. Either of these two argument specifications recalculate cohort-level statistics at each partition, many of which will not include the control cohort.
covariates != NULL with featMetric="F" or recalcDataMatrix=TRUE
Adjustments of gene expression values for covariates is performed prior to recursive partitioning. Either of these two argument specifications recalculate cohort-level statistics at each partition, some of which will not will not have any variation of this one or more of the covariates.
cohorts != NULL with infoClass != NULL
Due to additional statistical and computational considerations for modeling data cohort-level information, currently phenotypic variable testing can only be performed for observation-level analysis.
Sys.time() - time
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.