MCbiclust is a R package for running massively correlating biclustering analysis. MCbiclust aims to find large scale biclusters with selected features being highly correlated with each other over a subset of samples. MCbiclust was particularly designed for the application of studying gene expression data, finding and understanding biclusters that are related to large scale co-regulation of genes.
Report issues on https://github.com/rbentham/MCbiclust/issues
Once installed MCbiclust can be loaded with the following command:
library(MCbiclust)
MCbiclust also makes sure that the packages BiocParallel, cluster, stats, GGally, ggplot2 and scales are all installed. It is also advised that the packages ggplot2 and gplots are separately installed and loaded.
library(ggplot2) library(gplots) library(dplyr) library(gProfileR) library(MASS) library(devtools)
For this example analysis we will be seeking to find biclusters related to mitochondrial function in the cancer cell line encyclopedia (@barretina2012cancer). For this two datasets are needed, both of which are available on the MCbiclust package. The first in CCLE_small
that contains a subset of the gene expression values found in the entire CCLE data set (the full dataset is avaliable at https://portals.broadinstitute.org/ccle/home), the second, Mitochondrial_genes
, is a list of mitochondrial genes that can be found from MitoCarta1.0 (@pagliarini2008mitochondrial).
data(CCLE_small) data(Mitochondrial_genes)
It is a simple procedure to create a new matrix CCLE.mito
only containing the mitochondrial genes. While there are $1023$ known mitochondrial genes, not all of these are measured in CCLE_data
.
mito.loc <- which(row.names(CCLE_small) %in% Mitochondrial_genes) CCLE.mito <- CCLE_small[mito.loc,]
The first step in using MCbiclust is to find a subset of samples that have the most highly correlating genes in the chosen gene expression matrix. This is done by, calculating the associated correlation matrix and then calculating the absolute mean of the correlations, as a correlation score.
Mathematically for a gene expression dataset measuring multiple gene probes across multiple samples, let \begin{equation} X = \textrm{Set of all probes, } Y = \textrm{Set of all samples} \end{equation} Then define two subsets of $X$ and $Y$, $I$ and $J$ repectively \begin{equation} I \subset X \textrm{ and } J \subset Y \end{equation} Subsets $I$ and $J$ form a bicluster on sets $X$ and $Y$, and the strength of this bicluster measured is based on measuring the correlations between pairs of probes in set $I$ across all samples in set $J$. The correlation between a probe $i \in I$ to a probe $k \in I$ across the samples in $J$ is denoted as $C_{i,k}^J$. Then the strength of the bicluster is measured as having a score $\alpha$ based on these correlations, defined as: \begin{equation} \alpha_I^J= \frac{1}{|I|^2}\sum_{i \in I} \sum_{k \in I} abs(C_{i,k}^J) \end{equation} where the function $abs()$ refers to the absolute value. In words the score $\alpha$ is the average of the absolute values of the gene-gene correlation matrix for gene-probe set $I$ across the samples in sample set $J$.
A high $\alpha_I^J$ value indicates that the probes in set $I$ are being strongly co-regulated across the samples in set $J$. As $\alpha_I^J$ is calculating using the absolute values of $C_{i,k}^J$, these probes could be in either in correlation or anti-correlation with each other.
MCbiclust
main aim is therefore to find sets of samples and genes that have a high $\alpha_I^J$ value. This is achieved by first finding a small sample "seed" containing relatively few samples but a very high $\alpha_I^J$ value,
This is achieved with function FindSeed
, initially a random subset of samples is chosen and then at each iteration one sample is removed and replaced and if this results in a higher $\alpha_I^J$ value than this new subset is chosen. In this function the argument gem
stands for gene expression matrix, seed.size
indicates the size of the subset of samples that is sought. iterations
indicates how many iterations of the algorithm to carry out before stopping. In general the higher the iterations the more optimal the solution in terms of maximising the strength of the correlation.
For reproducibility set.seed
has been used to set R's pseudo-random number generator. It should also be noted that the for gem
the data matrix can not contain all the genes, since FindSeed
involves the calculation of correlation matrices which are not computationally efficient to compute if they involve greater than ~1000 genes.
set.seed(102) CCLE.seed <- FindSeed(gem = CCLE.mito, seed.size = 10, iterations = 10000, messages = 1000)
CCLE.seed <- MCbiclust:::Vignette_seed
FindSeed
has one more additional options, initial.seed
allows the user to specify the initial subsample to be tested, by default the initial sample subset is randomly chosen.
There is a function CorScoreCalc
that can calculate the correlation score $\alpha_I^J$ directly, in general however you should not need to use it, unless you wish to manually check the chosen seed is an improvement on one that is randomly generated.
set.seed(103) random.seed <- sample(seq(length = dim(CCLE.mito)[2]), 10) CorScoreCalc(CCLE.mito,random.seed) CorScoreCalc(CCLE.mito,CCLE.seed)
The results of FindSeed
can also be visualised by examining the associated correlation matrix, and viewing the result as a heatmap. Again it is easy to see the difference between the random subsample and the one outputted from FindSeed
.
CCLE.random.cor <- cor(t(CCLE.mito[,random.seed])) heatmap.2(CCLE.random.cor,trace = "none")
CCLE.mito.cor <- cor(t(CCLE.mito[,CCLE.seed])) heatmap.2(CCLE.mito.cor,trace = "none")
Note that when the genes are represented as the rows in a matrix, that matrix needs to be transposed before the calculation of the correlation matrix.
heatmap.2
is a function from the gplots
R package.
As can be clearly seen from the heat map, not all the mitochondrial genes are equally strongly correlated to each other. There is a function in MCbiclust
which automatically selects those genes that are most strongly associated with the pattern. This function is HclustGenesHiCor
and it works by using hierarchical clustering to select the genes into n different groups, and then discarding any of these groups that fail to have a correlation score greater than the correlation score from all the genes together.
CCLE.hicor.genes <- as.numeric(HclustGenesHiCor(CCLE.mito, CCLE.seed, cuts = 8)) CCLE.mito.cor2 <- cor(t(CCLE.mito[CCLE.hicor.genes, CCLE.seed])) CCLE.heat <- heatmap.2(CCLE.mito.cor2,trace = "none")
There are two groups of genes, strongly correlated to themselves and anti-correlated to each other. These can be extracted from the dendrogram:
CCLE.groups <- list(labels(CCLE.heat$rowDendrogram[[1]]), labels(CCLE.heat$rowDendrogram[[2]]))
In this example a distinct correlation pattern was found. However this was only examined for genes involved in mitochondrial function. Non-mitochondrial genes are likely also involved in this pattern and it is important to identify them.
All genes can be measured by how they match to this pattern by calculating what is called a correlation vector (CV). This is done in two steps:
The pattern is summarised by finding a subset of genes which all strongly correlate with each other, and calculating their average expression value. This is done by clustering the genes using hierarchical clustering and selecting the best group judged by that groups correlation score multiplied by the square root of the number of genes. This multiplication is done to remove the bias of selecting a group containing very few genes.
The correlation vector is calculated by finding the correlation of every gene to the average expression value of the chosen best group.
This process is all encapsulated in the function CVEval
which takes 4 arguements. gem.part
is the gene expression matrix for the chosen gene set of interest, e.g. mitochondrial genes, gem.all
is the entire gene expression matrix, seed
is the output from FindSeed
and splits
is the number of groups to split the chosen gene set into in order to select the best group.
CCLE.cor.vec <- CVEval(gem.part = CCLE.mito, gem.all = CCLE_small, seed = CCLE.seed, splits = 10)
Using the calculated correlation vector, it is a relatively simple task to perform gene set enrichment. This can be done on any platform (e.g. DAVID, gprofiler, etc.) but MCbiclust comes with an inbuilt function for calculating GO enrichment values using the Mann-Whitney non-parametric test.
This is achieved with the GOEnrichmentAnalysis
function which takes three inputs:
gene.names
: The names of the genes in standard format.
gene.values
: The correlation vector.
sig.rate
: The value below which adjusted p-values are decided to be significant.
The output is a table with 7 columns:
GOID
: ID for GO term.
TERM
: Name of GO term.
num.genes
: Number of genes in GO term.
g.in.genelist
: Number of genes in GO term that were measured in the gene expression matrix.
adj.p.value
: Adjusted p-value from Mann-Whitney test.
CV.av.value
: Average value of CV for genes in GO term.
phenotype
: +1 if CV.av.value
is greater than the overall CV average, -1 if the CV.av.value
is less then the overall CV average.
GSE.MW <- GOEnrichmentAnalysis(gene.names = row.names(CCLE_small), gene.values = CCLE.cor.vec, sig.rate = 0.05)
GSE.MW <- MCbiclust:::Vignette_GSE
There are 76 significant terms and the top 10 most significant can be viewed below:
row.names(GSE.MW) <- NULL pander::pandoc.table(GSE.MW[1:10,],row.names = FALSE)
Since CCLE_small
is half made up of mitochondrial genes and we were seeking for mitochondrial related biclusters it is not surprising that mitochondrial terms dominate the gene set enrichment. If MCbiclust
was run on the full CCLE gene expression data set it would be expected to see many more significant non-mitochondrial related terms.
An alternative to using the GOEnrichmentAnalysis
function would be to use a separate gene set enrichment method such as gprofiler
, this can be done by for instance selecting the top 200 genes with positive CV values:
top200 <- row.names(CCLE_small)[order(CCLE.cor.vec, decreasing = TRUE)[seq(200)]] # top200.gprof <- gprofiler(top200) # dim(top200.gprof)
# pander::pandoc.table(top200.gprof[seq(10),-c(1,2,7,8,11,14)], # row.names = FALSE)
Already all the genes in the data set have had the correlation calculated to the pattern found. One more task that can be readily done is to order the samples according to the strength of correlation. Function FindSeed
found the initial $n$ samples that had a very strong correlation with the gene set of interest, the $n+1$ sample is to be selected as that sample which best maintains the correlation strength, this process can be simply repeated until all or the desired number of samples are ordered.
SampleSort
is the function in MCbiclust
that completes this procedure, it has $4$ main inputs:
gem
: the gene expression matrix with all the samples and the gene set of interest.
seed
: the initial subsample found with FindSeed
.
num.cores
: Used for setting the number of cores used in calculation, default value is to use one core.
sort.length
: Sets the number of samples to be ordered.
CCLE.samp.sort <- SampleSort(CCLE.mito[as.numeric(CCLE.hicor.genes),], seed = CCLE.seed)
Note as before that these are long calculations, and may take some time.
CCLE.samp.sort <- MCbiclust:::Vignette_sort[[1]]
Note that SampleSort is a very computationally expensive function and requires time to run. For a large dataset such as the CCLE data it is advisable to either calculate a partial ordering, which can be done with the sort.length
arguement or submit the job of sorting the samples to a high performance computing facility.
Once the samples have been sorted it is possible to summarise the correlation pattern found using principal component analysis (PCA).
PCA is a method of dimensional reduction, and converts a data set to a new set of variables known as the principal components. These are designed to be completely uncorrelated or orthogonal to each other. In this way the principal components are new variables that capture the correlations between the old variables, and are in fact a linear combination of the old variables. The first principal component (PC1) is calculated as the one that explains the highest variance within the data, the second than is that which has the highest variance but is completely uncorrelated or orthogonal to the previous principal component. In this way additional principal components are calculated until all the variance in the data set is explained.
PC1 captures the highest variance within the data, so if PCA is run on the found bicluster with very strong correlations between the genes, PC1 will be a variable that summarises this correlation.
PC1VecFun
is a function that calculates the PC1 values for all sorted samples. It takes three inputs:
1.top.gem
is the gene expression matrix with only the most highly correlated genes but with all the sample data.
seed.sort
is the sorting of the data samples found with function SampleSort
n
is the number of samples used for initially calculating the weighting of PC1. If set to $10$, the first $10$ samples are used to calculate the weighting of PC1 and then the value of PC1 is calculated for all samples in the ordering.
top.mat <- CCLE.mito[as.numeric(CCLE.hicor.genes),] pc1.vec <- PC1VecFun(top.gem = top.mat, seed.sort = CCLE.samp.sort, n = 10)
So far MCbiclust outputs a ranked list of genes and samples. In many cases it is however necessary to determine which genes and samples are within the bicluster and which are not. This is done with the ThresholdBic
function, which takes $4$ arguements:
cor.vec
: The correlation vector, output of CVeval
.
sort.order
: The sorted samples, output of SampleSort
.
pc1
: The PC1 vector, output of PC1VecFun
samp.sig
: A numeric value between 0 and 1 that detemines the number of samples in the bicluster.
The genes in the bicluster are determined using kmeans clustering, and dividing the genes into two clusters based on the absolute value of the correlation vector, choosing one correlated and one uncorrelated groups.
The samples are however chosen based on the last 10% of the ranked samples, these samples are assumed to not belong to the bicluster and the first sample with a PC1 value between the $0 + samp.sig/2$ and $1 - samp.sig/2$ quantiles, and every sample after that is not in the bicluster.
CCLE.bic <- ThresholdBic(cor.vec = CCLE.cor.vec, sort.order = CCLE.samp.sort, pc1 = pc1.vec, samp.sig = 0.05)
Once this thresholded bicluster has been found it is important to properly align the PC1 vector and the correlation vector such that samples with a high PC1 values are those samples with up-regulated genes that have positive CV values. This is not strictly necessary to do, but makes the interpretation of MCbiclust simpler.
This is done with function PC1Align
which if necessary times the pc1.vec by -1 to ensure that the correlation vector and PC1 vector are "aligned".
pc1.vec <- PC1Align(gem = CCLE_small, pc1 = pc1.vec, sort.order = CCLE.samp.sort, cor.vec = CCLE.cor.vec, bic = CCLE.bic)
As an alternative to calculating PC1, the user may want to calculate the average expression value of certain gene sets. This gives a better idea of the type of regulation occurring in the correlation pattern, as an abstract notion of a principal component does not have to be understood.
av.genes.group1 <- colMeans(CCLE.mito[CCLE.groups[[1]], CCLE.samp.sort]) av.genes.group2 <- colMeans(CCLE.mito[CCLE.groups[[2]], CCLE.samp.sort])
Once the samples have been ordered and PC1 and the average gene sets calculated it is a simple procedure to produce plots of these against the ordered samples.
One final additional thing that can be done is to classify the samples into belonging to the bicluster or not, and additionally whether a sample belongs to the Upper or Lower fork. This can be done with the function ForkClassifier
To produce the plots of the forks the ggplot2
package is used.
CCLE.names <- colnames(CCLE_small)[CCLE.samp.sort] fork.status <- ForkClassifier(pc1.vec, samp.num = length(CCLE.bic[[2]])) CCLE.df <- data.frame(CCLE.name = CCLE.names, PC1 = pc1.vec, Fork = fork.status, Average.Group1 = av.genes.group1, Average.Group2 = av.genes.group2, Order = seq(length = length(pc1.vec))) ggplot(CCLE.df, aes(Order,PC1)) + geom_point(aes(colour = Fork)) + ylab("PC1") ggplot(CCLE.df, aes(Order,Average.Group1)) + geom_point(aes(colour = Fork)) + ylab("Average Group 1") ggplot(CCLE.df, aes(Order,Average.Group2)) + geom_point(aes(colour = Fork)) + ylab("Average Group 2")
This by itself however is not particularly enlightening and to get additional information out of these plots supplementary information needs to be examined.
This section will deal with an addition data sets both of which are available in the MCbiclust package.
This section is meant as an example of the type of analysis that can be done with additional data set. Each new data set may have different additional data available with it and may be in formats that need some extra work to become compatible with the results from the MCbiclust
analysis.
This data set is available within the MCbiclust
package.
data(CCLE_samples)
In this case some samples have an additional "X" not present in some CCLE_samples data so it is necessary to add it for consistency.
CCLE.samples.names <- as.character(CCLE_samples[,1]) CCLE.samples.names[c(1:15)] <- paste("X",CCLE.samples.names[c(1:15)], sep="") CCLE_samples$CCLE.name <- CCLE.samples.names
The first step is to compare the column names of both data sets and to make sure we are dealing with the same correctly labeled samples.
rownames(CCLE_samples) <- as.character(CCLE_samples[,1]) CCLE.data.names <- colnames(CCLE_small) CCLE_small_samples <- CCLE_samples[CCLE.data.names,]
Using the dplyr
library, it is possible to join this new data set to the one we made for plotting the values of PC1 in the previous section. This can be easily done as both datasets share a column - the name of the samples. Once this is done, it is again simple to produce additional plots.
CCLE.df.samples <- inner_join(CCLE.df,CCLE_samples,by="CCLE.name") ggplot(CCLE.df.samples, aes(Order,PC1)) + geom_point(aes(colour=factor(Site.Primary))) + ylab("PC1")
In this case the figure is slightly confusing due to the number of factors. We can however rename factors that appear less than 30 times in total as "Other".
rare.sites <- names(which(summary(CCLE.df.samples$Site.Primary) < 15)) CCLE.df.samples$Site.Primary2 <- as.character(CCLE.df.samples$Site.Primary) rare.sites.loc <- which(CCLE.df.samples$Site.Primary2 %in% rare.sites) CCLE.df.samples$Site.Primary2[rare.sites.loc] <- "Other" ggplot(CCLE.df.samples, aes(Order,PC1)) + geom_point(aes(colour=factor(Site.Primary2))) + ylab("PC1")
ggplot(CCLE.df.samples, aes(Order,PC1)) + geom_point(aes(colour=factor(Gender))) + ylab("PC1")
Since in this case the data is categorical, it can be tested for significance using Pearson's chi squared test.
library(MASS) # create contingency tables ctable.site <- table(CCLE.df.samples$Fork, CCLE.df.samples$Site.Primary) ctable.gender <- table(CCLE.df.samples$Fork, CCLE.df.samples$Gender, exclude = "U") chisq.test(ctable.site) chisq.test(ctable.gender)
As was easily apparent from examining the plots, the primary site the cell line is derived from is highly significant, while gender is not.
MCbiclust
is a stochastic method so for best results it needs to be run multiple times, in practice this means using high-performance computing the run the algorithm on a computer cluster which will be dealt with in a later section. Here however the task of dealing with the results will be looked at. The algorithm will be run $100$ times with only $500$ iterations each. Typically more iterations are required, but for this demonstration it will be sufficient.
CCLE.multi.seed <- list() initial.seed1 <- list() for(i in seq(100)){ set.seed(i) initial.seed1[[i]] <- sample(seq(length = dim(CCLE_small)[2]),10) CCLE.multi.seed[[i]] <- FindSeed(gem = CCLE_small[c(501:1000), ], seed.size = 10, iterations = 500, initial.seed = initial.seed1[[i]]) }
CCLE.multi.seed <- MCbiclust:::Vignette_multi_seed initial.seed1 <- MCbiclust:::Vignette_initial_seed
The associated correlation vector must also be calculated for each run and these correlation vectors can be put into a matrix.
CCLE.cor.vec.multi <- list() for(i in seq(100)){ CCLE.cor.vec.multi[[i]] <- CVEval(gem.part = CCLE_small[c(501:1000), ], gem.all = CCLE_small, seed = CCLE.multi.seed[[i]], splits = 10) }
CCLE.cor.vec.multi <- MCbiclust:::Vignette_multi_cv
len.a <- length(CCLE.cor.vec.multi[[1]]) len.b <- length(CCLE.cor.vec.multi) multi.run.cor.vec.mat <- matrix(0,len.a,len.b) for(i in 1:100){ multi.run.cor.vec.mat[,i] <- CCLE.cor.vec.multi[[i]] } rm(CCLE.cor.vec.multi)
A correlation matrix can be formed from the correlation vectors, and in this way they can be viewed as a heatmap.
CV.cor.mat1 <- abs(cor((multi.run.cor.vec.mat))) cor.dist <- function(c){as.dist(1 - abs(c))} routput.corvec.matrix.cor.heat <- heatmap.2(CV.cor.mat1, trace="none", distfun = cor.dist)
It needs to be known how many distinct patterns have been found, this is done with clustering and particular silhouette coefficients to judge what number of clusters is optimum within the data. Function SilhouetteClustGroups
achieves this and uses hierarchical clustering to split the patterns into clusters, for comparison a randomly generated correlation vector is also added to allow for the possibility that all patterns found are best grouped into a single cluster.
multi.clust.groups <- SilhouetteClustGroups(multi.run.cor.vec.mat, max.clusters = 20, plots = TRUE,rand.vec = FALSE)
Here two clusters were found, and we can visualise this pattern (and any additional others found) with the function CVPlot
, which highlights a chosen gene set, in this case the mitochondrial genes.
gene.names <- row.names(CCLE_small) av.corvec.fun <- function(x) rowMeans(multi.run.cor.vec.mat[,x]) average.corvec <- lapply(X = multi.clust.groups, FUN = av.corvec.fun) CVPlot(cv.df = as.data.frame(average.corvec), geneset.loc = mito.loc, geneset.name = "Mitochondrial", alpha1 = 0.1)
As before can also calculate the gene set enrichment.
GOfun <- function(x) GOEnrichmentAnalysis(gene.names = gene.names, gene.values = x, sig.rate = 0.05)
corvec.gsea <- lapply(X = average.corvec, FUN = GOfun)
corvec.gsea <- MCbiclust:::Vignette_multi_gsea
Before using SampleSort
a special prep function, MultiSampleSortPrep
is used to generate the gene expression matrix and top seed for each found bicluster. The gene expression matrix is composed of the top $n$ genes in the correlation vector, and the seed is chosen as the calculated seed that has the maximum correlation score.
CCLE.samp.multi.sort <- list() multi.prep <- MultiSampleSortPrep(gem = CCLE_small, av.corvec = average.corvec, top.genes.num = 750, groups = multi.clust.groups, initial.seeds = CCLE.multi.seed)
CCLE.samp.multi.sort[[1]] <- SampleSort(gem = multi.prep[[1]][[1]], seed = multi.prep[[2]][[1]]) CCLE.samp.multi.sort[[2]] <- SampleSort(gem = multi.prep[[1]][[2]], seed = multi.prep[[2]][[2]])
Note as before that these are long calculations.
CCLE.samp.multi.sort <- list() CCLE.samp.multi.sort[[1]] <- MCbiclust:::Vignette_sort[[2]][[1]] CCLE.samp.multi.sort[[2]] <- MCbiclust:::Vignette_sort[[2]][[2]]
These two biclusters can now be analysed in the same way as the single bicluster before.
To calculate the PC1 values:
pc1.vec.multi <- list() pc1.vec.multi[[1]] <- PC1VecFun(top.gem = multi.prep[[1]][[1]], seed.sort = CCLE.samp.multi.sort[[1]], n = 10) pc1.vec.multi[[2]] <- PC1VecFun(top.gem = multi.prep[[1]][[2]], seed.sort = CCLE.samp.multi.sort[[2]], n = 10)
These new biclusters can also be thresholded as follows:
CCLE.bic.multi <- list() CCLE.bic.multi[[1]] <- ThresholdBic(cor.vec = average.corvec[[1]], sort.order = CCLE.samp.multi.sort[[1]], pc1 = pc1.vec.multi[[1]], samp.sig = 0.05) CCLE.bic.multi[[2]] <- ThresholdBic(cor.vec = average.corvec[[2]], sort.order = CCLE.samp.multi.sort[[2]], pc1 = pc1.vec.multi[[2]], samp.sig = 0.05) pc1.vec.multi[[1]] <- PC1Align(gem = CCLE_small, pc1 = pc1.vec.multi[[1]], sort.order = CCLE.samp.multi.sort[[1]], cor.vec = average.corvec[[1]], bic = CCLE.bic.multi[[1]]) pc1.vec.multi[[2]] <- PC1Align(gem = CCLE_small, pc1 = pc1.vec.multi[[2]], sort.order = CCLE.samp.multi.sort[[2]], cor.vec = average.corvec[[2]], bic = CCLE.bic.multi[[2]])
In a similar way to before the forks for these new biclusters can be plotted:
CCLE.multi.df <- data.frame(CCLE.name = colnames(CCLE_small), Bic1.order = order(CCLE.samp.multi.sort[[1]]), Bic2.order = order(CCLE.samp.multi.sort[[2]]), Bic1.PC1 = pc1.vec.multi[[1]][order(CCLE.samp.multi.sort[[1]])], Bic2.PC1 = pc1.vec.multi[[2]][order(CCLE.samp.multi.sort[[2]])]) CCLE.multi.df.samples <- inner_join(CCLE.multi.df,CCLE_samples,by="CCLE.name") rare.sites <- names(which(summary(CCLE.multi.df.samples$Site.Primary) < 15)) CCLE.multi.df.samples$Site.Primary2 <- as.character(CCLE.multi.df.samples$Site.Primary) rare.sites.loc <- which(CCLE.multi.df.samples$Site.Primary2 %in% rare.sites) CCLE.multi.df.samples $Site.Primary2[rare.sites.loc] <- "Other" ggplot(CCLE.multi.df.samples, aes(Bic1.order,Bic1.PC1)) + geom_point(aes(colour=factor(Site.Primary2))) + ylab("Bic1 PC1") ggplot(CCLE.multi.df.samples, aes(Bic2.order,Bic2.PC1)) + geom_point(aes(colour=factor(Site.Primary2))) + ylab("Bic2 PC1")
One final thing that can be done is to compare all 3 correlation vectors found
cv.df <- as.data.frame(average.corvec) cv.df$Mito1 <- CCLE.cor.vec CVPlot(cv.df,cnames = c("R1","R2","M1"), geneset.loc = mito.loc, geneset.name = "Mitochondrial", alpha1 = 0.1)
It is immediately apparent that the one of the biclusters found from the random gene set is very similar to that of the mitochondrial based bicluster.
Once a bicluster has been identified it might be desired to find samples that have the same regulation pattern in different data sets. One option would be to run MCbiclust independently on these data sets and attempt to find a similar bicluster based on the correlation vector values. This however is not always practical, for example if the data set in question contains relatively few samples.
There are however a few possible methods to score samples based on a known correlation pattern which are described here.
The PointScore can be calculated using the CCLE.mito data set as follow, this utilises the differently regulated group uncovered with HclustGenesHiCor
and the dendrogram produced from using heatmap.2
.
gene.loc1 <- which(row.names(CCLE.mito[CCLE.hicor.genes,]) %in% CCLE.groups[[1]]) gene.loc2 <- which(row.names(CCLE.mito[CCLE.hicor.genes,]) %in% CCLE.groups[[2]]) CCLE.ps <- PointScoreCalc(CCLE.mito[CCLE.hicor.genes,], gene.loc1, gene.loc2)
The PointScore can be directly compared with the PC1 values
CCLE.df$PointScore <- CCLE.ps[CCLE.samp.sort] ggplot(CCLE.df, aes(Order,PC1)) + geom_point(aes(colour = Fork)) + ylab("PC1") ggplot(CCLE.df, aes(Order,PointScore)) + geom_point(aes(colour = Fork)) + ylab("PointScore")
For single samples an alternative method must be used. One described in literature is that of single sample GSEA or ssGSEA (@barbie2009systematic). This can be calculated with the R package GSVA
(@hanzelmann2013gsva), the ssGSEA score of the two gene sets can be calculated, the final ssGSEA of these two groups combines is the mean of the first group and the negative value of the second group.
library(GSVA) ssGSEA.test <- gsva(expr = as.matrix(CCLE.mito[CCLE.hicor.genes,]), gset.idx.list = CCLE.groups, method = 'gsva', parallel.sz = 1) ssGSEA.test[2,] <- -ssGSEA.test[2,] CCLE.ssGSEA <- colMeans(ssGSEA.test)
Similarly to the PointScore it can be directly compared to the PC1 values.
CCLE.df$ssGSEA <- CCLE.ssGSEA[CCLE.samp.sort] ggplot(CCLE.df, aes(Order, PC1)) + geom_point(aes(colour = Fork)) + ylab("PC1") ggplot(CCLE.df, aes(Order, ssGSEA)) + geom_point(aes(colour = Fork)) + ylab("ssGSEA")
Note that while the PointScore created a very clean fork, the ssGSEA was much noisier. In general the bigger the data set the more accurate the PointScore will be, while ssGSEA is useful for analysing very small datasets or even single samples.
devtools::session_info()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.