Summary

This vignette provides examples of how to use the package EpiCluster to perform unsupervised clustering and dimensional reduction of a (large) DNA methylation data matrix [@Tian2015]. In the EpiCluster package a variant of a non-negative matrix factorization (NMF) method, called Beta-Gamma NMF (bgNMF) [@Ma2014b], is used to perform the inference of latent variables in a manner which preserves the bounded (beta-valued) nature of the DNA methylation data. This unique feature of the bgNMF method, allows it to be combined with other tools such as the Recursive-Partitioning Beta Mixture Model (RPBMM) [@Houseman2008]. Specifically, RPBMM is very time-consuming on large DNA methylation datasets. By using bgNMF, we can dimensionally reduce the large data matrix into a smaller manageable number of beta-valued latent variables, over which we can subsequently apply RPBMM. Since bgNMF speeds up the dimensional reduction, the combined application of bgNMF+RPBMM achieves significant reductions in running times compared to the sole application of RPBMM. EpiCluster has been tested on several real DNA methylation datasets. Besides unsupervised clustering, one may use this package to detect latent variables, perform dimensional reduction as well as feature selection.

In this package, two S4 objects are defined to store the results of bgNMG and EpiCluster. The vignette describes the whole process of using EpiCluster, from the initial dimensional reduction and clustering, to the detection of interesting/significant latent variables, as well as the CpGs driving each variable. The package also provides a user-friendly function which can be used to generate artificial simulated data. In this vignette, we showcase the application of EpiCluster to both simulated as well as real data (GSE40279) [@Hannum2013].

Note that EpiCluster has several package dependencies, which need to be installed beforehand. These are: RPMM, Rlab, isva, pheatmap. One can use the following code to install them:

install.package("RPMM")
install.package("Rlab")
install.package("isva")
install.package("pheatmap")

And you can install EpiCluster in many ways: one can download our EpiCluster package from [github]{https://github.com/JoshuaTian/EpiCluster} and use code below in bash:

R CMD INSTALL EpiCluster_1.0.tar.gz

After installing, in the future you can load the package as:

library("EpiCluster")

and all dependencies will be loaded automatically. In this package, there are 7 functions included: GenSimData(), BGNMF(), EpiCluster(), EpiAnalysis(), NumericAnalysis(), CategoricalAnalysis and EpiDraw(). Normally, GenSimData() will not be used since its function is to generate simulated data. BGNMF() is the core function of EpiCluster, which should also not be modified. EpiAnalysis() function would employ NumericAnalysis() and CategoricalAnalysis() functions automatically. BGNMF() will return a list, which will be incorporated into the EpiCluster.output S4 object and is therefore accessible by the user. EpiCluster is the main user function and will perform the dimensional reduction using bgNMF and subsequent clustering with RPBMM. EpiAnalysis() uses the output from EpiCluster() to help find associations between the latent variables inferred using bgNMF and specific phenotypes. Similarly, EpiDraw() takes the EpiCluster output and uses it to generate a heatmap of the inferred clusters. It also allows phenotype information as input.

Simulation Data

The function GenSimData() can be used to generate some artificial simulated beta-valued data matrix. The user can assign the numbers of CpGs, the numbers different phenotypes, numbers of samples in each class/phenotype and the numbers of significantly differentially methylated CpGs (DMCs). We will use library pheatmap to represent the artificial simulation data generated by this function below.

# Here we generated a Artificial Simulation Data, which contain 10000 CpGs and 4 classes/phenotypes. Each class contains 20 samples, and 1000 class-specific DMCs will be generated.
SimData <- GenSimData(Ncpg=10000,Nsig=1000,Npheno=4,Nsample=20)
library("pheatmap")
pheatmap(SimData$beta)

Here we present three simulated datasets containing 3, 5, and 7 classes:

knitr::include_graphics("./Figure1.pdf")

EpiCluster

EpiCluster calls the core BGNMF() function and further uses RPBMM to cluster the basis matrix inferred from BGNMF(). The X parameter in EpiCluster(), as well as Data parameter in BGNMF() function, could also be ExpressionSet object, thus EpiCluster() and BGNMF() would automatically extract value matrix from it. There are many parameters that can be modified to control the EpiCluster output. The most salient parameters are:

We note that NA values are NOT allowed in the DNA methylation dataset, so NA's need to be removed or imputed beforehand. An example of an EpiCluster run is shown below on simulated data:

SimData <- GenSimData(Ncpg=1000,Nsig=100,Npheno=4,Nsample=30)
# Here we conducted EpiCluster on simulation generated above, we set number of iteration as 20, and EE parameter as 0.005, which is relatively big so more significant CpGs shall be find after EpiCluster. We did not assigned K value here so EpiCluster will automaticly detect number of latent viariables, based on Random Matrix Theory in isva pacakge.
EpiCluster.Result <- EpiCluster(SimData$beta,nIter=20,EE=0.005)

[1] "Iteration times:  20"
[1] "Components number:  4"
[1] "ee parameter:  0.005"
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] 18
[1] 19
[1] 20
[1] "Run BGNMF success!"
[1] "root"
[1] Inf
[1] 0.06998368
[1] 0.05706815
...
[1] 3.154375e-07
[1] 365.676
[1] "rL"
[1] Inf
[1] 0.06250856
[1] 0.005655091
[1] 2.249948e-09
[1] 368.4922
[1] "rLL"
[1] Inf
[1] 0.006686889
[1] 0.0100536
...
[1] 0.0002950578
[1] 208.3635
[1] "rLR"
[1] Inf
[1] 0.00893239
[1] 0.009745804
...
[1] 2.059788e-05
[1] 216.4498
[1] "rR"
[1] Inf
[1] 0.06133592
[1] 5.436303e-05
[1] 0
[1] 380.732
[1] "rRL"
[1] Inf
[1] 0.002652312
[1] 0.003085925
...
[1] 0.00085122
[1] 218.0377
[1] "rRR"
[1] Inf
[1] 0.004746239
[1] 0.007349823
...
[1] 217.5082
[1] "Run RPBMM success!"
The simulated data looks like below:
knitr::include_graphics("./Figure2.pdf")

Here we performed only 20 iterations with an EE value of 0.005. Let us now inspect the output of EpiCluster:

slotNames(EpiCluster.Result)
[1] "bgNMF" "Cluster" "betaRPMM"

Here 'bgNMG' is the output of the bgNMF algorithm, which contains the EE value used, the A, B and H matrices, which define the basis and excitation matrices, as well as X, which is the reconstructed data matrix. 'Cluster' is the clustering output label of the samples obtained by the application of RPBMM to the basis matrix inferred with bgNMF. 'betaRPMM' contains all the output from RPBMM, which is a blcTree S3 object. The user may turn to the RPMM pacakge to get more information about this object.

EpiAnalysis

After running EpiCluster, a typical task is to explore if there are correlations between the inferred clusters or latent variables and specific phenotypes. To this end we provide the EpiAnalysis() function, which can do two kinds of analyses:

The first analysis is to test if clusters from EpiCluster correlate with phenotypes. There is one important parameter threshold in this function, which controls the minimum size of clusters to be used in the analysis. Thus, clusters with less samples than this threshold will be ignored. For the remaining clusters, EpiAnalysis will conduct a Krustal-Wallis test and ANOVA Test for numeric covariates on estimated clusters. For categorical covariates, a Chisquare test will be conducted between covariates and clusters. An optional parameter maxlevel also allows the user to redefine the optimal number of clusters, for instance, assuming that all inferred clusters have a size smaller than the threshold.

In the second analysis, it will correlate each of the inferred latent variables to each covariate. P-values are estimated using a linear model. Thus, this returns a matrix of P-values with with rows labeling the components/latent variable and column labeling the covariate.

EpiAnalysis function also automatically employ NumericAnalysis() and CategoricalAnalysis() function to achieve analysis.

EpiAnalysis function will output many analysis result on screen, if you have many covariates in your Phenotypes file, it maybe hard to go though all result, thus we recommend user use sink() function to collect all output into a file.

# Here we only input one covariate into Phenotypes parameter because it's not easy to contruct multi-covariate simulation data. We will show a more comprehensive samples later. Also, in this example, we ignored parameter maxlevel here, which means all clusters from EpiCluster will be analysed.
EpiAnalysis.Result <- EpiAnalysis(EpiCluster.Result,PhenoTypes=SimData$pheno.v,threshold=10)

========= Only one Covariate in PhenoType =========
== All Analysis Result will be returned in output ==
--------------------- START ------------------------

=============== EpiCluster Result ===============
bgNMF Detected 4 Components.
bgNMF Detected 4 Clusters under maxlevel Inf .
There are 4 Clusters contain more than 10 Samples in it.

---------------------------------------
PhenoType is a categorical covariate.
     (1) ANOVA Test will be conducted between this covariate and 4 estimated components each.
     (2) Chisquare Test will be conducted for 4 clusters contain more than 10 samples.
---------------------------------------

(1) ANOVA test between each components to PhenoType
There are 4 Components show significance to PhenoType :

(2) Chisquare Test for 4 clusters on PhenoType
---------------------------------------
After Filering, there are 4 clusters(based on maxlevel Inf ) contain 10 or more Samples, we will only do Chisquare Test on these 4 clusters. Corresponding to    these clusters, we get 120 samples.
---------------------------------------
Chisquare Test:
                                  Clusters_Above_Threshold
Covariate_Corresponding_To_Cluster rLLL rLLR rLR rR
                                 1   30    0   0  0
                                 2    0    0   0 30
                                 3    0    0  30  0
                                 4    0   30   0  0

        Pearson's Chi-squared test

data:  table(Covariate_Corresponding_To_Cluster, Clusters_Above_Threshold)
X-squared = 360, df = 9, p-value < 2.2e-16

Let's take this discrete covariate into example. First, EpiAnalysis will detect if Phenotypes parameter received a list or a vector, then it will return numbers of Components and number of clusters estimated, also it will indicate how many clusters contain numbers of samples above threshold. First, EpiAnalysis wil calculate ANOVA test on each component to see if they separated this covariate. Than, it will show the contingency table of clusters and covariate and the Chisquared Test result of all clusters. After that, EpiAnalysis function will calculate Chisquare Test on this contingency table.

EpiDraw

This constitutes the last step of the EpiCluster package. The EpiDraw() function can help users draw nice plots of the basis matrix and clustering result. Again, maxlevel and threshold are two important parameters but can be ignored by using the default settings. Just as with the EpiAnalysis function, user may input multiple covariates as a list into a Phenotypes object, with all covariates being either a "factor", "character" or "numeric" variable in R.

EpiDraw(EpiCluster.Result,PhenoTypes=SimData$pheno.v)
knitr::include_graphics("./Figure4.pdf")

Here we can see the basis matrix, which contains 4 estimated components, is at the top of the plot, each row representing one estimated component and each columns representing one sample. Samples are ranked based on clustering result (the bottom color bar). Categorical covariates are shown as color bars on top. Below the heatmap, legend for both phenotype and clustering result are labeled. The barplot for each cluster, showing the distribution of the phenotype among the inferred clusters is shown below (for continuous covariates like age, this will be a boxplot). The colors in each bar (or boxplot) are exactly the same as color on phenotype color bar above heatmap. Thus, users can clearly infer the clustering result and classification effect of EpiCluster package. Note that, if the Phenotypes input object in this function is a list contain various covariates, each of these covariates will be plotted as a boxplot or barplot one by one (the function automatically detects which type of covariate we have).

Real Data Example

We here employ a real dataset (GSE40279) to further illustrate the use of the EpiCluster package. In this dataset there are 5 covariates in the phenotype object: Age, Source, Plate, Sex and Ethnicity. With the exception of age, all other 4 covariates are of a discrete/categorical nature. We are going to use it to demonstrate EpiAnalysis and EpiDraw function. Since GSE40279 is a 450K dataset, which contains too much CpG sites, if we directly run it with EpiCluster, it may cost more than 48 hours, thus we applied SVD to select most informative CpGs. After applying SVD, V matrix is a K$*$P matrix, where K represent latent variables and P represent CpGs, we select all top 5\% ranked CpGs in each components, and combined them into a unique CpG dataset. For 27K dataset, there is no need to do informative CpG selection.

```{bash eval=FALSE}

library("EpiCluster") ls()

[1] "Data" "PhenoTypes.lv"

```{bash eval=FALSE}
# There are totally 656 samples, we randomly print 20 to show Phenotypes for this dataset.
> PhenoTypes.lv[sample(656,20),]
          Age Source Plate Sex                 Ethn
GSM990392  55    USC     6   F   Hispanic - Mexican
GSM990037  63   Utah     3   F Caucasian - European
GSM989835  73   UCSD     1   F Caucasian - European
GSM989869  77   UCSD     1   F Caucasian - European
GSM989909  65   UCSD     1   F Caucasian - European
GSM990546  89   UCSD    11   M Caucasian - European
GSM989908  65   UCSD     1   F Caucasian - European
GSM990625  68   UCSD    11   M Caucasian - European
GSM990005  84   Utah     3   M Caucasian - European
GSM990520  47   UCSD     9   F   Hispanic - Mexican
GSM990286  58    USC     5   F   Hispanic - Mexican
GSM989846  60   UCSD     1   F Caucasian - European
GSM990393  47    USC     6   M   Hispanic - Mexican
GSM990589  78   UCSD    11   M Caucasian - European
GSM990430  57   UCSD     8   M Caucasian - European
GSM990303  54    USC     5   F   Hispanic - Mexican
GSM990304  60    USC     5   M   Hispanic - Mexican
GSM990490  33   UCSD     9   M   Hispanic - Mexican
GSM989921  85   Utah     2   M Caucasian - European
GSM990049  81   Utah     3   M Caucasian - European

For sake of demonstration, we next run EpiCluster on an informative subset of CpGs (for instance these could be selected according to variance across the samples):

```{bash eval=FALSE} EpiCluster.Result <- EpiCluster(Data,EE=0.05,nIter=60) [1] "Iteration times: 60" [1] "Components number: 76" [1] "ee parameter: 0.05" [1] 1 [1] 2 [1] 3 [1] 4 [1] 5 ...

Then, we run EpiAnalysis on EpiCluster Result:


```{bash eval=FALSE}
EpiAnalysis.Result <- EpiAnalysis(EpiCluster.Result,PhenoTypes=PhenoTypes.lv,maxlevel=3,threshold=10)

For list of Phenotypes contain multiple covariate, a p value matrix will fisrtly be generated and plotted. Then scatter plot will be generated between eacn components and age, also dot histograms will be generated between each Components and Plate, Source, Sex and Ethn. Also EpiAnalysis function output a lot of information based on different covariates:

```{bash eval=FALSE} ======= More Than one Covariate in PhenoType ====== === All covariate will be Analysis one by one === == All Analysis Result will be returned in output ==

--------------------- START ------------------------ =============== EpiCluster Result =============== bgNMF Detected 76 Components. bgNMF Detected 4 Clusters under maxlevel: 3 . There are 4 Clusters contain more than 10 Samples in it.

-------- Start Analysis for Each Covariate --------

>>>>>> Age <<<<<<

Age is a numeric covariate. (1) Correlation Test will be conducted between this covariate and 76 estimated components each. (2) Krustal Test/ANOVA will be conducted for 4 clusters contain more than 10 samples.


(1) Correlation between each components to Age There are 45 Components significantly Correlate with Age :

(2) Krustal Test/ANOVA for 4 clusters on Age

After Filering, there are 4 clusters(based on maxlevel 3 ) contain 10 or more Samples, we will only do Krustal Test and ANOVA on these 4 clusters. Corresponding to these clusters, we get 656 samples.

Krustal Test:

Kruskal-Wallis rank sum test

data: Clusters_Above_Threshold and Covariate_Corresponding_To_Cluster Kruskal-Wallis chi-squared = 89.99, df = 73, p-value = 0.08629

ANOVA Test: Call: aov(formula = Covariate_Corresponding_To_Cluster ~ Clusters_Above_Threshold)

Terms: Clusters_Above_Threshold Residuals Sum of Squares 9299.04 132947.15 Deg. of Freedom 3 652

Residual standard error: 14.27959 Estimated effects may be unbalanced Df Sum Sq Mean Sq F value Pr(>F)
Clusters_Above_Threshold 3 9299 3099.7 15.2 1.43e-09 *** Residuals 652 132947 203.9


Signif. codes: 0 '' 0.001 '' 0.01 '' 0.05 '.' 0.1 ' ' 1

>>>>>> Source <<<<<<

Source is a categorical covariate. (1) ANOVA Test will be conducted between this covariate and 76 estimated components each. (2) Chisquare Test will be conducted for 4 clusters contain more than 10 samples.


(1) ANOVA test between each components to Source There are 48 Components show significance to Source :

(2) Chisquare Test for 4 clusters on Source

After Filering, there are 4 clusters(based on maxlevel 3 ) contain 10 or more Samples, we will only do Chisquare Test on these 4 clusters. Corresponding to these clusters, we get 656 samples.

Chisquare Test: Clusters_Above_Threshold Covariate_Corresponding_To_Cluster rLL rLR rRL rRR Boston 6 9 13 7 UCSD 52 73 83 96 USC 8 14 47 70 Utah 48 68 38 24

Pearson's Chi-squared test

data: table(Covariate_Corresponding_To_Cluster, Clusters_Above_Threshold) X-squared = 87.567, df = 9, p-value = 5.004e-15


>>>>>> Plate <<<<<<

Plate is a categorical covariate. (1) ANOVA Test will be conducted between this covariate and 76 estimated components each. (2) Chisquare Test will be conducted for 4 clusters contain more than 10 samples.


(1) ANOVA test between each components to Plate There are 65 Components show significance to Plate :

(2) Chisquare Test for 4 clusters on Plate

After Filering, there are 4 clusters(based on maxlevel 3 ) contain 10 or more Samples, we will only do Chisquare Test on these 4 clusters. Corresponding to these clusters, we get 656 samples.

Chisquare Test: Clusters_Above_Threshold Covariate_Corresponding_To_Cluster rLL rLR rRL rRR 1 26 41 7 16 2 20 39 18 7 3 23 26 9 11 5 11 19 33 27 6 6 5 29 52 8 6 8 36 31 9 4 4 19 21 10 2 2 9 4 11 16 20 21 28

Pearson's Chi-squared test

data: table(Covariate_Corresponding_To_Cluster, Clusters_Above_Threshold) X-squared = 184.17, df = 24, p-value < 2.2e-16


>>>>>> Sex <<<<<<

Sex is a categorical covariate. (1) ANOVA Test will be conducted between this covariate and 76 estimated components each. (2) Chisquare Test will be conducted for 4 clusters contain more than 10 samples.


(1) ANOVA test between each components to Sex There are 18 Components show significance to Sex : (2) Chisquare Test for 4 clusters on Sex


After Filering, there are 4 clusters(based on maxlevel 3 ) contain 10 or more Samples, we will only do Chisquare Test on these 4 clusters. Corresponding to these clusters, we get 656 samples.

Chisquare Test: Clusters_Above_Threshold Covariate_Corresponding_To_Cluster rLL rLR rRL rRR F 55 68 101 114 M 59 96 80 83

Pearson's Chi-squared test

data: table(Covariate_Corresponding_To_Cluster, Clusters_Above_Threshold) X-squared = 11.637, df = 3, p-value = 0.008738


>>>>>> Ethn <<<<<<

Ethn is a categorical covariate. (1) ANOVA Test will be conducted between this covariate and 76 estimated components each. (2) Chisquare Test will be conducted for 4 clusters contain more than 10 samples.


(1) ANOVA test between each components to Ethn There are 45 Components show significance to Ethn :

(2) Chisquare Test for 4 clusters on Ethn

After Filering, there are 4 clusters(based on maxlevel 3 ) contain 10 or more Samples, we will only do Chisquare Test on these 4 clusters. Corresponding to these clusters, we get 656 samples.

Chisquare Test: Clusters_Above_Threshold Covariate_Corresponding_To_Cluster rLL rLR rRL rRR Caucasian - European 93 136 100 97 Hispanic - Mexican 21 28 81 100

Pearson's Chi-squared test

data: table(Covariate_Corresponding_To_Cluster, Clusters_Above_Threshold) X-squared = 65.963, df = 3, p-value = 3.121e-14


```r
names(EpiAnalysis.Result)
[1] "PMatrix"  "Analysis"

And EpiAnalysis will return analysis between components and each covariates, for example, in this case, there are totally 76 components(latent variables) detected by Random Matrix Theory, and there are totally 5 covariate. Thus there will be a 76$*$5 matrix(PMatrix) returned, between each components and each covariate. All analysis result between clusters and covariate will be saved in object Analysis as well.

Finally, we run EpiDraw() on EpiCluster Result to show boxplot and barplot for covariates:

EpiDraw(EpiCluster.Result,PhenoTypes=PhenoTypes.lv,maxlevel=3,threshold=10)
knitr::include_graphics("./Figure7.pdf")
knitr::include_graphics("./Figure8.pdf")

Note that for continues numeric phenotype or multiple-covariate list phenotype, no colorbar above heatmap will be generated. As we can see, four-color boxplots are corresponding to clustering colorbar under the heatmap, user could easily detect the correlation and relationship between samples and clustering result.

References



JoshuaTian/EpiCluster documentation built on May 20, 2019, 10:19 p.m.