knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

MISKmeans is short for meta-analytic integrative sparse Kmeans, which aims to integrate multiple omics data, multiple cohort data and incorporate prior biological knowledge to perform feature selection and sample clustering (discovery disease subtypes). MISKmeans R package can be found on GitHub page https://github.com/Caleb-Huo/MIS-Kmeans. Figure 1 below shows the layout for such two-way integration.

Figure 1, layout for omics two-way integration.

The input should be multiple cohorts, with each cohort containing one or more omics profiles (e.g. gene expression, CNV, methylation...). The features (genes) from one or more omics profiles are group by prior biological knowledge. Usually the prior biological knowledge could be the cis regulatory relationship between mRNA, methylation and CNV of the same gene symbol; or a biological pathway including a collection of functional genes which can potentially overlap with other pathways. An illustration of such group in is shown in Figure 2. Figure 2(A): A group contains gene expression, CNV and methylation of the same gene symbol. Figure 2(B): A group is a pathway, which is a collection of genes (e.g. cell cycle pathway).}

Figure 2, group structure.

The MISKmeans algorithm is an extension from previous works meta sparse Kmeans (metaSparseKmeans) https://github.com/Caleb-Huo/MetaSparseKmeans and integrative sparse Kmeans (ISKmeans) https://github.com/Caleb-Huo/IS-Kmeans. The MetaSparseKmeans is about combining multiple cohorts of the same omics data type (e.g. gene expression data from multiple cohorts as shown in the red dashed rectangles I in the figure above) and perform a joint feature selection and sample clustering. The ISKmeans is about integrating multiple levels of omics data (e.g. multiple levels of omics data of the same patient cohort as shown in the green dashed rectangles II in the figure above). The MISKmeans is a combination of MetaSparseKmeans and ISKmeans, which is among the first to acheive two-way omics data integration.

In the next section, we will provide a concrete example how to use MISKmeans package. The step by step procedure will guide the users how to prepare the data and how to use the package.

Usage of MISKmeans package

Outline

We will use the multi omics profile of TCGA breast cancer data as an illustrating example to show how to utilize this MISKmeans package. The following steps are included.

  1. What is the input data format.
  2. An example on the integration of gene expression data + CNV data.
  3. An example on the integration of gene expression data + methylation data.
  4. An example on the integration of gene expression data + CNV + methylation data.
  5. Other issues on number of clusters and tuning parameter selection.

Input data format

The input data should be multi-cohort, multiple levels of omics data, as the format of data matrix, with common sample names for each layer of omics data. In this example, the features names in different omics level overlap, which defines the grouping information. For example, the expression profile, CNV profile and methylation profile all contain gene "A2M", these three features from different omics profile will form a group representing the cis-regulatory relationship of the same gene symbol. We will demonstrate how to prepare such group information to feed the MISKeamns in later examples. Another commonly used group information is biological pathway information, where a group of gene features are formed by a pathway name. In this vignette, we will integrate gene expression, CNV and methylation of TCGA breast cancer dataset from two studies. A simplified version of these omics profiles can be accessed by data(BRCA_Expr_S1), dim(BRCA_CNV_S1) and dim(BRCA_Meth_S1) for the first study; data(BRCA_Expr_S2), dim(BRCA_CNV_S2) and dim(BRCA_Meth_S2) for the second study.

library(MISKmeans)
data("BRCA_Expr_S1") ## gene expression profile of study 1
data("BRCA_Expr_S2") ## gene expression profile of study 2
data("BRCA_CNV_S1") ## copy number variation profile of study 1
data("BRCA_CNV_S2") ## copy number variation profile of study 2
data("BRCA_Meth_S1") ## DNA methylation profile of study 1
data("BRCA_Meth_S2") ## DNA methylation profile of study 2

dim(BRCA_Expr_S1)
dim(BRCA_Expr_S2)
dim(BRCA_CNV_S1)
dim(BRCA_CNV_S2)
dim(BRCA_Meth_S1)
dim(BRCA_Meth_S2)

For study 1, there are 100 common subjects for each of the three layers of omics data. For study 2, there are 100 common subjects for each of the three layers of omics data. Across studies, study 1 and study 2 have the common gene symbols in each layer of omics data. Further, several these features are grouped together by the same gene symbol. How to prepare such groups in R will be introduced shortly in this vignette. The following code is to check the sample names are all the same.

## check the sample names for study 1
all(colnames(BRCA_Expr_S1) == colnames(BRCA_CNV_S1))
all(colnames(BRCA_Expr_S1) == colnames(BRCA_Meth_S1))

## check the sample names for study 2
all(colnames(BRCA_Expr_S2) == colnames(BRCA_CNV_S2))
all(colnames(BRCA_Expr_S2) == colnames(BRCA_Meth_S2))

## check feature names are consistent across studies.
all(rownames(BRCA_Expr_S1) == rownames(BRCA_Expr_S2))
all(rownames(BRCA_CNV_S1) == rownames(BRCA_CNV_S2))
all(rownames(BRCA_Meth_S1) == rownames(BRCA_Meth_S2))

An example on the integration of gene expression data + CNV data.

We first need to combine the profiles of Expr and CNV.

study1_combineEC <- rbind(BRCA_Expr_S1, BRCA_CNV_S1)
study2_combineEC <- rbind(BRCA_Expr_S2, BRCA_CNV_S2)
data_combineEC <- list(t(study1_combineEC), t(study2_combineEC))

Second we need to prepare the grouping information between Expr and CNV. The grouping information should be input as a list:

omicsTyps_EC <- c(rep('gene',nrow(BRCA_Expr_S1)),rep('CNV',nrow(BRCA_CNV_S1)))
featureNames_EC <- c(rownames(BRCA_Expr_S1),rownames(BRCA_CNV_S1))
groups_EC <- split(1:length(featureNames_EC),featureNames_EC)

Third step is to perform MISKmeans

set.seed(15213)
res_combineEC <- MISKmeans(data_combineEC,K=3,group=groups_EC,gamma=0.7, silent=T)
table(res_combineEC$ws != 0)
res_combineEC$Cs

ws_gene <- res_combineEC$ws[omicsTyps_EC=="gene"]
table(ws_gene!=0)
ws_CNV <- res_combineEC$ws[omicsTyps_EC=="CNV"]
table(ws_CNV!=0)

visualization for the gene expression profile and CNV profile

## generate gene orders
aheatmap <- getWsHeatmap(BRCA_Expr_S1, res_combineEC$Cs[[1]], ws = ws_gene, Rowv=TRUE, main="S1, Expr")
bheatmap <- getWsHeatmap(BRCA_CNV_S1, res_combineEC$Cs[[1]], ws = ws_CNV, Rowv=TRUE, main="S1, CNV")
getWsHeatmap(BRCA_Expr_S1, res_combineEC$Cs[[1]], ws = ws_gene, geneOrder = aheatmap$rowInd, main="S1, Expr")
getWsHeatmap(BRCA_Expr_S2, res_combineEC$Cs[[2]], ws = ws_gene, geneOrder = aheatmap$rowInd, main="S2, Expr")
getWsHeatmap(BRCA_CNV_S1, res_combineEC$Cs[[1]], ws = ws_CNV, geneOrder = bheatmap$rowInd, main="S1, CNV")
getWsHeatmap(BRCA_CNV_S2, res_combineEC$Cs[[2]], ws = ws_CNV, geneOrder = bheatmap$rowInd, main="S2, CNV")

An example on the integration of gene expression data + methylation data.

We first need to combine the profiles of Expr and methylation

study1_combineEM <- rbind(BRCA_Expr_S1, BRCA_Meth_S1)
study2_combineEM <- rbind(BRCA_Expr_S2, BRCA_Meth_S2)
data_combineEM <- list(t(study1_combineEM), t(study2_combineEM))

Second we need to prepare the grouping information between Expr and methylation The grouping information should be input as a list:

omicsTyps_EM <- c(rep('gene',nrow(BRCA_Expr_S1)),rep('methyl',nrow(BRCA_Meth_S1)))
featureNames_EM <- c(rownames(BRCA_Expr_S1),rownames(BRCA_Meth_S1))
groups_EM <- split(1:length(featureNames_EM),featureNames_EM)

Third step is to perform MISKmeans

set.seed(15213)
res_combineEM <- MISKmeans(data_combineEM,K=3,group=groups_EM,gamma=0.7, silent=T)
table(res_combineEM$ws != 0)
res_combineEM$Cs

ws_gene <- res_combineEM$ws[omicsTyps_EM=="gene"]
table(ws_gene!=0)
ws_methyl <- res_combineEM$ws[omicsTyps_EM=="methyl"]
table(ws_methyl!=0)

visualization for the gene expression profile and CNV profile

aheatmap <- getWsHeatmap(BRCA_Expr_S1, res_combineEM$Cs[[1]], ws = ws_gene, Rowv=TRUE, main="S1, Expr")
cheatmap <- getWsHeatmap(BRCA_Meth_S1, res_combineEM$Cs[[1]], ws = ws_methyl, Rowv=TRUE, main="S1, methylation")
getWsHeatmap(BRCA_Expr_S1, res_combineEM$Cs[[1]], ws = ws_gene, geneOrder = aheatmap$rowInd, main="S1, Expr")
getWsHeatmap(BRCA_Expr_S2, res_combineEM$Cs[[2]], ws = ws_gene, geneOrder = aheatmap$rowInd, main="S2, Expr")
getWsHeatmap(BRCA_Meth_S1, res_combineEM$Cs[[1]], ws = ws_methyl, geneOrder = cheatmap$rowInd, main="S1, methylation")
getWsHeatmap(BRCA_Meth_S2, res_combineEM$Cs[[2]], ws = ws_methyl, geneOrder = cheatmap$rowInd, main="S2, methylation")

Compare the subtyping result by (gene expression data + CNV data) and (gene expression data + methylation data). Such comparison is performed by adjusted rand index. Details please refer to the real data example in the MISKmeans manuscript.

library(mclust)
adjustedRandIndex(res_combineEM$Cs[[1]], res_combineEC$Cs[[1]])
adjustedRandIndex(res_combineEM$Cs[[2]], res_combineEC$Cs[[2]])

An example on the integration of gene expression data + CNV data + methylation data.

We first need to combine the profiles of Expr, CNV and methylation

study1_combineECM <- rbind(BRCA_Expr_S1, BRCA_CNV_S1, BRCA_Meth_S1)
study2_combineECM <- rbind(BRCA_Expr_S2, BRCA_CNV_S1, BRCA_Meth_S2)
data_combineECM <- list(t(study1_combineECM), t(study2_combineECM))

Second we need to prepare the grouping information between Expr, CNV and methylation The grouping information should be input as a list:

omicsTyps_ECM <- c(rep('gene',nrow(BRCA_Expr_S1)),rep('CNV',nrow(BRCA_CNV_S1)),rep('methyl',nrow(BRCA_Meth_S1)))
featureNames_ECM <- c(rownames(BRCA_Expr_S1),rownames(BRCA_CNV_S1),rownames(BRCA_Meth_S1))
groups_ECM <- split(1:length(featureNames_ECM),featureNames_ECM)

Third step is to perform MISKmeans

set.seed(15213)
res_combineECM <- MISKmeans(data_combineECM,K=3,group=groups_ECM,gamma=0.7, silent=T)
table(res_combineECM$ws != 0)
res_combineECM$Cs

ws_gene <- res_combineECM$ws[omicsTyps_ECM=="gene"]
table(ws_gene!=0)
ws_CNV <- res_combineECM$ws[omicsTyps_ECM=="CNV"]
table(ws_CNV!=0)
ws_methyl <- res_combineECM$ws[omicsTyps_ECM=="methyl"]
table(ws_methyl!=0)

visualization for the gene expression profile, CNV profile and the methylation profile.

aheatmap <- getWsHeatmap(BRCA_Expr_S1, res_combineEM$Cs[[1]], ws = ws_gene, Rowv=TRUE)
bheatmap <- getWsHeatmap(BRCA_CNV_S1, res_combineEM$Cs[[1]], ws = ws_CNV, Rowv=TRUE)
cheatmap <- getWsHeatmap(BRCA_Meth_S1, res_combineEM$Cs[[1]], ws = ws_methyl, Rowv=TRUE)
getWsHeatmap(BRCA_Expr_S1, res_combineECM$Cs[[1]], ws = ws_gene, geneOrder = aheatmap$rowInd, main="S1, Expr")
getWsHeatmap(BRCA_Expr_S2, res_combineECM$Cs[[2]], ws = ws_gene,aheatmap$rowInd, main="S2, Expr")
getWsHeatmap(BRCA_CNV_S1, res_combineECM$Cs[[1]], ws = ws_CNV, geneOrder = bheatmap$rowInd, main="S1, CNV")
getWsHeatmap(BRCA_CNV_S2, res_combineECM$Cs[[2]], ws = ws_CNV, bheatmap$rowInd, main="S2, CNV")
getWsHeatmap(BRCA_Meth_S1, res_combineECM$Cs[[1]], ws = ws_methyl, geneOrder = cheatmap$rowInd, main="S1, methylation")
getWsHeatmap(BRCA_Meth_S2, res_combineECM$Cs[[2]], ws = ws_methyl, geneOrder = cheatmap$rowInd, main="S2, methylation")

Compare the subtyping result by ECM (gene expression data + CNV data + Methylation data) and results by EC and EM. Such comparison is performed by adjusted rand index. Details please refer to the real data example in the MISKmeans manuscript.

## ECM vs EC
adjustedRandIndex(res_combineECM$Cs[[1]], res_combineEC$Cs[[1]])
adjustedRandIndex(res_combineECM$Cs[[2]], res_combineEC$Cs[[2]])

## ECM vs EM
adjustedRandIndex(res_combineECM$Cs[[1]], res_combineEM$Cs[[1]])
adjustedRandIndex(res_combineECM$Cs[[2]], res_combineEM$Cs[[2]])

We can see from the clustering results for ECM and EM are the same, but EC result is a little bit different from ECM and EM.

Other issues.

  1. For number of clusters $K$, we would suggest use to use standard tools (e.g. gap statistics, prediction strength) to predict $K$ in each individual study and data types, and then do a joint prediction.

  2. For the tuning parameter gamma, though we have developed a method via gap statistics, but it doesn't work out very well as it will select more features than expected in real data. This is a hard problem we couldn't properly solve. We suggest users to try a range of gamma and check which one makes better biological sense. Below is an example how to try a range of gamma from 0.1 to 0.7.

{eval=F} res_combineECM <- MISKmeans(data_combineECM,K=3,group=groups_ECM,gamma=seq(0.1, 0.7, 0.1), silent=T)



Caleb-Huo/MIS-Kmeans documentation built on May 17, 2019, 2:45 p.m.