IMIX is an R package for integration of multiple genomics data types to investigate the associations between genes and a specific outcome, including binary, continuous, survival, and categorical outcomes based on finite multivariate Gaussian mixture modelling using summary statistics. The input is summary statistics for each data type including either p-value or z-score. We support summary statistics of data outputs such as DNA methylation, copy number variation(CNV), and gene expression (RNAseq/microarray) at gene level. Nonethelss, IMIX is as flexible as it can be extended to other molecular level as long as the summary statistics of the multiple data types are coherent with each other. It provides features to select the true number of components behind the data, parameter estimation for the summary statistics via EM algorithm. The most important feature is that it evaluates the data through different covariance and mean structures of mixture modelling and selects the overall best fitting model, which in turn provides the oracle output while controlling for the across-data-type false discovery rate (FDR) at a user specified level.
This document gives a quick tour of IMIX functionalities. The tasks addressed in this package include assessment of the true number of components with respect to the data, identification of interesting genes for each data type combination, FDR control, and plotting functionality. See help(package="IMIX") for further details and references provided by citation("IMIX").
library("IMIX")
Each row is a gene, each column is a data type. The dimension of the input data is $1000 \times 2$.
data("data_p") dim(data_p) head(data_p)
The dimension is $1000 \times 3$.
library(MASS) N <- 1000 truelabel <- sample(1:8, prob = rep(0.125, 8), size = N, replace = TRUE) mu1 <- c(0, 5) mu2 <- c(0, 5) mu3 <- c(0, 5) mu1_mv <- c(mu1[1], mu2[1], mu3[1]) mu2_mv <- c(mu1[2], mu2[1], mu3[1]) mu3_mv <- c(mu1[1], mu2[2], mu3[1]) mu4_mv <- c(mu1[1], mu2[1], mu3[2]) mu5_mv <- c(mu1[2], mu2[2], mu3[1]) mu6_mv <- c(mu1[2], mu2[1], mu3[2]) mu7_mv <- c(mu1[1], mu2[2], mu3[2]) mu8_mv <- c(mu1[2], mu2[2], mu3[2]) cov_sim <- list() for (i in 1:8) { cov_sim[[i]] <- diag(3) } data_z <- array(0, c(N, 3)) data_z[which(truelabel == 1),] <- mvrnorm( n = length(which(truelabel == 1)), mu = mu1_mv, Sigma = cov_sim[[1]], tol = 1e-6, empirical = FALSE ) data_z[which(truelabel == 2),] <- mvrnorm( n = length(which(truelabel == 2)), mu = mu2_mv, Sigma = cov_sim[[2]], tol = 1e-6, empirical = FALSE ) data_z[which(truelabel == 3),] <- mvrnorm( n = length(which(truelabel == 3)), mu = mu3_mv, Sigma = cov_sim[[3]], tol = 1e-6, empirical = FALSE ) data_z[which(truelabel == 4),] <- mvrnorm( n = length(which(truelabel == 4)), mu = mu4_mv, Sigma = cov_sim[[4]], tol = 1e-6, empirical = FALSE ) data_z[which(truelabel == 5),] <- mvrnorm( n = length(which(truelabel == 5)), mu = mu5_mv, Sigma = cov_sim[[5]], tol = 1e-6, empirical = FALSE ) data_z[which(truelabel == 6),] <- mvrnorm( n = length(which(truelabel == 6)), mu = mu6_mv, Sigma = cov_sim[[6]], tol = 1e-6, empirical = FALSE ) data_z[which(truelabel == 7),] <- mvrnorm( n = length(which(truelabel == 7)), mu = mu7_mv, Sigma = cov_sim[[7]], tol = 1e-6, empirical = FALSE ) data_z[which(truelabel == 8),] <- mvrnorm( n = length(which(truelabel == 8)), mu = mu8_mv, Sigma = cov_sim[[8]], tol = 1e-6, empirical = FALSE ) rownames(data_z) <- paste0("gene", 1:N) colnames(data_z) <- c("z.methy", "z.ge", "z.cnv") dim(data_z)
select_comp1 <- model_selection_component(data_p, data_type = "p", seed = 20)
names(select_comp1) select_comp1$Component_Selected_AIC select_comp1$Component_Selected_BIC
The model selected 3 components out of 4. Then we visualize it.
plot_component(select_comp1, type = "AIC") plot_component(select_comp1, type = "BIC")
select_comp2 <- model_selection_component(data_z, data_type = "z") names(select_comp2) select_comp2$Component_Selected_AIC select_comp2$Component_Selected_BIC
The model selected all 8 components. Then we visualize it.
plot_component(select_comp2, type = "AIC") plot_component(select_comp2, type = "BIC")
Initial values for the p transformed z score mixture model, this step can be skipped.
Initial values for the mean vector for (null_1,alternative_1,null_2,alternative_2) in data type 1 and data type 2, here we assume for data type 1, the mean of null distribution is 0, alternative distribution is 3; for data type 2, the mean of null distribution is 0, alternative distribution is 3.
mu_input <- c(0,3,0,3)
Initial values for the standard deviation for (null_1,alternative_1,null_2,alternative_2)
sigma_input <- rep(1,4)
Initial values for the proportion of components in each data type respectively following
(null_1,alternative_1,null_2,alternative_2), here we assume for data type 1, the proportions of null distribution and alternative distribution are both 0.5 (adding together is 1); for data type 2, the proportions of null distribution and alternative distribution are both 0.5 (adding together is 1).
p_input <- rep(0.5,4)
Start the test
test1 <- IMIX( data_input = data_p, data_type = "p", mu_ini = mu_input, sigma_ini = sigma_input, p_ini = p_input, alpha = 0.1, model_selection_method = "AIC" )
Result outputs of example 1
test1$estimatedFDR # Print the estimated across-data-type FDR for each component test1$`AIC/BIC` # The AIC and BIC values for each model test1$`Selected Model` # The best fitted model selected by AIC str(test1$IMIX_cor_twostep) dim(test1$significant_genes_with_FDRcontrol) head(test1$significant_genes_with_FDRcontrol)
The results for each gene, this includes localFDR, classes with across-data-type FDR control at $\alpha=0.1$ and classes without across-data-type FDR control. Here the class labels corresponds to 1=(ge-,cnv-),2=(ge+,cnv-),3=(ge-,cnv+),4=(ge+,cnv+). We could see that component 3 is missing here after controlling for FDR, and there are only 9 genes in component 3 before we control for FDR. This result is coherent with the model selection result.
IMIX test without specifying the initial values of the parameters
test2 <- IMIX( data_input = data_z, data_type = "z", alpha = 0.05, verbose = TRUE )
Results of example 2
test2$estimatedFDR test2$`AIC/BIC` test2$`Selected Model` # The best fitted model selected by BIC str(test2$IMIX_ind) dim(test2$significant_genes_with_FDRcontrol) head(test2$significant_genes_with_FDRcontrol)
For the output includes each model parameter estimations, the model selection AIC and BIC values and the best selected model. The estimated FDR corresponding to the prespecified $\alpha=0.05$ threshold, the local FDR and class labels for each gene, both without FDR control and based on the FDR control at $\alpha=0.05$. Here the class labels corresponds to 1=(meth-,ge-,cnv-),2=(meth+,ge-,cnv-),3=(meth-,ge+,cnv-),4=(meth-,ge-,cnv+),5=(meth+,ge+,cnv-),6=(meth+,ge-,cnv+),7=(meth-,ge+,cnv+),8=(meth+,ge+,cnv+).
We provide two additional FDR control functions for the IMIX test result to achieve
Another prespecified $\alpha$ level than the one prespecified using the IMIX function, users do not need to rerun IMIX test for another nominal level for the test result.
A combination of multiple components for the IMIX result output, for example, if we want to control the FDR for genes that are associated with the outcome through gene expression, regardless of cnv, then the combination of components would be (ge+,cnv-) and (ge+,cnv+).
Users can always have results controlling for FDR at a different $\alpha$ level than the output in IMIX() function, with no need to rerun the IMIX() test. Below is an R code example for data example 2 in controlling the FDR at $\alpha=0.2$.
fdr_control1 <- FDR_control_adaptive_imix(imix_output = test2, model = "IMIX_cor", alpha = 0.2) # The input is the result output from IMIX() function fdr_control1$estimatedFDR head(fdr_control1$significant_genes_with_FDRcontrol) table(fdr_control1$significant_genes_with_FDRcontrol$class_FDRcontrol)
Users can always have results controlling for FDR for a combination of components, we use data example 1 to illustrate this. Suppose we are interested in the genes that are associated with the outcome through gene expression, regardless of whether it associated with CNV, then we want to control FDR for (ge+,cnv-) and (ge+,cnv+). The corresponding label is class 2 & 4. Below is an R code example for data example 1 in controlling the FDR at $\alpha=0.2$ for component 2 & component 4.
test1$`Selected Model` lfdr_ge_combined <- 1 - (test1$IMIX_cor_twostep$`posterior prob`[,2] + test1$IMIX_cor_twostep$`posterior prob`[,4]) # Class 2: (ge+,cnv-); class 4: (ge+,cnv+) names(lfdr_ge_combined) <- rownames(test1$IMIX_cor_twostep$`posterior prob`) fdr_control2 <- FDR_control_adaptive(lfdr = lfdr_ge_combined, alpha = 0.2) fdr_control2$estimatedFDR table(fdr_control2$significant_genes_with_FDRcontrol)
The result output fdr_control2 shows that 693 genes (indicator 1) are associated with the outcome through gene expression at FDR $\alpha=0.2$ for data example 1. The estimated mFDR is 0.1995.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.