Introduction

   

IMIX is an R package for integration of multiple genomics data types to investigate the associations between genes and a specific outcome, including binary, continuous, survival, and categorical outcomes based on finite multivariate Gaussian mixture modelling using summary statistics. The input is summary statistics for each data type including either p-value or z-score. We support summary statistics of data outputs such as DNA methylation, copy number variation(CNV), and gene expression (RNAseq/microarray) at gene level. Nonethelss, IMIX is as flexible as it can be extended to other molecular level as long as the summary statistics of the multiple data types are coherent with each other. It provides features to select the true number of components behind the data, parameter estimation for the summary statistics via EM algorithm. The most important feature is that it evaluates the data through different covariance and mean structures of mixture modelling and selects the overall best fitting model, which in turn provides the oracle output while controlling for the across-data-type false discovery rate (FDR) at a user specified level.

This document gives a quick tour of IMIX functionalities. The tasks addressed in this package include assessment of the true number of components with respect to the data, identification of interesting genes for each data type combination, FDR control, and plotting functionality. See help(package="IMIX") for further details and references provided by citation("IMIX").

   

library("IMIX")

Data Preparation

   

Example data 1: p values for RNAseq and CNV data

Each row is a gene, each column is a data type. The dimension of the input data is $1000 \times 2$.

   

data("data_p")
dim(data_p)
head(data_p)

   

Example data 2: Simulate z scores for DNA methylation, RNAseq, and CNV data

The dimension is $1000 \times 3$.

   

library(MASS)
N <- 1000
truelabel <- sample(1:8,
                    prob = rep(0.125, 8),
                    size = N,
                    replace = TRUE)
mu1 <- c(0, 5)
mu2 <- c(0, 5)
mu3 <- c(0, 5)
mu1_mv <- c(mu1[1], mu2[1], mu3[1])
mu2_mv <- c(mu1[2], mu2[1], mu3[1])
mu3_mv <- c(mu1[1], mu2[2], mu3[1])
mu4_mv <- c(mu1[1], mu2[1], mu3[2])
mu5_mv <- c(mu1[2], mu2[2], mu3[1])
mu6_mv <- c(mu1[2], mu2[1], mu3[2])
mu7_mv <- c(mu1[1], mu2[2], mu3[2])
mu8_mv <- c(mu1[2], mu2[2], mu3[2])

cov_sim <- list()
for (i in 1:8) {
  cov_sim[[i]] <- diag(3)
}
data_z <- array(0, c(N, 3))
data_z[which(truelabel == 1),] <- mvrnorm(
  n = length(which(truelabel == 1)),
  mu = mu1_mv,
  Sigma = cov_sim[[1]],
  tol = 1e-6,
  empirical = FALSE
)
data_z[which(truelabel == 2),] <- mvrnorm(
  n = length(which(truelabel == 2)),
  mu = mu2_mv,
  Sigma = cov_sim[[2]],
  tol = 1e-6,
  empirical = FALSE
)
data_z[which(truelabel == 3),] <- mvrnorm(
  n = length(which(truelabel == 3)),
  mu = mu3_mv,
  Sigma = cov_sim[[3]],
  tol = 1e-6,
  empirical = FALSE
)
data_z[which(truelabel == 4),] <- mvrnorm(
  n = length(which(truelabel == 4)),
  mu = mu4_mv,
  Sigma = cov_sim[[4]],
  tol = 1e-6,
  empirical = FALSE
)
data_z[which(truelabel == 5),] <- mvrnorm(
  n = length(which(truelabel == 5)),
  mu = mu5_mv,
  Sigma = cov_sim[[5]],
  tol = 1e-6,
  empirical = FALSE
)
data_z[which(truelabel == 6),] <- mvrnorm(
  n = length(which(truelabel == 6)),
  mu = mu6_mv,
  Sigma = cov_sim[[6]],
  tol = 1e-6,
  empirical = FALSE
)
data_z[which(truelabel == 7),] <- mvrnorm(
  n = length(which(truelabel == 7)),
  mu = mu7_mv,
  Sigma = cov_sim[[7]],
  tol = 1e-6,
  empirical = FALSE
)
data_z[which(truelabel == 8),] <- mvrnorm(
  n = length(which(truelabel == 8)),
  mu = mu8_mv,
  Sigma = cov_sim[[8]],
  tol = 1e-6,
  empirical = FALSE
)

rownames(data_z) <- paste0("gene", 1:N)
colnames(data_z) <- c("z.methy", "z.ge", "z.cnv")
dim(data_z)

   

Model selection: select the number of components for the mixture distribution

 

Example 1: Two data types, p values

 

select_comp1 <- model_selection_component(data_p, data_type = "p", seed = 20)

   

names(select_comp1)
select_comp1$Component_Selected_AIC
select_comp1$Component_Selected_BIC

 

The model selected 3 components out of 4. Then we visualize it.

plot_component(select_comp1, type = "AIC")
plot_component(select_comp1, type = "BIC")

   

Example 2: Three data types, z scores

 

select_comp2 <- model_selection_component(data_z, data_type = "z")
names(select_comp2)
select_comp2$Component_Selected_AIC
select_comp2$Component_Selected_BIC

   

The model selected all 8 components. Then we visualize it.

plot_component(select_comp2, type = "AIC")
plot_component(select_comp2, type = "BIC")

   

Integrative genomics test for two and three omics data types

   

Example 1: Two data types, p values

 

Initial values for the p transformed z score mixture model, this step can be skipped.

 

Initial values for the mean vector for (null_1,alternative_1,null_2,alternative_2) in data type 1 and data type 2, here we assume for data type 1, the mean of null distribution is 0, alternative distribution is 3; for data type 2, the mean of null distribution is 0, alternative distribution is 3.

mu_input <- c(0,3,0,3)

Initial values for the standard deviation for (null_1,alternative_1,null_2,alternative_2)

sigma_input <- rep(1,4)

Initial values for the proportion of components in each data type respectively following
(null_1,alternative_1,null_2,alternative_2), here we assume for data type 1, the proportions of null distribution and alternative distribution are both 0.5 (adding together is 1); for data type 2, the proportions of null distribution and alternative distribution are both 0.5 (adding together is 1).

p_input <- rep(0.5,4)

   

Start the test

test1 <- IMIX(
  data_input = data_p,
  data_type = "p",
  mu_ini = mu_input,
  sigma_ini = sigma_input,
  p_ini = p_input,
  alpha = 0.1,
  model_selection_method = "AIC"
  )

 

Result outputs of example 1

test1$estimatedFDR # Print the estimated across-data-type FDR for each component
test1$`AIC/BIC` # The AIC and BIC values for each model
test1$`Selected Model` # The best fitted model selected by AIC
str(test1$IMIX_cor_twostep)
dim(test1$significant_genes_with_FDRcontrol)
head(test1$significant_genes_with_FDRcontrol)

   

The results for each gene, this includes localFDR, classes with across-data-type FDR control at $\alpha=0.1$ and classes without across-data-type FDR control. Here the class labels corresponds to 1=(ge-,cnv-),2=(ge+,cnv-),3=(ge-,cnv+),4=(ge+,cnv+). We could see that component 3 is missing here after controlling for FDR, and there are only 9 genes in component 3 before we control for FDR. This result is coherent with the model selection result.

   

Example 2: Three data types, tranformed z values

 

IMIX test without specifying the initial values of the parameters

 

test2 <- IMIX(
  data_input = data_z,
  data_type = "z",
  alpha = 0.05,
  verbose = TRUE
)

 

Results of example 2

test2$estimatedFDR
test2$`AIC/BIC`
test2$`Selected Model`  # The best fitted model selected by BIC
str(test2$IMIX_ind)
dim(test2$significant_genes_with_FDRcontrol)
head(test2$significant_genes_with_FDRcontrol)

   

For the output includes each model parameter estimations, the model selection AIC and BIC values and the best selected model. The estimated FDR corresponding to the prespecified $\alpha=0.05$ threshold, the local FDR and class labels for each gene, both without FDR control and based on the FDR control at $\alpha=0.05$. Here the class labels corresponds to 1=(meth-,ge-,cnv-),2=(meth+,ge-,cnv-),3=(meth-,ge+,cnv-),4=(meth-,ge-,cnv+),5=(meth+,ge+,cnv-),6=(meth+,ge-,cnv+),7=(meth-,ge+,cnv+),8=(meth+,ge+,cnv+).

   

Additional FDR control

   

We provide two additional FDR control functions for the IMIX test result to achieve

  1. Another prespecified $\alpha$ level than the one prespecified using the IMIX function, users do not need to rerun IMIX test for another nominal level for the test result.

  2. A combination of multiple components for the IMIX result output, for example, if we want to control the FDR for genes that are associated with the outcome through gene expression, regardless of cnv, then the combination of components would be (ge+,cnv-) and (ge+,cnv+).

 

A different $\alpha$ threshold from IMIX() function

 

Users can always have results controlling for FDR at a different $\alpha$ level than the output in IMIX() function, with no need to rerun the IMIX() test. Below is an R code example for data example 2 in controlling the FDR at $\alpha=0.2$.

fdr_control1 <- FDR_control_adaptive_imix(imix_output = test2, model = "IMIX_cor", alpha = 0.2) 
# The input is the result output from IMIX() function
fdr_control1$estimatedFDR
head(fdr_control1$significant_genes_with_FDRcontrol)
table(fdr_control1$significant_genes_with_FDRcontrol$class_FDRcontrol)

   

A combination of multiple components

 

Users can always have results controlling for FDR for a combination of components, we use data example 1 to illustrate this. Suppose we are interested in the genes that are associated with the outcome through gene expression, regardless of whether it associated with CNV, then we want to control FDR for (ge+,cnv-) and (ge+,cnv+). The corresponding label is class 2 & 4. Below is an R code example for data example 1 in controlling the FDR at $\alpha=0.2$ for component 2 & component 4.

test1$`Selected Model`
lfdr_ge_combined <- 1 - (test1$IMIX_cor_twostep$`posterior prob`[,2] + 
                          test1$IMIX_cor_twostep$`posterior prob`[,4]) 
# Class 2: (ge+,cnv-); class 4: (ge+,cnv+)
names(lfdr_ge_combined) <- rownames(test1$IMIX_cor_twostep$`posterior prob`)
fdr_control2 <- FDR_control_adaptive(lfdr = lfdr_ge_combined, alpha = 0.2)
fdr_control2$estimatedFDR
table(fdr_control2$significant_genes_with_FDRcontrol)

The result output fdr_control2 shows that 693 genes (indicator 1) are associated with the outcome through gene expression at FDR $\alpha=0.2$ for data example 1. The estimated mFDR is 0.1995.

     

sessionInfo()

References



ziqiaow/IMIX documentation built on Nov. 19, 2022, 11:18 a.m.