library(knitr) library(rmarkdown) library(FastMix) ### some global parameters m = 5000 n = 50 bb <- c(1.5, 0, 0) aa <- rep(0, 6); aa[1] <- .75 sigma2.u <- 1 sigma2.e <- 0.6 sigma2.alpha <- 1.2 sigma2.err <- 1/16 # the data generation for main simulation data_gen <- function(m, n, seed = 1, outlier = T, cor = 0, balance = T, weight = F){ ## generate some new simulated data n <- n set.seed(seed) CellProp <- cbind(Cell1=runif(n, .2, .3), Cell2=runif(n, .2, .3), Cell3=runif(n, .2, .3)) rownames(CellProp) <- paste0("Subj", 1:n) Demo <- cbind(Severity=round(runif(n, 0, 10), 1), Sex=c(rep(0, n/2), rep(1, n/2))) rownames(Demo) <- paste0("Subj", 1:n) ## generate some true signals m <- m #genes bb <- c(1.5, 0, 0) #assoc. w. CellProp sigma2.u <- 1 sigma2.e <- 0.6 sigma2.alpha <- 1.2 sigma2.err <- 1/16 #sigma2.err <- 0.5 #i.i.d. noise ## mean interactions aa <- rep(0, 6); aa[1] <- .75 Bmat <- t(replicate(m, c(bb, 0, 0, aa))) #Bmat <- t(replicate(m, c(1.5, 0.5, 0.5, -0.5, -0.5, 0.75, 0.5, 0.5, 0.5, -0.5, -0.5))) L <- ncol(Bmat) sdvec <- (c(rep(sigma2.u, 3), rep(sigma2.e,2), rep(sigma2.alpha, 6))) if(outlier == T){ ### the location shift idx = c(1,4,7,10) str = sdvec[idx] * 3.5 OutSig = matrix(NA, ncol = 4, nrow = m) for(i in 1:4){ OutSig[,i] <- c(rep(0, m * 0.05 * (i-1)), rep(str[i], m * 0.05), rep(0, m*(1 - i * 0.05))) } if(balance == T) OutSig <- OutSig * (2*rbinom(length(OutSig), 1, .5) - 1) # approximately 0 mean else if (balance == F) OutSig <- OutSig * (2*rbinom(length(OutSig), 1, 0.9) - 1) # approximately 0 mean Bmat[,idx] = Bmat[,idx] + OutSig if(cor == 0){ Gmat <- matrix(rnorm(L*m),m,L) %*% diag(sdvec) sigmamat = diag(sdvec^2) for(i in 1:4){ Gmat[(m * 0.05 * (i-1)+1):(m * 0.05 * i), idx[i]] <- Gmat[(m * 0.05 * (i-1)+1):(m * 0.05 * i), idx[i]] } } else{ cormat <- matrix(cor, ncol = 11, nrow = 11) diag(cormat) = 1 sigmamat <- sweep(sweep(cormat, 1, sdvec, "*"), 2, sdvec, "*") a = chol(sigmamat) Gmat <- matrix(rnorm(L*m),m,L) %*% a for(i in 1:4){ Gmat[(m * 0.05 * (i-1)+1):(m * 0.05 * i), idx[i]] <- Gmat[(m * 0.05 * (i-1)+1):(m * 0.05 * i), idx[i]] } } } else{ Bmat = Bmat if(cor == 0){ Gmat <- matrix(rnorm(L*m),m,L) %*% diag(sdvec) sigmamat = diag(sdvec^2) } else{ cormat <- matrix(cor, ncol = 11, nrow = 11) diag(cormat) = 1 sigmamat <- sweep(sweep(cormat, 1, sdvec, "*"), 2, sdvec, "*") a = chol(sigmamat) Gmat <- matrix(rnorm(L*m),m,L) %*% a } } ## generate the gene expression if(typeof(weight) == "logical"){ if(weight == F){ Err <- matrix(rnorm(m*n, sd=sqrt(sigma2.err)), nrow=m, ncol=n) } } else{ # i.d but not i.i.d Err <- matrix(rnorm(m*n, sd=sqrt(sigma2.err)), nrow=m, ncol=n) ### the cor structure weight = weight e = eigen(weight) w = e$vectors %*% sqrt(diag(e$values)) %*% t(e$vectors) Err = Err %*% w } colnames(Err) <- paste0("Subj", 1:n); rownames(Err) <- paste0("Gene", 1:m) ## Use DataPrep to get Xmat; Y is irrelevant Xmat <- t(DataPrep(Err, CellProp, Demo, w = diag(rep(1, n)))$X[1:n, -1]) ## the gene expression matrix GeneExp <- (Bmat + Gmat) %*% Xmat + Err return(list(GeneExp = GeneExp, CellProp = CellProp, Demo = Demo, Bmat = Bmat, Gmat = Gmat, sigmamat = sigmamat, Err = Err)) }
In this Vignette, we describe the basic usage of the FastMix
package. This package implements a fast parameter estimation and hypothesis testing framework for a special type of linear mixed effects model inspired by the deconvolution project. Below we describe the inspiring problem, the mathematical model, and the usage of the main R function, FastMix()
.
We will briefly describe a motivating problem for which the FastMix
package is applicable.
In most genomic studies, gene expressions are quantified from a mixture of heterogeneous cells. In certain cases, we may have matching phenotype data (often from flow-cytometry) which quantify the proportion of $k=1, 2, \dots, K$ types of cells in the sample. One popular task is to use the observed cell proportions to "deconvolute" the mixed gene expressions, namely, to obtain cell-specific gene expressions for each subject (henceforth denoted as $b_{kij}$, for the $k$th cell, $i$th gene, $j$th subject).
In a sense, the need for producing subject-specific expressions is primarily driven by the downstream statistical analysis, in which $b_{kij}$ are considered as new observations to be associated with subject-specific clinical covariates, such as the disease severity, age, and sex.
From the statistical perspective, the main difficulty of the above approach is that the deconvolution step is a typical "large $p$, small $n$" problem. For each gene, we have $n$ observations of the mixed gene expression; yet our goal is to produce $Kn$ estimates for each subject ($i=1, 2, \dots, n$) and each type of cell ($k=1, 2, \dots, K$). While it is theoretically doable based on certain penalized regression, the resulting cell-specific expressions will inevitably have high level of variance.
In this study, we take a different approach, namely, to combine the deconvolution step with the downstream analysis into a single mixed effects model. The advantages of doing so are:
In summary, we have designed a one-step inferential framework based on mixed effects regression and outlier analysis for the deconvolution analysis that greatly reduces the model complexity, yet still flexible enough to produce estimates of the association between the cell-specific gene expressions and clinical covariates, as well as their corresponding $p$-values for each gene, cell, and covariate.
Based on a simple additive model, $Y_{ji}$, the expression level of the $i$th gene, $i=1, 2, \dots, m$ for the $j$th subject, $j=1, 2, \dots, n$, can be described as
$$Y_{ji} = \sum_{k=1}^{K} C_{jk} b_{kij} + \sum_{p=1}^{P} Z_{jp} d_{pi} + \epsilon_{ji}. $$
Here $C_{jk}$ is the proportion of the $k$th cell for the $j$th subject (measured by e.g., flow-cytometry); $b_{kij}$ is the expression of the $i$th gene produced by one unit of $k$th cell for the $j$th subject; $Z_{jp}$ is the value of the $p$th standardized clinical variable (such as age, sex, and severity; standardized so that all of them have mean zero and unit standard deviation therefore directly comparable) of the $j$th subject; $d_{pi}$ is the association of the $p$th clinical covariate to the $i$th gene; and $\epsilon_{ji}$ is the error term. We further assume that the cell-specific expression, $b_{kij}$, and covariate-specific association, $d_{pi}$, have the following mixed effects structure
$$ b_{kij} = b_{k} + u_{ki} + \sum_{p=1}^{P} Z_{jp} (a_{pk} + \alpha_{pki}), \qquad d_{pi} = \delta_{p} + e_{pi}. $$
Here $b_{k}$ is the mean contribution to expression from the $k$th cell to all genes; $u_{ki}$ is a random effect term specific to the $i$th gene; $a_{pk}$ represents the mean (over all subjects and genes) association between $Z_{jp}$ and $b_{kij}$; and $\alpha_{pki}$ is a random effect term that describes the association of $Z_{jp}$ and a specific gene; $\delta_{p}$ is the overall association of $p$ to all genes and $e_{pi}$ is a random effect that describes the association between the $p$th covariate and the $i$th gene.
We may merge the first two equations and obtain the following combined model
$$ Y_{ji} = \sum_{l = 1}^{L} X_{jl} \left( \beta_{l} + \gamma_{li} \right) + \epsilon_{ji}. $$
Here $X_{\cdot, \cdot}$ is the combined covariate matrix and it includes $C_{jk}$, $Z_{jp}$, and $C_{jk} Z_{jp}$. For example, if we have three types of cells and two clinical covariates, the total number of covariates would be $L = 3 + 2 + 3\times 2 = 11$. $\beta_{l}$ is the contribution of $X_{\cdot,l}$ (could be $C$, $Z$, or their interaction) to the entire transcriptome, and $\gamma_{li}$ is the random effect term that characterizes the effect of $X_{\cdot,l}$ to a specific gene.
The above model is a standard LMER so it can be fitted by a stock mixed effects regression software, such as R package lme4
, the reference implementation of LMER in R. However, due to the high-throughput nature of the data, fitting such a large regression model is computationally very demanding. Furthermore, there are occasional convergence issues due to the use of the EM algorithm in lme4
to the high-throughput data, which could do harm to the parameter estimation of interest. See the simulation studies in document FAST_LMER_project.pdf
.
Instead, we developed an efficient estimation framework based on an initial OLS regression, moment method, and the EBLUP. For high-throughput data, the proposed method achieves comparable accuracy with the EM-based LMER ï¬tting algorithm with only a fraction of computational time.
In addition, we develop a novel competitive hypothesis test to identify genes that have signiï¬cantly larger or smaller predicted random eï¬ect with a given covariatewe. It means for a given $l$, whether the estimated random effect for a specific gene ($\gamma_{li}$) is significantly different from its "main distribution". Specifically, we consider that the previous combined model only describes the distribution of the majority of genes; and there may be small subsets of genes that have different distribution of linear coefficients. In other words, if gene $i$ is in this subset, its linear coefficient associated some $X_{\cdot l}$ does not follow $N(\beta_{l}, \sigma_{\gamma_{l}}^2)$, and could be detected as an outlier from the major distribution.
The proposed outlier detection includes two steps: debias the covariance estimation of random effects caused by outliers and do z-test using weighted EBLUP . Step one is the core part and we briefly describe how to achieve it here: we first find the potential direction with outliers by a normal test of weighted EBLUP; we then construct a $\chi^2$-type statistics from OLS estimation and trim the extreme values to do gene selection; lastly we recover the covariance estimation of random effects according to the proporty of trimmed $\chi^2$-distribution.
Technical details and simulation results are summarized in another document, FAST_LMER_project.pdf
.
The primary function in our package is FastMix()
. It takes the following primary arguments: GeneExp
, CellProp
, Clinical
, random
and robust
. Here GeneExp
is a $m\times n$ dimensional gene expression matrix; CellProp
is a $n\times K$ dimensional matrix of cell proportions; Clinical
is a $n\times P$ dimensional matrix of clinical and demographic variables to be tested; random
is an index vector that specifies which variable(s) requires random effects -- by default, all covariates are paired with a random effect and robust
specifies whether robust covariance estimation is implemented and which method to use: "FALSE" for non-robust estimation; "mcd" for the MCD algorithm of Rousseeuw and Van Driessen; "weighted" for the Reweighted MCD; "donostah" for the Donoho-Stahel projection based estimator; "pairwiseQC" for the orthogonalized quadrant correlation pairwise estimator. All these algorithms come from the R package robust
. "FastMix" is the proposed trimming method.
We first generate a set of data, called dat_train
, which contains $m=r m
$ genes and $n=r n
$ subjects. There are three cell types and the data are stored in object CellProp
. I created a clinical dataset called Demo
, which consists of two variables, Severity
and Sex
. Based on the previous discussion, the observed bulk gene expression may be associated with a total of $L=11$ covariates: 3 cell proportions, 2 clinical variables, and 6 of the interaction terms.
I created true associations between the covariates and gene expressions in this way.
r bb[1]
$, $b_2 = r bb[2]
$, $b_3 = r bb[3]
$. r sigma2.u
$.r sigma2.e
$.r aa[1]
$, $\sigma_{\alpha} = r sigma2.alpha
$. r sigma2.err
.The aforementioned data is provided as dat_train
in dataexample.rda
, distributed along with this R package, to examplify the simulated data. dat_train
is a list of three objects: (a) object CellProp
, which stores the proportion of cell types for each subject with 50 rows and 3 columns; object Demo
, a clinical dataset which consists of two variables, Severity
and Sex
, for each subjects with 50 rows and 2 columns, and (c) object GeneExp
, a matrix with 5000 rows and 50 columns, stores the outcome of interest which consists of gene expression level with respect to subjects and genes. These three objects are required inputs for function FastMix()
. Based on the previous discussion, the observed bulk gene expression may be associated with a total of $L=11$ covariates: 3 cell proportions, 2 clinical variables, and 6 of the interaction terms.
Below is the default way of using FastMix
to analyze the simulated data
library(FastMix) data(dat_train) #loads the sample data #vignette("FastMix") #shows this vignette ## fitting the gene expression data, cell proportions, and demographic ## data in one step. It takes about 10~15 seconds to finish computing system.time(mod1 <- FastMix(dat_train$GeneExp, dat_train$CellProp, dat_train$Demo, independent = F)) ### add this exmaple for later use: test score_func when 'include.demo = F' system.time(mod1_F <- FastMix(dat_train$GeneExp, dat_train$CellProp, dat_train$Demo, independent = F, include.demo = F))
Now let us summarize the results. First, let us check the fixed effects (the overall association between the covariates, their interactions, and the transcriptome
## 1. The overall association between covariates (with interactions) ## and the entire transcriptome. We know that Cell1 is significantly ## associated with all genes; and Cell1 has a significant interaction ## with Severity (but not Sex). All other nine covariates are not ## significant. kable(round(mod1$fixed.results, 4))
As we can see, FastMix correctly identified the only two truly significant covariates (Cell1 and the interaction between Cell1 and Severity).
Next, we check the statistical power and type I error rate for the random effects.
## (a) The first 250 genes have significant association with Cell1, ## (b) genes 251-500 has significant association with Severity, (c) ## genes 501 - 750 are significantly associated with Cell2.Severity, ## and (d) genes 751 - 1000 are significantly associated with Cell2.Sex. m <- nrow(dat_train$GeneExp) n = nrow(dat_train$Demo) rr.cell1 <- c(Power=sum(mod1$re.ind.pvalue[1:250, "Cell1"]<0.05)/250, TypeI.Err=sum(mod1$re.ind.pvalue[251:m, "Cell1"]<0.05)/(m-250)) rr.severity <- c(Power=sum(mod1$re.ind.pvalue[251:500, "Severity"]<0.05)/250, TypeI.Err=sum(mod1$re.ind.pvalue[c(1:250, 501:m), "Severity"]<0.05)/(m-250)) rr.cell2.severity <- c(Power=sum(mod1$re.ind.pvalue[501:750, "Cell2.Severity"]<0.05)/250, TypeI.Err=sum(mod1$re.ind.pvalue[c(1:500, 751:m), "Cell2.Severity"]<0.05)/(m-250)) rr.cell2.sex <- c(Power=sum(mod1$re.ind.pvalue[751:1000, "Cell2.Sex"]<0.05)/250, TypeI.Err=sum(mod1$re.ind.pvalue[c(1:750, 1001:m), "Cell2.Sex"]<0.05)/(m-250)) rr.others <- c(Power=NA, TypeI.Err=sum(mod1$re.ind.pvalue[, -c(1,4,10)]<0.05)/(m*8)) retab <- cbind(Cell1=rr.cell1, Severity=rr.severity, Cell2.Severity=rr.cell2.severity, Cell2.Sex=rr.cell2.sex, Others=rr.others) kable(round(retab, 3))
Judging from the above table, we see that for all covariates, the type I error is controlled at the nominal level. We also noticed that for all covariates, we have relatively good power to detect gene-specific associations between the covariates, including the interactions between cell proportion, clinical variables, and gene expressions.
In many applications, the noise may not be i.i.d. One such example is that samples in the study may have different Total Number of Mapped Reads (TNMR, a.k.a. sequencing depths). This information is typically provided by the researchers as a quality metric, and can be considered as inversely proportional to the variance of the (properly normalized) expression levels in a sample. In this case, we argue that using weights that are proportional to the TNMR will give us better results than a comparable model that treats all samples as equal.
In this section, we consider the following simulation study:
set.seed(1234) ## Total number of mapped reads, a.k.a., sequencing depth, in millions. TNMR <- runif(50, min=1, max=5) ## We assume that the variance is proportional to TNMR for ## normalized expressions ## we use sd = 1 as the baseline cov_matrix <- diag(max(TNMR)/TNMR) ### we generate a weighted data dat = data_gen(m, n, seed = 1, outlier = T, cor = 0.5, balance = F, weight = cov_matrix) system.time(mod2 <- FastMix(dat$GeneExp, dat$CellProp, dat$Demo, independent = F, cov_matrix = cov_matrix)) rr.cell1 <- c(Power=sum(mod2$re.ind.pvalue[1:250, "Cell1"]<0.05)/250, TypeI.Err=sum(mod2$re.ind.pvalue[251:m, "Cell1"]<0.05)/(m-250)) rr.severity <- c(Power=sum(mod2$re.ind.pvalue[251:500, "Severity"]<0.05)/250, TypeI.Err=sum(mod2$re.ind.pvalue[c(1:250, 501:m), "Severity"]<0.05)/(m-250)) rr.cell2.severity <- c(Power=sum(mod2$re.ind.pvalue[501:750, "Cell2.Severity"]<0.05)/250, TypeI.Err=sum(mod2$re.ind.pvalue[c(1:500, 751:m), "Cell2.Severity"]<0.05)/(m-250)) rr.cell2.sex <- c(Power=sum(mod2$re.ind.pvalue[751:1000, "Cell2.Sex"]<0.05)/250, TypeI.Err=sum(mod2$re.ind.pvalue[c(1:750, 1001:m), "Cell2.Sex"]<0.05)/(m-250)) rr.others <- c(Power=NA, TypeI.Err=sum(mod2$re.ind.pvalue[, -c(1,4,10)]<0.05)/(m*8)) retab <- cbind(Cell1=rr.cell1, Severity=rr.severity, Cell2.Severity=rr.cell2.severity, Cell2.Sex=rr.cell2.sex, Others=rr.others) kable(round(retab, 3)) set.seed(1234) system.time(mod3 <- FastMix(dat$GeneExp, dat$CellProp, dat$Demo, independent = F, cov_matrix = NULL)) rr.cell1 <- c(Power=sum(mod3$re.ind.pvalue[1:250, "Cell1"]<0.05)/250, TypeI.Err=sum(mod3$re.ind.pvalue[251:m, "Cell1"]<0.05)/(m-250)) rr.severity <- c(Power=sum(mod3$re.ind.pvalue[251:500, "Severity"]<0.05)/250, TypeI.Err=sum(mod3$re.ind.pvalue[c(1:250, 501:m), "Severity"]<0.05)/(m-250)) rr.cell2.severity <- c(Power=sum(mod3$re.ind.pvalue[501:750, "Cell2.Severity"]<0.05)/250, TypeI.Err=sum(mod3$re.ind.pvalue[c(1:500, 751:m), "Cell2.Severity"]<0.05)/(m-250)) rr.cell2.sex <- c(Power=sum(mod3$re.ind.pvalue[751:1000, "Cell2.Sex"]<0.05)/250, TypeI.Err=sum(mod3$re.ind.pvalue[c(1:750, 1001:m), "Cell2.Sex"]<0.05)/(m-250)) rr.others <- c(Power=NA, TypeI.Err=sum(mod3$re.ind.pvalue[, -c(1,4,10)]<0.05)/(m*8)) retab3 <- cbind(Cell1=rr.cell1, Severity=rr.severity, Cell2.Severity=rr.cell2.severity, Cell2.Sex=rr.cell2.sex, Others=rr.others) kable(round(retab3, 3))
The first table shows the results using sample-specific weights while the second table shows the results without sample-specific weights. We see that for all five covariates, the type I error is controlled at the nominal level for both methods.
When weights are used, we have relatively higher power to detect gene-specific associations between the covariates, especially for the interactions between cell proportion, clinical variables, and gene expressions.
Note that in our implementation of the weighted FastMix model, we can specify a much general covariance structure that could include both heterogeneous variances (as shown in the above example) and correlation among samples, possibly due to serial correlation in a longitudinal study. Howver, as of now, we require that such correlation is known (or estimable from the data). Please see [Yun and Xing 2019 BMC Bioinformatics] for a moment-based method for estimating such complex covariance structure. Also please see our current paper for more technical details of the weighted FastMix method.
In this section, we will present an example that uses the FastMix model to conduct a discriminant analysis for a specified binary outcome, such as the Severity
variable stored in our simulated data, dat_train
. Please see our paper for technical details of our approaches.
As in any practical discriminant analysis, models developed from a training dataset (dat_train
in this example) should be validated on an independent testing data set. To this end, we generate such a test data set, with the same simulation settings as dat_train
and a larger sample size ($n=200$), and name it as dat_test
.
Due to technical reasons, the test dataset needs to be pre-processed before model development. This can be done by calling function DataPrep_test()
, and the results are designated as Data_1
and Data_0
. We need to provide with DataPrep_test()
the index of the binary outcome to be classified in the original data, such as Sex
in our example. After this processing step, DataPrep_test()
will automatically rename it as Res
(stands for "Response"), and all the other covariates as Var1
, Var2
, etc. Note that by construction, the Res
variable in these sets will be constants across all subjects with opposite signs.
#================================================================================# #================= step 1: prepare the test data set ===========================# #================================================================================# ### this step cannot be wrap into the score function because we need ### to check the Data2 to find the index ## number of subjects in the test data n2 <- 200 gnames <- rownames(dat_test$GeneExp); m <- nrow(dat_test$GeneExp) if (is.null(gnames)) { rownames(dat_test$GeneExp) <- gnames <- paste0("Gene", 1:m) } ### In reality, we do not know the true response status of the testing ### data. However, we need to create a "Response" for these subjects ### in order to do a proper standardization for the covariates. These ### two sets of data are standardized according to two different ### "hypothetical" responses: Response==0 (Data_0), Response==1 ### (Data_1). ### this code use the data in csSAM simulation. The difference is that Demo only includes response #test_data = DataPrep_test(dat_test$GeneExp, dat_test$CellProp, Demo=NULL, train_response = dat$Demo, include.demo=TRUE, w) ### this code use the data in main simulation. The difference is that Demo includes other variable than response test_data = DataPrep_test(dat_test$GeneExp, dat_test$CellProp, Demo=dat_test$Demo[,1], train_Demo = dat_train$Demo[,1], train_response = dat_train$Demo[,2], include.demo=TRUE) ### here, we show an exmaple of test data set with include_demo = FALSE test_data_F = DataPrep_test(dat_test$GeneExp, dat_test$CellProp, Demo=dat_test$Demo[,1], train_Demo = dat_train$Demo[,1], train_response = dat_train$Demo[,2], include.demo=FALSE) Data_0 = test_data$Data_0 Data_1 = test_data$Data_1 Data_0_F = test_data_F$Data_0 Data_1_F = test_data_F$Data_1 # Due to the way we assign hypothetical Response values, all entries # in Data_0 has "Response==0", and all entries in Data_0 has # one. Therefore, we expect the standardized Response from these two # datasets: (a) are the same for all entries within one dataset; (b) # have exactly the opposite sign between these two datasets. round(head(Data_1$X[,"Res"]),4) round(head(Data_0$X[,"Res"]),4) ###when include.demo=FALSE, try the following code round(head(Data_1_F$X[,"Cell1.Res"]),4) round(head(Data_0_F$X[,"Cell1.Res"]),4)
Note that we must let the scoring function know that which variable in the testing data is the binary outcome to be classified (function DataPrep_test()
automatically renames this variable as Res
in the processed data, and renames all other covariates as Vari
, $i=1,2,\dots$), and which variables are the interactions between Res
and cell proportions.
By checking the first few rows of the processed data, we find that the $5$th column is the binary response variable, and the $9, 10, 11$th columns are the interaction terms between the response and cell proportions. This information will be used in score_func()
to produce several discriminant scores. the single_score
(one dimensional) and multi_score
(5,000-dimensional) are computed from all genes based on the theory of linear discriminant analysis; single_sparse_score
(one dimensional) and multi_sparse_score
are linear combinations of genes with significant interactions with the response variable.
Note that by selecting a more stringent significant level (via the sig.level
parameter in score_func()
) in DiscrScores2
, we obtain a sparse discriminant score based on much fewer genes.
#================================================================================# #=============== step 3: clacluate the scores for classification ================# #================================================================================# # we can overview the data to find some useful index for the score # calculation. In this example, the 'Res' variable is the 5th column; # and the 9,10,11th columns are the interaction terms needed for # building the discriminant score. kable(round(head(Data_0$X),2)) ### the predicted score using main data set DiscrScores = score_func(mod1=mod1, Data_0, Data_1, Response_interaction_index = c(9,10,11), Response_idx=5) ###this example shows how to use the score function with include_demo = FALSE DiscrScores_F = score_func(mod1=mod1_F, Data_0_F, Data_1_F, Response_interaction_index = c(7,8,9), Response_idx=NULL) ## we can use a more stringent significant level to select fewer ## features in the sparse discriminant score DiscrScores2 = score_func(mod1=mod1, Data_0, Data_1, Response_interaction_index = c(9,10,11), Response_idx=5, sig.level=0.005) ## check the number of selected sig. interaction genes nrow(DiscrScores$multi_sparse_score) #797 sig. genes nrow(DiscrScores2$multi_sparse_score) #only 121 sig. genes
Finally, we show some summary plots and tables for the classification results.
par(mfrow=c(3,1)) xl <- c(min(DiscrScores$single_score)-2, max(DiscrScores$single_score)+2) hist(DiscrScores$single_score[1:n2/2], xlim = xl, main = "Distribution of the overall score", xlab = "Overall score", col=rgb(0.95,0.95,0.95,0.5)) hist(DiscrScores$single_score[(n2/2+1):n2], col=rgb(0.05,0.05,0.05,0.5), add=TRUE) legend("topright", c("Res = 0", "Res = 1"), col = c(rgb(0.95,0.95,0.95,0.5), rgb(0.05,0.05,0.05,0.5)), lwd = 10, cex = 0.5) xl <- c(min(DiscrScores$single_sparse_score)-2, max(DiscrScores$single_sparse_score)+2) hist(DiscrScores$single_sparse_score[1:n2/2], xlim = xl, main = "Distribution of the sparse score 1", xlab = "Sparse score", col=rgb(0.95,0.95,0.95,0.5)) hist(DiscrScores$single_sparse_score[(n2/2+1):n2], col=rgb(0.05,0.05,0.05,0.5), add=TRUE) legend("topright", c("Res = 0", "Res = 1"), col = c(rgb(0.95,0.95,0.95,0.5), rgb(0.05,0.05,0.05,0.5)), lwd = 10, cex = 0.5) ## the second sparse score; with fewer features xl <- c(min(DiscrScores2$single_sparse_score)-2, max(DiscrScores2$single_sparse_score)+2) hist(DiscrScores2$single_sparse_score[1:n2/2], xlim = xl, main = "Distribution of the sparse score 2", xlab = "Sparse score", col=rgb(0.95,0.95,0.95,0.5)) hist(DiscrScores2$single_sparse_score[(n2/2+1):n2], col=rgb(0.05,0.05,0.05,0.5), add=TRUE) legend("topright", c("Res = 0", "Res = 1"), col = c(rgb(0.95,0.95,0.95,0.5), rgb(0.05,0.05,0.05,0.5)), lwd = 10, cex = 0.5) ## create a summary table single_score = c(mean(DiscrScores$single_score[1:n2/2]), sd(DiscrScores$single_score[1:n2/2]), mean(DiscrScores$single_score[(n2/2+1):n2]), sd(DiscrScores$single_score[(n2/2+1):n2])) single_sparse_score = c(mean(DiscrScores$single_sparse_score[1:n2/2]), sd(DiscrScores$single_sparse_score[1:n2/2]), mean(DiscrScores$single_sparse_score[(n2/2+1):n2]), sd(DiscrScores$single_sparse_score[(n2/2+1):n2])) single_sparse_score2 = c(mean(DiscrScores2$single_sparse_score[1:n2/2]), sd(DiscrScores2$single_sparse_score[1:n2/2]), mean(DiscrScores2$single_sparse_score[(n2/2+1):n2]), sd(DiscrScores2$single_sparse_score[(n2/2+1):n2])) summy_stat = rbind(single_score, single_sparse_score, single_sparse_score2) ### signal to noise ratio summy_stat = cbind(summy_stat, (summy_stat[,3]- summy_stat[,1])/(summy_stat[,2]+summy_stat[,4])/2/sqrt(4/n2)) colnames(summy_stat) = c("mean0", "sd0", "mean1", "sd1", "SNR") rownames(summy_stat) = c("Using all genes", paste0("Sparse score 1 (", nrow(DiscrScores$multi_sparse_score), " genes)"), paste0("Sparse score 2 (", nrow(DiscrScores2$multi_sparse_score), " genes)")) kable(round(summy_stat, 3))
From the plots and tables, we see that both the one-dimensional overall and sparse scores have discriminant powers to separate the two gender groups, and the sparse score seems to have better signal-to-noise ration (SNR) than the overall score. While sparse score 2 has slightly lower SNR compared with sparse score 1, it still has good discriminant power and can be a practical option (because it uses much fewer features) in real world applications.
These 1-dimensional scores, as well as their multivariate counterparts (entries multi_score
and multi_sparse_score
in objects DiscrScores
and DiscrScores2
), can be used in linear or nonlinear machine learning algorithms to build the actual classifiers for the binary outcome.
To be written later based on the UR/JCVI collaboration on the RPRC data.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.