PRECISION
is a package that allows users to conduct a re-sampling-based simulation study for molecular classification, using a unique pair of Agilent microRNA array datasets [^1], [^1b]. This simulation study is to illustrate the intricate interplay between data generation, data preprocessing, and classification error estimation for molecular classification. It offers insights into the desired practice of study design and data analysis for molecular classification, so that research sources can be optimized to generate high-quality molecular data and develop reproducible classifiers.
This document familiarizes users with the data and the functions available in the package. It also allows users to explore further on the topic using additional normalization and classification methods.
[^1]: Qin LX, Huang HC, Begg CB. Cautionary note on cross validation in molecular classification. Journal of Clinical Oncology. 2016, http://ascopubs.org/doi/abs/10.1200/JCO.2016.68.1031.
[^1b]: Huang HC, Qin LX. PRECISION: an R package for assessing study design and data normalization for molecular classification studies using a unique pair of microarray datasets (2016). GitHub repository, https://github.com/rebeccahch/precision.
# load pre-stored R objects for the vignette. load(file = "prestore.obj.Rdata")
When using PRECISION
, please cite the following papers:
Huang HC, Qin LX. PRECISION: an R package for assessing study design and data normalization for molecular classification studies using a unique pair of microarray datasets (2016). GitHub repository, https://github.com/rebeccahch/precision.
Qin LX, Huang HC, Begg CB. Cautionary note on cross validation in molecular classification. Journal of Clinical Oncology. 2016, http://ascopubs.org/doi/abs/10.1200/JCO.2016.68.1031.
The first step is to install PRECISION
from CRAN
by typing the following command in R console:
source(file = "../R/extract.precision.error.R") source(file = "../R/plot.precision.R")
if(!require(PRECISION)) install.packages("PRECISION") library(PRECISION)
Two datasets was generated for the same set of 96 endometrial and 96 ovarian tumor samples using different experimental handling designs. The first dataset was handled by one technician in one run and its arrays were randomly assigned to tumor samples with blocking (by array slide, uniformly-handled data). The second dataset was handled by two technicians in multiple batches and its arrays were assigned in the order of sample collection (without blocking or randomization), to mimic typical practice (nonuniformly-handled data). More details on the data collection can be found in Qin et al.[^2] The datasets are publicly available at GEO
here.
[^2]: Qin LX, Zhou Q, Bogomolniy F, Villafania L, Olvera N, Cavatore M, Satagopan JM, Begg CB, Levine DA. Blocking and randomization to improve molecular biomarker discovery. Clinical Cancer Research 2014, 20:3371-3378 here
To save time for package loading, this package includes two example datasets (probe level): uniformly-handled data uhdata.pl
and nonuniformly-handled data nuhdata.pl
, which are the 5% random subset of markers from the original data. Here is a glimpse of the datasets. The last character of the sample labels "E" or "V" indicates whether the sample is an endometrial or ovarian tumor. To load the data, simply use the commands below:
data("uhdata.pl") data("nuhdata.pl")
knitr::kable(uhdata.pl[1:15, 1:5], caption = "Uniformly-handled Data") knitr::kable(nuhdata.pl[1:15, 1:5], caption = "Nonuniformly-handled Data")
Initially, the number of replicates for each marker/probe varies between 10 and 40. We recommend truncating the number of replicates to a fixed number across probes to save data preprocessing time. We are confident in recommending this practice because we have previously observed that the variation among replicates for the same probe is small. Here, we provide per.unipbset.truncate()
to truncate the number of replicates in data. In our simulation study, we used only the first 10 replicates for each unique probe.
## Example of taking only the first 5 replicates for each unique probe uhdata.pl.p5 <- per.unipbset.truncate(data = uhdata.pl, num.per.unipbset = 5)
The Agilent platform includes control probes, but most of the analyses demonstrated in this document will use only non-control probe data. An exception is when batch effects are adjusted by RUV-4 (explained in a later section). The following code identifies and filters out the control probes from the data.
# negative control probes ctrl.genes <- unique(rownames(uhdata.pl))[grep("NC", unique(rownames(uhdata.pl)))] uhdata.pl.nc <- uhdata.pl[!rownames(uhdata.pl) %in% ctrl.genes, ] # nc for non-control probes
We used the uniformly-handled dataset to approximate the biological effect for each sample, and the difference between the two arrays (one from the uniformly-handled dataset and the other from the nonuniformly-handled dataset) for the same sample to approximate the handling effect for each array in the nonuniformly-handled dataset. This is done with estimate.biological.effect()
and estimate.handling.effect()
as follows:
biological.effect <- estimate.biological.effect(uhdata = uhdata.pl) handling.effect <- estimate.handling.effect(uhdata = uhdata.pl, nuhdata = nuhdata.pl) biological.effect.nc <- biological.effect[!rownames(biological.effect) %in% ctrl.genes, ] handling.effect.nc <- handling.effect[!rownames(handling.effect) %in% ctrl.genes, ]
The 192 samples were randomly split in a 2:1 ratio into a training set and a test set, balanced by tumor type. The 192 arrays were non-randomly split to a training set (n=128, the first 64 and last 64 arrays in the order of array processing). The first 80 arrays were profiled by one technician and the rest by the other technician.
set.seed(101) group.id <- substr(colnames(biological.effect.nc), 7, 7) # randomly split biological effect data into training and test set with equal number of endometrial and ovarian samples biological.effect.train.ind <- colnames(biological.effect.nc)[c(sample(which(group.id == "E"), size = 64), sample(which(group.id == "V"), size = 64))] biological.effect.test.ind <- colnames(biological.effect.nc)[!colnames(biological.effect.nc) %in% biological.effect.train.ind] # non-randomly split handling effect data into training and test set handling.effect.train.ind <- colnames(handling.effect.nc)[c(1:64, 129:192)] handling.effect.test.ind <- colnames(handling.effect.nc)[65:128] group.id.list <- list("all" = group.id, "tr" = substr(biological.effect.train.ind, 7, 7), "te" = substr(biological.effect.test.ind, 7, 7)) array.to.sample.assign <- list("all" = c(rep(c("E", "V"), each = 64), rep(c("V", "E"), each = 32)), "tr" = rep(c("E", "V"), each = 64), "te" = rep(c("V", "E"), each = 32))
Next, for the training set, data were simulated through "virtual re-hybridization" by first assigning arrays to sample groups using a confounding design or a balanced design, and then summing the biological effect for a sample and the handling effect for its assigned array. For the test set, we used onlyt the biological effect data from the uniformly-handled dataset. Three study design functions assign arrays to samples (confounding.design()
, stratification.design()
, blocking.design()
). Finally, rehybridize()
is provided to sum biological effects to handling effects, following array-to-sample assignments.
# complete confounding cc.ind <- confounding.design(seed = 1, num.array = 128, degree = "complete", rev.order = FALSE) # partial confounding reversed pc.rev.ind <- confounding.design(seed = 2, num.array = 128, degree = "partial", rev.order = TRUE) # stratification batch.id <- list(1:40, 41:64, (129:160) - 64, (161:192) - 64) str.ind <- stratification.design(seed = 3, num.array = 128, batch.id = batch.id) # blocking blk.ind <- blocking.design(seed = 4, num.array = 128)
Below is one of the possible array-to-sample-group splits for the study designs. Each vertical bar represents an array and the color of each stripe represents which sample group the array is assigned to.
par(mar = c(1, 1, 1, 1)) plot(1:128, pch = "", xlab = "", ylab = "", yaxt = "n", xaxt = "n", ylim = c(0.25, 2.5)) points(1:128, rep(2, 128), col = ifelse(cc.ind < 65, 2, 4), pch = "|") points(1:128, rep(1.5, 128), col = ifelse(pc.rev.ind < 65, 2, 4), pch = "|") points(1:128, rep(1, 128), col = ifelse(str.ind < 65, 2, 4), pch = "|") points(1:128, rep(0.5, 128), col = ifelse(blk.ind < 65, 2, 4), pch = "|") text(x = 64.5, y = 2.25, labels = "Complete confounding") text(x = 64.5, y = 1.75, labels = "Partial confounding reversed") text(x = 64.5, y = 1.25, labels = "Stratification") text(x = 64.5, y = 0.75, labels = "Blocking")
assign.ind <- confounding.design(seed = 1, num.array = 192, degree = "complete", rev.order = FALSE) group.id <- substr(colnames(biological.effect.nc), 7, 7) # re-hybridize sim.data.raw <- rehybridize(biological.effect = biological.effect.nc, handling.effect = handling.effect.nc, group.id = group.id, array.to.sample.assign = assign.ind) # re-hybridize + correct batch effects with SVA sim.data.sva <- rehybridize(biological.effect = biological.effect.nc, handling.effect = handling.effect.nc, group.id = group.id, array.to.sample.assign = assign.ind, isva = TRUE) # re-hybridize + correct batch effects with RUV-4 biological.effect.ctrl <- biological.effect[rownames(biological.effect) %in% ctrl.genes, ] handling.effect.ctrl <- handling.effect[rownames(handling.effect) %in% ctrl.genes, ] sim.data.ruv <- rehybridize(biological.effect = biological.effect.nc, handling.effect = handling.effect.nc, group.id = group.id, array.to.sample.assign = assign.ind, iruv = TRUE, biological.effect.ctrl = biological.effect.ctrl, handling.effect.ctrl = handling.effect.ctrl)
Batch effect correction is available in the re-hybridization step. The adjustment can be turned on by specifying icombat = TRUE
, isva = TRUE
or iruv = TRUE
for ComBat[^3a], sva[^3a], and RUV-4[^3b]; if none of these is selected, no batch adjustment will be performed. Note that when RUV-4 (iruv = TRUE
) is selected, control-probe data must be supplied.
[^3a]: Leek JT, Johnson WE, Parker HS, Fertig EJ, Jaffe AE and Storey JD (2016). Surrogate Variable Analysis. R package version 3.20.0. [^3b]: RUV: Johann Gagnon-Bartsch (2015). Detect and Remove Unwanted Variation using Negative Controls. R package version 0.9.6.
Data preprocessing in the study includes three steps:
log2 transformation;
normalization for training data and frozen normalization for test data (that is, mapping the empirical distribution of each individual test-set sample to the "frozen" empirical distribution of the normalized training data);
probe-replicate summarization using the median.
We provide med.norm()
, quant.norm()
, and vs.norm()
for median normalization, quantile normalization[^4], and variance stabilizing normalization[^5], respectively.
[^4]: Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185-193. Bolstad BM: preprocessCore: A collection of pre-processing functions. R package version 1.34.0. 2016. [^5]: Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002, 18 Suppl 1(suppl 1):S96-104.
set.seed(101) group.id <- substr(colnames(sim.data.raw), 7, 7) # randomly split data into training and test set with equal number of endometrial and ovarian samples train.ind <- colnames(biological.effect)[c(sample(which(group.id == "E"), size = 64), sample(which(group.id == "V"), size = 64))] train.dat <- sim.data.raw[, train.ind] test.dat <- sim.data.raw[, !colnames(sim.data.raw) %in% train.ind] # median normalize (normalize training data only) data.mn1 <- med.norm(train = train.dat) # median normalize (frozen normalize test data only) data.mn2 <- med.norm(test = test.dat, ref.dis = data.mn1$ref.dis) # quantile normalize (normalize training data + frozen normalize test data) data.qn <- quant.norm(train = train.dat, test = test.dat) # varaince stabilizing normalize (normalize training data + frozen normalize test data) data.vsn <- vs.norm(train = train.dat, test = test.dat)
To summarize replicate level data to unique probe level, based on within-probe medians, med.sum.pbset()
can be used:
uhdata.psl <- med.sum.pbset(data = uhdata.pl, num.per.unipbset = 10)
Regardless of classification methods, internal cross-validation can be used to select the tuning parameter(s) and external validation should be used to validate the performance. In our simulation study, we reported results from the use of two classification methods: one non-parametric method (prediction analysis for microarrays (PAM)[^6]) and one parametric method (THE least absolute shrinkage and selection operator (LASSO)[^7]). Examples of model-building and predicting functions in PRECISION
are pam.intcv()
and pam.predict()
for PAM and lasso.intcv()
and lasso.predict()
for LASSO.
[^6]: Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America 2002, 99(10):6567-6572. Hastie T, Tibshirani R., Narasimhan B, Chu G: pamr: Pam: prediction analysis for microarrays; 2014.
[^7]: Tibshirani R: Regression shrinkage and selection via the Lasso. J Roy Stat Soc B Met 1996, 58(1):267-288. Friedman J, Hastie T, Tibshirani R: Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw 2010, 33(1):1-22.
set.seed(101) # randomly split biological effect data into training and test set with equal number of endometrial and ovarian samples biological.effect.train.ind <- colnames(biological.effect.nc)[c(sample(which(group.id == "E"), size = 64), sample(which(group.id == "V"), size = 64))] biological.effect.test.ind <- colnames(biological.effect.nc)[!colnames(biological.effect.nc) %in% biological.effect.train.ind] biological.effect.nc.tr <- biological.effect.nc[, biological.effect.train.ind] biological.effect.nc.te <- biological.effect.nc[, biological.effect.test.ind] # build a PAM classifier pam.int <- pam.intcv(X = biological.effect.nc.tr, y = substr(colnames(biological.effect.nc.tr), 7, 7), kfold = 5, seed = 1) # predict with the PAM classifier pam.pred <- pam.predict(pam.intcv.model = pam.int, pred.obj = biological.effect.nc.te, pred.obj.group.id = substr(colnames(biological.effect.nc.te), 7, 7)) # cross-validation misclassification error rate pam.int$mc # external validation misclassification error rate pam.pred$mc # build a LASSO classifier lasso.int <- lasso.intcv(X = biological.effect.nc.tr, y = substr(colnames(biological.effect.nc.tr), 7, 7), kfold = 5, seed = 1, alp = 1) # predict with the LASSO classifier lasso.pred <- lasso.predict(lasso.intcv.model = lasso.int, pred.obj = biological.effect.nc.te, pred.obj.group.id = substr(colnames(biological.effect.nc.te), 7, 7)) # cross-validation misclassification error rate lasso.int$mc # external validation misclassification error rate lasso.pred$mc
Finally, the main functionality of PRECISION
is to provide users with an efficient way to reproduce our simulation studies reported in JCO in press. Two wrapper functions are available to do so: one for the analysis of the uniformly-handled data and the other for the analysis of data with confounding handling (i.e. the analysis of simulated data)
The analysis of the uniformly-handled data is based on N
splits of training-and-test-set for the uniformly-handled data; uni.handled.simulate()
is used. Users can manipulate the signal level of the estimated biological effects and choose the method for data normalization and classification.
uni.handled.results <- uni.handled.simulate(seed = 1, N = 3, biological.effect = biological.effect.nc, norm.list = c("NN", "QN"), class.list = c("PAM", "LASSO"))
knitr::kable(data.frame(uni.handled.results$error_store[2, ]), caption = "Analysis of uniformly-handled data")
The analysis of data with confounding handling is based on N
array-to-sample reassignments using precision.simulate()
. Users can control the signal level of the estimated biological effects, the confounding level of the estimated handling effects, array-to-sample study design, the normalization method, the batch adjustment method, and the classification methods.
set.seed(101) group.id <- substr(colnames(biological.effect.nc), 7, 7) # randomly split biological effect data into training and test set with equal number of endometrial and ovarian samples biological.effect.train.ind <- colnames(biological.effect.nc)[c(sample(which(group.id == "E"), size = 64), sample(which(group.id == "V"), size = 64))] biological.effect.test.ind <- colnames(biological.effect.nc)[!colnames(biological.effect.nc) %in% biological.effect.train.ind] biological.effect.train.test.split = list("tr" = biological.effect.train.ind, "te" = biological.effect.test.ind) # non-randomly split handling effect data into training and test set; technician effect as proxy handling.effect.train.test.split = list("tr" = c(1:64, 129:192), "te" = 65:128) biological.effect.nc.tr <- biological.effect.nc[, biological.effect.train.ind] biological.effect.nc.te <- biological.effect.nc[, biological.effect.test.ind] handling.effect.nc.tr <- handling.effect.nc[, c(1:64, 129:192)] handling.effect.nc.te <- handling.effect.nc[, 65:128] # simulation without batch adjustment precision.results <- precision.simulate(seed = 1, N = 3, biological.effect.tr = biological.effect.nc.tr, biological.effect.te = biological.effect.nc.te, handling.effect.tr = handling.effect.nc.tr, handling.effect.te = handling.effect.nc.te, group.id.tr = substr(colnames(biological.effect.nc.tr), 7, 7), group.id.te = substr(colnames(biological.effect.nc.te), 7, 7), design.list = c("PC-", "STR"), norm.list = c("NN", "QN"), class.list = c("PAM", "LASSO"), batch.id = list(1:40, 41:64, (129:160) - 64, (161:192) - 64)) # simulation with RUV-4 batch adjustment # (bringing in control-probe data) biological.effect.tr.ctrl <- biological.effect.ctrl[, biological.effect.train.test.split$tr] handling.effect.tr.ctrl <- handling.effect.ctrl[, handling.effect.train.test.split$tr] precision.ruv4.results <- precision.simulate(seed = 1, N = 3, biological.effect.tr = biological.effect.nc.tr, biological.effect.te = biological.effect.nc.te, handling.effect.tr = handling.effect.nc.tr, handling.effect.te = handling.effect.nc.te, group.id.tr = substr(colnames(biological.effect.nc.tr), 7, 7), group.id.te = substr(colnames(biological.effect.nc.te), 7, 7), design.list = c("PC-", "STR"), norm.list = c("NN", "QN"), class.list = c("PAM", "LASSO"), batch.id = list(1:40, 41:64, (129:160) - 64, (161:192) - 64), iruv = TRUE, biological.effect.tr.ctrl = biological.effect.tr.ctrl, handling.effect.tr.ctrl = handling.effect.tr.ctrl)
knitr::kable(data.frame(precision.results$error_store[2, "PC-"]), aligh = "l", caption = "Analysis of simulated data in presence of confounding handling") knitr::kable(data.frame(precision.ruv4.results$error_store[2, "PC-"]), aligh = "l", caption = "Analysis of simulated data in presence of confounding handling, adjusting batch effects with RUV-4")
Both uni.handled.simulate()
and precision.simulate()
allows user-defined function for normalization or classification method comparison purposes.
A user-defined normalization method function must allow at least these two input parameters: train
and test
(test
can be NULL
). The user is required to provide a short name for the normalization method and use the short name (in lower case) when defining the output. For example, if the short name is "RN", the function must return a list of two outputs: train.rn
and test.frn
for both normalized training and frozen-normalized test.
# an example of user-defined normalization method function: # rand.norm() randomly shuffles the sample IDs "rand.norm" <- function(train, test = NULL){ stopifnot(nrow(train) == nrow(test)) if(is.null(test)) { test.frn <- NULL } else{ train.rn <- apply(train, 2, sample) dimnames(train.rn) <- dimnames(train) test.frn <- apply(test, 2, sample) dimnames(test.frn) <- dimnames(test) } return(list("train.rn" = train.rn, "test.frn" = test.frn)) } uni.handled.results.rn <- uni.handled.simulate(seed = 1, N = 3, biological.effect = biological.effect.nc, norm.list = c("NN", "QN", "RN"), class.list = c("PAM", "LASSO"), norm.funcs = "rand.norm")
knitr::kable(data.frame(uni.handled.results.rn$error_store[2, ]), caption = "Analysis of uniformly-handled data (with RN)")
For each user-defined classification method, the user must provide two functions: one for building a model and one for predicting. The build function must allow input parameters kfold
, X
, y
, seed
, and must return a list of outputs at minimum including mc
, model
(naming must match). The predict function must allow three input parameters for the build model, object to be predicted, and group ID of the object to be predicted (note: order matters). Lastly, the predict function must return a list of outputs at minimum including mc
for misclassification error rate (naming must match). It is not required but is highly recommended to return predicted probabilities and selected features for future uses.
# example of user-defined classification method functions: "build.ridge" <- function(kfold = 5, X, y, seed, alp = 0){ ptm <- proc.time() set.seed(seed) cv.fit <- glmnet::cv.glmnet(x = data.matrix(t(X)), y = factor(y), family = "binomial", type.measure = "class", alpha = alp, nfold = kfold) mc <- cv.fit$cvm[which(cv.fit$lambda == cv.fit$lambda.1se)] coefs <- coef(cv.fit, s = "lambda.1se") time <- proc.time() - ptm return(list(mc=mc, time=time, model=cv.fit, cfs=coefs)) } # ridge has the same prediction function as LASSO uni.handled.results.ridge <- uni.handled.simulate(seed = 1, N = 3, biological.effect = biological.effect.nc, norm.list = c("NN", "RN"), class.list = c("LASSO", "Ridge"), norm.funcs = "rand.norm", class.funcs = "build.ridge", pred.funcs = "lasso.predict")
knitr::kable(data.frame(uni.handled.results.ridge$error_store[2, ]), caption = "Analysis of uniformly-handled data (with Ridge)")
A more flexible function precision.simulate.flex()
is built upon the aforementioned precision.simulate()
that can not only perform analyses reported in the JCO paper faster but can also allow two external validation methods and different array-to-sample study designs and data normalization methods in training and test sets.
The two external validation methods offered by this function are either using uniformly-handled test set
or using nonuniformly-handled test set
as the independent validation set. The first method is the method used in the JCO paper.
Furthermore, the function enables in-depth study of the interplay between array-to-sample study designs and data normalization methods. Along with the previously mentioned user-defined normalization method function, users can compare different combinations of normalization method on a training set and frozen normalization method on a test set as well. Below is an example set-up for the simulation analysis using precision.simulate.flex()
:
precision.results.flex <- precision.simulate.flex(seed = 1, N = 3, # additional external validation method valid.list = c("int", "ext.uh", "ext.sim.nuh"), biological.effect.tr = biological.effect.nc.tr, biological.effect.te = biological.effect.nc.te, handling.effect.tr = handling.effect.nc.tr, handling.effect.te = handling.effect.nc.te, group.id.tr = substr(colnames(biological.effect.nc.tr), 7, 7), group.id.te = substr(colnames(biological.effect.nc.te), 7, 7), # specify different study designs on the nonuniformly-handled data design.tr.list = c("PC-", "PC-", "STR", "STR"), design.te.list = c("PC-", "STR", "PC-", "STR"), # allow different data normalization on training and test set norm.tr.list = c("NN", "QN", "NN", "QN"), norm.te.list = c("NN", "NN", "QN", "QN"), class.list = c("PAM", "LASSO"), batch.id.tr = list(1:40, 41:64, (129:160) - 64, (161:192) - 64), batch.id.te = list(1:32, 33:64))
Even Analysis with uniformly-handled data can be done using precision.simulate.flex()
. A possible solution is as follows.
# create empty nonuniformly-handled training and test sets (of the same dimension as the uniformly-handled training and test sets) emp.handling.effect.nc.tr <- matrix(0, ncol = ncol(biological.effect.nc.tr), nrow = nrow(biological.effect.nc.tr)) emp.handling.effect.nc.te <- matrix(0, ncol = ncol(biological.effect.nc.te), nrow = nrow(biological.effect.nc.te)) # create a fucntion to randomly split the biological set into training and test? "biological.effect.train.test.split.func" <- function(seed){ set.seed(seed) biological.effect.train.ind <- colnames(biological.effect.nc)[c(sample(which( group.id == "E"), size = 64), sample(which(group.id == "V"), size = 64))] biological.effect.test.ind <- colnames(biological.effect.nc)[!colnames( biological.effect.nc) %in% biological.effect.train.ind] biological.effect.train.test.split = list("tr" = biological.effect.train.ind, "te" = biological.effect.test.ind) return(biological.effect.train.test.split) } # use sapply uni.handled.results.flex <- sapply(1:10, function(x){ # random split biological.effect.train.test.split <- biological.effect.train.test.split.func(x) # simulate precision.results.flex <- precision.simulate.flex(seed = 1, N = 1, biological.effect.tr = biological.effect.nc.tr, biological.effect.te = biological.effect.nc.te, handling.effect.tr = emp.handling.effect.nc.tr, handling.effect.te = emp.handling.effect.nc.te, group.id.tr = substr(colnames(biological.effect.nc.tr), 7, 7), group.id.te = substr(colnames(biological.effect.nc.te), 7, 7), design.tr.list = c("BLK", "CC+"), design.te.list = c("BLK", "CC+"), norm.tr.list = c("QN", "NN"), norm.te.list = c("NN", "QN"), class.list = c("PAM", "LASSO")) })
PRECISION
output R
objects resulted from precision.simulate()
can be tidied into a list of data frames by extract.precision.error()
. The list of data frames is easier to manipulate. A plotting function plot.precision()
is offered to have a quick visualization on the simulation results.
# for outputs from either precision.simulate() or precision.simulate.flex() precision.err.df <- extract.precision.error( precision.obj.name = "precision.results") # for PRECISION output from precision.simulate() plot.precision(data = precision.err.df, iflex = FALSE, mytitle = "PRECISION results (misclassification error rate)")
# for PRECISION output from precision.simulate.flex() plot.precision(data = precision.results.flex)
We provide limma.pbset
to perform differential expression analysis. For instance, when reducing a biological signal, we first identified differentially expressed markers and then shrunk the sample group differences of each of the markers.
uhdata.psl.nc <- uhdata.psl[!rownames(uhdata.psl) %in% ctrl.genes, ] # nc for non-control probes group.id <- substr(colnames(uhdata.psl.nc), 7, 7) group.id.level <- levels(as.factor(group.id)) limma.fit.uhdata<- limma.pbset(data = uhdata.psl.nc, group.id = group.id, group.id.level = group.id.level) table(limma.fit.uhdata$P.Value < 0.01, dnn = "DE genes")
tab <- data.frame(table(limma.fit.uhdata$P.Value < 0.01)) colnames(tab) <- c("DEA", "Count") knitr::kable(tab, rownames = NULL)
It may be necessary to manipulate the level of the biological signal (i.e., the mean group difference between sample group) in a molecular classification study. To manipulate the level of the signal, reduce.signal()
can be used.
# reduced signal by half group.id <- substr(colnames(biological.effect.nc), 7, 7) redhalf.biological.effect.nc <- reduce.signal(biological.effect = biological.effect.nc, group.id = group.id, group.id.level = c("E", "V"), reduce.multiplier = 1/2) # extract differential expressed genes biological.effect.nc.psl <- med.sum.pbset(biological.effect.nc) s.e.limma.fit <- limma.pbset(data = biological.effect.nc.psl, group.id = group.id, group.id.level = c("E", "V")) de.ind <- s.e.limma.fit$P.Value < 0.01 de.gene.name <- rownames(redhalf.biological.effect.nc)[which(de.ind)][2] nonde.gene.name <- rownames(redhalf.biological.effect.nc)[-which(de.ind)][2] biological.effect.nc.de <- biological.effect.nc[de.gene.name, ] redhalf.biological.effect.nc.de <- redhalf.biological.effect.nc[de.gene.name, ] biological.effect.nc.nonde <- biological.effect.nc[nonde.gene.name, ] redhalf.biological.effect.nc.nonde <- redhalf.biological.effect.nc[nonde.gene.name, ]
group.id <- substr(colnames(biological.effect.nc), 7, 7) # plot par(mfrow = c(2, 2), mar = c(2, 2, 2, 1)) # for DE genes, the difference has been shrunken in half... boxplot(biological.effect.nc.de ~ group.id == "E", main = "DE") boxplot(redhalf.biological.effect.nc.de ~ group.id == "E", main = "DE") # for non-DE genes, the difference has not been changed boxplot(biological.effect.nc.nonde ~ group.id == "E", main = "Non-DE") boxplot(redhalf.biological.effect.nc.nonde ~ group.id == "E", main = "Non-DE")
Again, it may be necessary to manipulate the confounding level of handling effects in the simulation study in order to evaluate normalization or classification methods in the presence of handling effect. We offer three amplification methods in amplify.handling.effect()
:
location shift - per selected array, shifting all probe expressions up or down by a constant.
scale change 1 - per selected array, expanding the variance of probes towards the array's inter-quartiles; probes that are outside of inter-quartiles remain unchanged.
scale change 2 - scaling all probe expressions in arrays by the power of constants that are different for each batch. In our training data, there are four batches.
handling.effect.nc.tr <- handling.effect.nc[, c(1:64, 129:192)] # location shift handling.effect.nc.tr.shift <- amplify.handling.effect(handling.effect = handling.effect.nc.tr, amplify.array.id = colnames(handling.effect.nc.tr)[1:64], amplify.level = 2, type = "shift") # scale change 1 handling.effect.nc.tr.scale1 <- amplify.handling.effect(handling.effect = handling.effect.nc.tr, amplify.array.id = colnames(handling.effect.nc.tr)[1:64], amplify.level = 2, type = "scale1") # scale change 2 amplify.array.id <- list(1:40, 41:64, (129:160) - 64, (161:192) - 64) for(i in 1:length(amplify.array.id)) amplify.array.id[[i]] <- colnames(handling.effect.nc.tr)[amplify.array.id[[i]]] amplify.level <- c(1.2, 1.3, 1/3, 2/3) handling.effect.nc.tr.scale2 <- amplify.handling.effect(handling.effect = handling.effect.nc.tr, amplify.array.id = amplify.array.id, amplify.level = amplify.level, type = "scale2") par(mfrow = c(2, 2), mar = c(2, 2, 2, 1)) rng <- range(handling.effect.nc.tr, handling.effect.nc.tr.shift, handling.effect.nc.tr.scale1, handling.effect.nc.tr.scale2) boxplot(handling.effect.nc.tr, main = "Original", ylim = rng, pch = 20, cex = 0.2, xaxt = "n") boxplot(handling.effect.nc.tr.shift, main = "Shift", ylim = rng, pch = 20, cex = 0.2, xaxt = "n") boxplot(handling.effect.nc.tr.scale1, main = "Scaling 1", ylim = rng, pch = 20, cex = 0.2, xaxt = "n") boxplot(handling.effect.nc.tr.scale2, main = "Scaling 2", ylim = rng, pch = 20, cex = 0.2, xaxt = "n")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.