Introduction

PRECISION is a package that allows users to conduct a re-sampling-based simulation study for molecular classification, using a unique pair of Agilent microRNA array datasets [^1], [^1b]. This simulation study is to illustrate the intricate interplay between data generation, data preprocessing, and classification error estimation for molecular classification. It offers insights into the desired practice of study design and data analysis for molecular classification, so that research sources can be optimized to generate high-quality molecular data and develop reproducible classifiers.

This document familiarizes users with the data and the functions available in the package. It also allows users to explore further on the topic using additional normalization and classification methods.

[^1]: Qin LX, Huang HC, Begg CB. Cautionary note on cross validation in molecular classification. Journal of Clinical Oncology. 2016, http://ascopubs.org/doi/abs/10.1200/JCO.2016.68.1031.

[^1b]: Huang HC, Qin LX. PRECISION: an R package for assessing study design and data normalization for molecular classification studies using a unique pair of microarray datasets (2016). GitHub repository, https://github.com/rebeccahch/precision.

# load pre-stored R objects for the vignette.
load(file = "prestore.obj.Rdata")

When using PRECISION, please cite the following papers:

Package Overview

The first step is to install PRECISION from CRAN by typing the following command in R console:

source(file = "../R/extract.precision.error.R")
source(file = "../R/plot.precision.R")
if(!require(PRECISION)) install.packages("PRECISION")
library(PRECISION)

Data Description

Two datasets was generated for the same set of 96 endometrial and 96 ovarian tumor samples using different experimental handling designs. The first dataset was handled by one technician in one run and its arrays were randomly assigned to tumor samples with blocking (by array slide, uniformly-handled data). The second dataset was handled by two technicians in multiple batches and its arrays were assigned in the order of sample collection (without blocking or randomization), to mimic typical practice (nonuniformly-handled data). More details on the data collection can be found in Qin et al.[^2] The datasets are publicly available at GEO here.

[^2]: Qin LX, Zhou Q, Bogomolniy F, Villafania L, Olvera N, Cavatore M, Satagopan JM, Begg CB, Levine DA. Blocking and randomization to improve molecular biomarker discovery. Clinical Cancer Research 2014, 20:3371-3378 here

To save time for package loading, this package includes two example datasets (probe level): uniformly-handled data uhdata.pl and nonuniformly-handled data nuhdata.pl, which are the 5% random subset of markers from the original data. Here is a glimpse of the datasets. The last character of the sample labels "E" or "V" indicates whether the sample is an endometrial or ovarian tumor. To load the data, simply use the commands below:

data("uhdata.pl")
data("nuhdata.pl")
knitr::kable(uhdata.pl[1:15, 1:5], caption = "Uniformly-handled Data")

knitr::kable(nuhdata.pl[1:15, 1:5], caption = "Nonuniformly-handled Data")

Probe Selection

Initially, the number of replicates for each marker/probe varies between 10 and 40. We recommend truncating the number of replicates to a fixed number across probes to save data preprocessing time. We are confident in recommending this practice because we have previously observed that the variation among replicates for the same probe is small. Here, we provide per.unipbset.truncate() to truncate the number of replicates in data. In our simulation study, we used only the first 10 replicates for each unique probe.

## Example of taking only the first 5 replicates for each unique probe
uhdata.pl.p5 <- per.unipbset.truncate(data = uhdata.pl,
                                      num.per.unipbset = 5)

The Agilent platform includes control probes, but most of the analyses demonstrated in this document will use only non-control probe data. An exception is when batch effects are adjusted by RUV-4 (explained in a later section). The following code identifies and filters out the control probes from the data.

# negative control probes
ctrl.genes <- unique(rownames(uhdata.pl))[grep("NC", unique(rownames(uhdata.pl)))]

uhdata.pl.nc <- uhdata.pl[!rownames(uhdata.pl) %in% ctrl.genes, ] # nc for non-control probes

Data Simulation

We used the uniformly-handled dataset to approximate the biological effect for each sample, and the difference between the two arrays (one from the uniformly-handled dataset and the other from the nonuniformly-handled dataset) for the same sample to approximate the handling effect for each array in the nonuniformly-handled dataset. This is done with estimate.biological.effect() and estimate.handling.effect() as follows:

biological.effect <- estimate.biological.effect(uhdata = uhdata.pl)

handling.effect <- estimate.handling.effect(uhdata = uhdata.pl, nuhdata = nuhdata.pl)

biological.effect.nc <- biological.effect[!rownames(biological.effect) %in% ctrl.genes, ]
handling.effect.nc <- handling.effect[!rownames(handling.effect) %in% ctrl.genes, ]

The 192 samples were randomly split in a 2:1 ratio into a training set and a test set, balanced by tumor type. The 192 arrays were non-randomly split to a training set (n=128, the first 64 and last 64 arrays in the order of array processing). The first 80 arrays were profiled by one technician and the rest by the other technician.

set.seed(101)
group.id <- substr(colnames(biological.effect.nc), 7, 7)

# randomly split biological effect data into training and test set with equal number of endometrial and ovarian samples
biological.effect.train.ind <- colnames(biological.effect.nc)[c(sample(which(group.id == "E"), size = 64), sample(which(group.id == "V"), size = 64))] 
biological.effect.test.ind <- colnames(biological.effect.nc)[!colnames(biological.effect.nc) %in% biological.effect.train.ind]

# non-randomly split handling effect data into training and test set
handling.effect.train.ind <- colnames(handling.effect.nc)[c(1:64, 129:192)]
handling.effect.test.ind <- colnames(handling.effect.nc)[65:128]

group.id.list <- list("all" = group.id,
                 "tr" = substr(biological.effect.train.ind, 7, 7),
                 "te" = substr(biological.effect.test.ind, 7, 7))
array.to.sample.assign <- list("all" = c(rep(c("E", "V"), each = 64),
                                    rep(c("V", "E"), each = 32)),
                          "tr" = rep(c("E", "V"), each = 64),
                          "te" = rep(c("V", "E"), each = 32))

Next, for the training set, data were simulated through "virtual re-hybridization" by first assigning arrays to sample groups using a confounding design or a balanced design, and then summing the biological effect for a sample and the handling effect for its assigned array. For the test set, we used onlyt the biological effect data from the uniformly-handled dataset. Three study design functions assign arrays to samples (confounding.design(), stratification.design(), blocking.design()). Finally, rehybridize() is provided to sum biological effects to handling effects, following array-to-sample assignments.

# complete confounding
cc.ind <- confounding.design(seed = 1, num.array = 128, 
                               degree = "complete", rev.order = FALSE)

# partial confounding reversed
pc.rev.ind <- confounding.design(seed = 2, num.array = 128, 
                               degree = "partial", rev.order = TRUE)

# stratification
batch.id <- list(1:40, 41:64, (129:160) - 64, (161:192) - 64)
str.ind <- stratification.design(seed = 3, num.array = 128,
                                       batch.id = batch.id)

# blocking
blk.ind <- blocking.design(seed = 4, num.array = 128)

Below is one of the possible array-to-sample-group splits for the study designs. Each vertical bar represents an array and the color of each stripe represents which sample group the array is assigned to.

par(mar = c(1, 1, 1, 1))
plot(1:128, pch = "",
     xlab = "", ylab = "", yaxt = "n", xaxt = "n", 
     ylim = c(0.25, 2.5))

points(1:128, rep(2, 128), 
     col = ifelse(cc.ind < 65, 2, 4), pch = "|")

points(1:128, rep(1.5, 128), 
     col = ifelse(pc.rev.ind < 65, 2, 4), pch = "|")

points(1:128, rep(1, 128), 
     col = ifelse(str.ind < 65, 2, 4), pch = "|")

points(1:128, rep(0.5, 128), 
     col = ifelse(blk.ind < 65, 2, 4), pch = "|")

text(x = 64.5, y = 2.25, labels = "Complete confounding")
text(x = 64.5, y = 1.75, labels = "Partial confounding reversed")
text(x = 64.5, y = 1.25, labels = "Stratification")
text(x = 64.5, y = 0.75, labels = "Blocking")
assign.ind <- confounding.design(seed = 1, num.array = 192, 
                               degree = "complete", rev.order = FALSE)
group.id <- substr(colnames(biological.effect.nc), 7, 7)

# re-hybridize
sim.data.raw <- rehybridize(biological.effect = biological.effect.nc,
                            handling.effect = handling.effect.nc,
                            group.id = group.id,
                            array.to.sample.assign = assign.ind)

# re-hybridize + correct batch effects with SVA
sim.data.sva <- rehybridize(biological.effect = biological.effect.nc,
                            handling.effect = handling.effect.nc,
                            group.id = group.id,
                            array.to.sample.assign = assign.ind, 
                            isva = TRUE)

# re-hybridize + correct batch effects with RUV-4
biological.effect.ctrl <- biological.effect[rownames(biological.effect) %in% ctrl.genes, ]
handling.effect.ctrl <- handling.effect[rownames(handling.effect) %in% ctrl.genes, ]

sim.data.ruv <- rehybridize(biological.effect = biological.effect.nc,
                            handling.effect = handling.effect.nc,
                            group.id = group.id,
                            array.to.sample.assign = assign.ind, 
                            iruv = TRUE,
                            biological.effect.ctrl = biological.effect.ctrl,
                            handling.effect.ctrl = handling.effect.ctrl)

Batch effect correction is available in the re-hybridization step. The adjustment can be turned on by specifying icombat = TRUE, isva = TRUE or iruv = TRUE for ComBat[^3a], sva[^3a], and RUV-4[^3b]; if none of these is selected, no batch adjustment will be performed. Note that when RUV-4 (iruv = TRUE) is selected, control-probe data must be supplied.

[^3a]: Leek JT, Johnson WE, Parker HS, Fertig EJ, Jaffe AE and Storey JD (2016). Surrogate Variable Analysis. R package version 3.20.0. [^3b]: RUV: Johann Gagnon-Bartsch (2015). Detect and Remove Unwanted Variation using Negative Controls. R package version 0.9.6.

Data Preprocessing

Data preprocessing in the study includes three steps:

  1. log2 transformation;

  2. normalization for training data and frozen normalization for test data (that is, mapping the empirical distribution of each individual test-set sample to the "frozen" empirical distribution of the normalized training data);

  3. probe-replicate summarization using the median.

We provide med.norm(), quant.norm(), and vs.norm() for median normalization, quantile normalization[^4], and variance stabilizing normalization[^5], respectively.

[^4]: Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185-193. Bolstad BM: preprocessCore: A collection of pre-processing functions. R package version 1.34.0. 2016. [^5]: Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002, 18 Suppl 1(suppl 1):S96-104.

set.seed(101)
group.id <- substr(colnames(sim.data.raw), 7, 7)

# randomly split data into training and test set with equal number of endometrial and ovarian samples
train.ind <- colnames(biological.effect)[c(sample(which(group.id == "E"), size = 64), sample(which(group.id == "V"), size = 64))]

train.dat <- sim.data.raw[, train.ind]
test.dat <- sim.data.raw[, !colnames(sim.data.raw) %in% train.ind]


# median normalize (normalize training data only)
data.mn1 <- med.norm(train = train.dat)

# median normalize (frozen normalize test data only)
data.mn2 <- med.norm(test = test.dat, ref.dis = data.mn1$ref.dis)

# quantile normalize (normalize training data + frozen normalize test data)
data.qn <- quant.norm(train = train.dat, test = test.dat)

# varaince stabilizing normalize (normalize training data + frozen normalize test data)
data.vsn <- vs.norm(train = train.dat, test = test.dat)

To summarize replicate level data to unique probe level, based on within-probe medians, med.sum.pbset() can be used:

uhdata.psl <- med.sum.pbset(data = uhdata.pl, num.per.unipbset = 10)

Classifier Development and Error Estimation

Regardless of classification methods, internal cross-validation can be used to select the tuning parameter(s) and external validation should be used to validate the performance. In our simulation study, we reported results from the use of two classification methods: one non-parametric method (prediction analysis for microarrays (PAM)[^6]) and one parametric method (THE least absolute shrinkage and selection operator (LASSO)[^7]). Examples of model-building and predicting functions in PRECISION are pam.intcv() and pam.predict() for PAM and lasso.intcv() and lasso.predict() for LASSO.

[^6]: Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America 2002, 99(10):6567-6572. Hastie T, Tibshirani R., Narasimhan B, Chu G: pamr: Pam: prediction analysis for microarrays; 2014.

[^7]: Tibshirani R: Regression shrinkage and selection via the Lasso. J Roy Stat Soc B Met 1996, 58(1):267-288. Friedman J, Hastie T, Tibshirani R: Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw 2010, 33(1):1-22.

set.seed(101)
# randomly split biological effect data into training and test set with equal number of endometrial and ovarian samples
biological.effect.train.ind <- colnames(biological.effect.nc)[c(sample(which(group.id == "E"), size = 64), sample(which(group.id == "V"), size = 64))] 
biological.effect.test.ind <- colnames(biological.effect.nc)[!colnames(biological.effect.nc) %in% biological.effect.train.ind]

biological.effect.nc.tr <- biological.effect.nc[, biological.effect.train.ind]
biological.effect.nc.te <- biological.effect.nc[, biological.effect.test.ind]

# build a PAM classifier
pam.int <- pam.intcv(X = biological.effect.nc.tr,
                     y = substr(colnames(biological.effect.nc.tr), 7, 7),
                     kfold = 5, seed = 1)

# predict with the PAM classifier
pam.pred <- pam.predict(pam.intcv.model = pam.int, 
                        pred.obj = biological.effect.nc.te, 
                        pred.obj.group.id = substr(colnames(biological.effect.nc.te), 7, 7))

# cross-validation misclassification error rate
pam.int$mc
# external validation misclassification error rate
pam.pred$mc

# build a LASSO classifier
lasso.int <- lasso.intcv(X = biological.effect.nc.tr,
                     y = substr(colnames(biological.effect.nc.tr), 7, 7),
                     kfold = 5, seed = 1, alp = 1)

# predict with the LASSO classifier
lasso.pred <- lasso.predict(lasso.intcv.model = lasso.int, 
                        pred.obj = biological.effect.nc.te, 
                        pred.obj.group.id = substr(colnames(biological.effect.nc.te), 7, 7))

# cross-validation misclassification error rate
lasso.int$mc
# external validation misclassification error rate
lasso.pred$mc

Wrapper Functions for Running Simulations

Finally, the main functionality of PRECISION is to provide users with an efficient way to reproduce our simulation studies reported in JCO in press. Two wrapper functions are available to do so: one for the analysis of the uniformly-handled data and the other for the analysis of data with confounding handling (i.e. the analysis of simulated data)

Analysis of the uniformly-handled data

The analysis of the uniformly-handled data is based on N splits of training-and-test-set for the uniformly-handled data; uni.handled.simulate() is used. Users can manipulate the signal level of the estimated biological effects and choose the method for data normalization and classification.

uni.handled.results <- uni.handled.simulate(seed = 1, N = 3,
                                            biological.effect = biological.effect.nc,
                                            norm.list = c("NN", "QN"),
                                            class.list = c("PAM", "LASSO"))
knitr::kable(data.frame(uni.handled.results$error_store[2, ]), 
             caption = "Analysis of uniformly-handled data")

Analysis with confounding handling

The analysis of data with confounding handling is based on N array-to-sample reassignments using precision.simulate(). Users can control the signal level of the estimated biological effects, the confounding level of the estimated handling effects, array-to-sample study design, the normalization method, the batch adjustment method, and the classification methods.

set.seed(101)
group.id <- substr(colnames(biological.effect.nc), 7, 7)

# randomly split biological effect data into training and test set with equal number of endometrial and ovarian samples
biological.effect.train.ind <- colnames(biological.effect.nc)[c(sample(which(group.id == "E"), size = 64), sample(which(group.id == "V"), size = 64))] 
biological.effect.test.ind <- colnames(biological.effect.nc)[!colnames(biological.effect.nc) %in% biological.effect.train.ind]

biological.effect.train.test.split =
  list("tr" = biological.effect.train.ind,
       "te" = biological.effect.test.ind)

# non-randomly split handling effect data into training and test set; technician effect as proxy
handling.effect.train.test.split =
  list("tr" = c(1:64, 129:192),
       "te" = 65:128)

biological.effect.nc.tr <- biological.effect.nc[, biological.effect.train.ind]
biological.effect.nc.te <- biological.effect.nc[, biological.effect.test.ind]
handling.effect.nc.tr <- handling.effect.nc[, c(1:64, 129:192)]
handling.effect.nc.te <- handling.effect.nc[, 65:128]

# simulation without batch adjustment
precision.results <- precision.simulate(seed = 1, N = 3,
                          biological.effect.tr = biological.effect.nc.tr,    
                          biological.effect.te = biological.effect.nc.te,
                          handling.effect.tr = handling.effect.nc.tr,
                          handling.effect.te = handling.effect.nc.te,
                          group.id.tr = substr(colnames(biological.effect.nc.tr), 7, 7),
                          group.id.te = substr(colnames(biological.effect.nc.te), 7, 7),
                          design.list = c("PC-", "STR"),
                          norm.list = c("NN", "QN"),
                          class.list = c("PAM", "LASSO"),
                          batch.id = list(1:40, 41:64, (129:160) - 64, (161:192) - 64))

# simulation with RUV-4 batch adjustment
#   (bringing in control-probe data)
biological.effect.tr.ctrl <- biological.effect.ctrl[, biological.effect.train.test.split$tr]
handling.effect.tr.ctrl <- handling.effect.ctrl[, handling.effect.train.test.split$tr]

precision.ruv4.results <- precision.simulate(seed = 1, N = 3,
                              biological.effect.tr = biological.effect.nc.tr,    
                              biological.effect.te = biological.effect.nc.te,
                              handling.effect.tr = handling.effect.nc.tr,
                              handling.effect.te = handling.effect.nc.te,
                              group.id.tr = substr(colnames(biological.effect.nc.tr), 7, 7),
                              group.id.te = substr(colnames(biological.effect.nc.te), 7, 7),
                              design.list = c("PC-", "STR"),
                              norm.list = c("NN", "QN"),
                              class.list = c("PAM", "LASSO"),
                              batch.id = list(1:40, 41:64, (129:160) - 64, (161:192) - 64), 
                              iruv = TRUE,
                              biological.effect.tr.ctrl = biological.effect.tr.ctrl, 
                              handling.effect.tr.ctrl = handling.effect.tr.ctrl)
knitr::kable(data.frame(precision.results$error_store[2, "PC-"]), 
             aligh = "l",
             caption = "Analysis of simulated data in presence of confounding handling")

knitr::kable(data.frame(precision.ruv4.results$error_store[2, "PC-"]), 
             aligh = "l",
             caption = "Analysis of simulated data in presence of confounding handling, adjusting batch effects with RUV-4")

Additional Simulation Scenarios

Both uni.handled.simulate() and precision.simulate() allows user-defined function for normalization or classification method comparison purposes.

A user-defined normalization method function must allow at least these two input parameters: train and test (test can be NULL). The user is required to provide a short name for the normalization method and use the short name (in lower case) when defining the output. For example, if the short name is "RN", the function must return a list of two outputs: train.rn and test.frn for both normalized training and frozen-normalized test.

# an example of user-defined normalization method function:

# rand.norm() randomly shuffles the sample IDs
"rand.norm" <- function(train, test = NULL){
  stopifnot(nrow(train) == nrow(test))

  if(is.null(test)) {
    test.frn <- NULL
  } else{
    train.rn <- apply(train, 2, sample)
    dimnames(train.rn) <- dimnames(train)

    test.frn <- apply(test, 2, sample)
    dimnames(test.frn) <- dimnames(test)
  }

  return(list("train.rn" = train.rn,
              "test.frn" = test.frn))
}

uni.handled.results.rn <- uni.handled.simulate(seed = 1, N = 3,
                                       biological.effect = biological.effect.nc,
                                       norm.list = c("NN", "QN", "RN"),
                                       class.list = c("PAM", "LASSO"), 
                                       norm.funcs = "rand.norm")
knitr::kable(data.frame(uni.handled.results.rn$error_store[2, ]), 
             caption = "Analysis of uniformly-handled data (with RN)")

For each user-defined classification method, the user must provide two functions: one for building a model and one for predicting. The build function must allow input parameters kfold, X, y, seed, and must return a list of outputs at minimum including mc, model (naming must match). The predict function must allow three input parameters for the build model, object to be predicted, and group ID of the object to be predicted (note: order matters). Lastly, the predict function must return a list of outputs at minimum including mc for misclassification error rate (naming must match). It is not required but is highly recommended to return predicted probabilities and selected features for future uses.

# example of user-defined classification method functions:
"build.ridge" <- function(kfold = 5, X, y, seed, alp = 0){
  ptm <- proc.time()
  set.seed(seed)

  cv.fit <- glmnet::cv.glmnet(x = data.matrix(t(X)), y = factor(y),
                              family = "binomial", type.measure = "class", 
                              alpha = alp, nfold = kfold)

  mc <- cv.fit$cvm[which(cv.fit$lambda == cv.fit$lambda.1se)]

  coefs <- coef(cv.fit, s = "lambda.1se")
  time <- proc.time() - ptm
  return(list(mc=mc, time=time, model=cv.fit, cfs=coefs))
}

# ridge has the same prediction function as LASSO

uni.handled.results.ridge <- uni.handled.simulate(seed = 1, N = 3,
                                   biological.effect = biological.effect.nc,
                                   norm.list = c("NN", "RN"),
                                   class.list = c("LASSO", "Ridge"), 
                                   norm.funcs = "rand.norm",
                                   class.funcs = "build.ridge", 
                                   pred.funcs = "lasso.predict")
knitr::kable(data.frame(uni.handled.results.ridge$error_store[2, ]), 
             caption = "Analysis of uniformly-handled data (with Ridge)")

A More Flexible Wrapper Function for Running Simulations

A more flexible function precision.simulate.flex() is built upon the aforementioned precision.simulate() that can not only perform analyses reported in the JCO paper faster but can also allow two external validation methods and different array-to-sample study designs and data normalization methods in training and test sets.

The two external validation methods offered by this function are either using uniformly-handled test set or using nonuniformly-handled test set as the independent validation set. The first method is the method used in the JCO paper.

Furthermore, the function enables in-depth study of the interplay between array-to-sample study designs and data normalization methods. Along with the previously mentioned user-defined normalization method function, users can compare different combinations of normalization method on a training set and frozen normalization method on a test set as well. Below is an example set-up for the simulation analysis using precision.simulate.flex():

precision.results.flex <- precision.simulate.flex(seed = 1, N = 3, 
  # additional external validation method
  valid.list = c("int", "ext.uh", "ext.sim.nuh"), 
  biological.effect.tr = biological.effect.nc.tr,
  biological.effect.te = biological.effect.nc.te,
  handling.effect.tr = handling.effect.nc.tr,
  handling.effect.te = handling.effect.nc.te,
  group.id.tr = substr(colnames(biological.effect.nc.tr), 7, 7),
  group.id.te = substr(colnames(biological.effect.nc.te), 7, 7),
  # specify different study designs on the nonuniformly-handled data
  design.tr.list = c("PC-", "PC-", "STR", "STR"), 
  design.te.list = c("PC-", "STR", "PC-", "STR"), 
  # allow different data normalization on training and test set
  norm.tr.list = c("NN", "QN", "NN", "QN"), 
  norm.te.list = c("NN", "NN", "QN", "QN"),
  class.list = c("PAM", "LASSO"),
  batch.id.tr = list(1:40,
    41:64,
    (129:160) - 64,
    (161:192) - 64),
  batch.id.te = list(1:32, 33:64))

Even Analysis with uniformly-handled data can be done using precision.simulate.flex(). A possible solution is as follows.

# create empty nonuniformly-handled training and test sets (of the same dimension as the uniformly-handled training and test sets)
emp.handling.effect.nc.tr <- matrix(0, 
                                    ncol = ncol(biological.effect.nc.tr),
                                    nrow = nrow(biological.effect.nc.tr))

emp.handling.effect.nc.te <- matrix(0, 
                                    ncol = ncol(biological.effect.nc.te),
                                    nrow = nrow(biological.effect.nc.te))


# create a fucntion to randomly split the biological set into training and test?
"biological.effect.train.test.split.func" <- function(seed){
  set.seed(seed)

  biological.effect.train.ind <- colnames(biological.effect.nc)[c(sample(which(
    group.id == "E"), size = 64), sample(which(group.id == "V"), size = 64))]
  biological.effect.test.ind <- colnames(biological.effect.nc)[!colnames(
    biological.effect.nc) %in% biological.effect.train.ind]
  biological.effect.train.test.split =
    list("tr" = biological.effect.train.ind,
         "te" = biological.effect.test.ind)

  return(biological.effect.train.test.split)
}

# use sapply
uni.handled.results.flex <- sapply(1:10, function(x){

  # random split
  biological.effect.train.test.split <- biological.effect.train.test.split.func(x)

  # simulate
  precision.results.flex <- precision.simulate.flex(seed = 1, N = 1,
    biological.effect.tr = biological.effect.nc.tr,
    biological.effect.te = biological.effect.nc.te,
    handling.effect.tr = emp.handling.effect.nc.tr,
    handling.effect.te = emp.handling.effect.nc.te,
    group.id.tr = substr(colnames(biological.effect.nc.tr), 7, 7),
    group.id.te = substr(colnames(biological.effect.nc.te), 7, 7),
    design.tr.list = c("BLK", "CC+"),
    design.te.list = c("BLK", "CC+"),
    norm.tr.list = c("QN", "NN"),
    norm.te.list = c("NN", "QN"),
    class.list = c("PAM", "LASSO"))
})

Simulation Result Manipulation and Visualization

PRECISION output R objects resulted from precision.simulate() can be tidied into a list of data frames by extract.precision.error(). The list of data frames is easier to manipulate. A plotting function plot.precision() is offered to have a quick visualization on the simulation results.

# for outputs from either precision.simulate() or precision.simulate.flex()
precision.err.df <- extract.precision.error(
  precision.obj.name = "precision.results")

# for PRECISION output from precision.simulate()
plot.precision(data = precision.err.df, iflex = FALSE, 
               mytitle = "PRECISION results (misclassification error rate)")
# for PRECISION output from precision.simulate.flex()
plot.precision(data = precision.results.flex)

Differential expression analysis

We provide limma.pbset to perform differential expression analysis. For instance, when reducing a biological signal, we first identified differentially expressed markers and then shrunk the sample group differences of each of the markers.

uhdata.psl.nc <- uhdata.psl[!rownames(uhdata.psl) %in% ctrl.genes, ] # nc for non-control probes

group.id <- substr(colnames(uhdata.psl.nc), 7, 7)
group.id.level <- levels(as.factor(group.id))
limma.fit.uhdata<- limma.pbset(data = uhdata.psl.nc, 
                               group.id = group.id,
                               group.id.level = group.id.level)
table(limma.fit.uhdata$P.Value < 0.01, dnn = "DE genes")
tab <- data.frame(table(limma.fit.uhdata$P.Value < 0.01))
colnames(tab) <- c("DEA", "Count")
knitr::kable(tab, rownames = NULL)

Biological signal reduction

It may be necessary to manipulate the level of the biological signal (i.e., the mean group difference between sample group) in a molecular classification study. To manipulate the level of the signal, reduce.signal() can be used.

# reduced signal by half
group.id <- substr(colnames(biological.effect.nc), 7, 7)
redhalf.biological.effect.nc <- reduce.signal(biological.effect = biological.effect.nc,
                            group.id = group.id,
                            group.id.level = c("E", "V"),
                            reduce.multiplier = 1/2)


# extract differential expressed genes
biological.effect.nc.psl <- med.sum.pbset(biological.effect.nc)
s.e.limma.fit <- limma.pbset(data = biological.effect.nc.psl,
                         group.id = group.id,
                         group.id.level = c("E", "V"))
de.ind <- s.e.limma.fit$P.Value < 0.01

de.gene.name <- rownames(redhalf.biological.effect.nc)[which(de.ind)][2]
nonde.gene.name <- rownames(redhalf.biological.effect.nc)[-which(de.ind)][2]

biological.effect.nc.de <- biological.effect.nc[de.gene.name, ]
redhalf.biological.effect.nc.de <- redhalf.biological.effect.nc[de.gene.name, ]

biological.effect.nc.nonde <- biological.effect.nc[nonde.gene.name, ]
redhalf.biological.effect.nc.nonde <- redhalf.biological.effect.nc[nonde.gene.name, ]
group.id <- substr(colnames(biological.effect.nc), 7, 7)

# plot
par(mfrow = c(2, 2), mar = c(2, 2, 2, 1))
# for DE genes, the difference has been shrunken in half...
boxplot(biological.effect.nc.de ~ group.id == "E", main = "DE")
boxplot(redhalf.biological.effect.nc.de ~ group.id == "E", main = "DE")
# for non-DE genes, the difference has not been changed
boxplot(biological.effect.nc.nonde ~ group.id == "E", main = "Non-DE")
boxplot(redhalf.biological.effect.nc.nonde ~ group.id == "E", main = "Non-DE")

Handling effect amplification

Again, it may be necessary to manipulate the confounding level of handling effects in the simulation study in order to evaluate normalization or classification methods in the presence of handling effect. We offer three amplification methods in amplify.handling.effect():

  1. location shift - per selected array, shifting all probe expressions up or down by a constant.

  2. scale change 1 - per selected array, expanding the variance of probes towards the array's inter-quartiles; probes that are outside of inter-quartiles remain unchanged.

  3. scale change 2 - scaling all probe expressions in arrays by the power of constants that are different for each batch. In our training data, there are four batches.

handling.effect.nc.tr <- handling.effect.nc[, c(1:64, 129:192)]

# location shift
handling.effect.nc.tr.shift <- amplify.handling.effect(handling.effect = handling.effect.nc.tr,
       amplify.array.id = colnames(handling.effect.nc.tr)[1:64],
       amplify.level = 2, type = "shift")

# scale change 1
handling.effect.nc.tr.scale1 <- amplify.handling.effect(handling.effect = handling.effect.nc.tr,
       amplify.array.id = colnames(handling.effect.nc.tr)[1:64],
       amplify.level = 2, type = "scale1")

# scale change 2
amplify.array.id <- list(1:40, 41:64, (129:160) - 64, (161:192) - 64)
for(i in 1:length(amplify.array.id)) 
  amplify.array.id[[i]] <- colnames(handling.effect.nc.tr)[amplify.array.id[[i]]]
amplify.level <- c(1.2, 1.3, 1/3, 2/3)

handling.effect.nc.tr.scale2 <- amplify.handling.effect(handling.effect = handling.effect.nc.tr,
       amplify.array.id = amplify.array.id,
       amplify.level = amplify.level,
       type = "scale2")

par(mfrow = c(2, 2), mar = c(2, 2, 2, 1))
rng <- range(handling.effect.nc.tr, handling.effect.nc.tr.shift, 
             handling.effect.nc.tr.scale1, handling.effect.nc.tr.scale2)
boxplot(handling.effect.nc.tr, main = "Original",
        ylim = rng, pch = 20, cex = 0.2, xaxt = "n")
boxplot(handling.effect.nc.tr.shift, main = "Shift",
        ylim = rng, pch = 20, cex = 0.2, xaxt = "n")
boxplot(handling.effect.nc.tr.scale1, main = "Scaling 1",
        ylim = rng, pch = 20, cex = 0.2, xaxt = "n")
boxplot(handling.effect.nc.tr.scale2, main = "Scaling 2",
        ylim = rng, pch = 20, cex = 0.2, xaxt = "n")


LXQin/precision documentation built on May 11, 2019, 6:24 p.m.