Description Details Author(s) References Examples
High-throughput experimental data are accumulating exponentially in public databases. However, mining valid scientific discoveries from these abundant resources is hampered by technical artifacts and inherent biological heterogeneity. The former are usually termed "batch effects," and the latter is often modelled by "subtypes." The R package BUScorrect fits a Bayesian hierarchical model, the Batch-effects-correction-with-Unknown-Subtypes model (BUS), to correct batch effects in the presence of unknown subtypes. BUS is capable of (a) correcting batch effects explicitly, (b) grouping samples that share similar characteristics into subtypes, (c) identifying features that distinguish subtypes, and (d) enjoying a linear-order computation complexity.
BUS is capable of (a) correcting batch effects explicitly, (b) grouping samples that share similar characteristics into subtypes, (c) identifying features that distinguish subtypes, (d) allowing the number of subtypes to vary from batch to batch, (e) integrating batches from different platforms, and (f) enjoying a linear-order computation complexity.
Xiangyu Luo <xyluo1991@gmail.com>, Yingying Wei <yweicuhk@gmail.com>
Maintainer: Xiangyu Luo <xyluo1991@gmail.com>
Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 | ###############################################################################
#Generate Simulation Data
###############################################################################
rm(list = ls(all = TRUE))
set.seed(123)
B <- 3
#total number of batches
K <- 3
#total number of subtypes
G <- 3000
#total number of genes
pi <- matrix(NA, B, K)
# pi[b,k] stands for the proportion of the kth subtype in the bth batch
pi[1, ] <- c(0.2, 0.3, 0.5)
pi[2, ] <- c(0.4, 0.2, 0.4)
pi[3, ] <- c(0.3, 0.4, 0.3)
#total number of samples in each bacth.
n_vec <- rep(NA, B)
#n_vec[b] represents the total number of samples in batch b.
n_vec <- c(100, 110, 120)
#Data list
example_Data <- list()
#baseline expression level
alpha <- rep(2, G)
#subtype effect
mu <- matrix(NA, G, K)
#subtype effect, mu[g,k] stands for the kth-subtype effect of gene g
mu[ ,1] <- 0
#the first subtype is taken as the baseline subtype
#the subtype effect of subtype 1 is set to zero
mu[ ,2] <- c(rep(2,G/20), rep(0,G/20),rep(0, G-G/20-G/20))
mu[ ,3] <- c(rep(0,G/20), rep(2,G/20),rep(0, G-G/20-G/20))
#batch effect
gamma <- matrix(NA, B, G)
#'location' batch effect of gene g in batch b
gamma[1, ] <- 0
#the first batch is taken as the reference batch without batch effects
#the batch effect of batch 1 is set to zero
gamma[2, ] <- c(rep(3,G/5),rep(2,G/5),rep(1,G/5),
rep(2,G/5),rep(3,G/5))
gamma[3, ] <- c(rep(1,G/5),rep(2,G/5),rep(3,G/5),
rep(2,G/5),rep(1,G/5))
sigma_square <- matrix(NA, B,G)
#sigma_square[b,g] denotes the error variance of gene g in batch b.
sigma_square[1,] <- rep(0.1, G)
sigma_square[2,] <- rep(0.2, G)
sigma_square[3,] <- rep(0.15, G)
Z <- list()
#subtype indicator. Z[b,j] represents the subtype of sample j in batch b
Z[[1]] <- as.integer(c(rep(1,floor(pi[1,1]*n_vec[1])),rep(2,floor(pi[1,2]*n_vec[1])),
rep(3,floor(pi[1,3]*n_vec[1]))))
Z[[2]] <- as.integer(c(rep(1,floor(pi[2,1]*n_vec[2])),rep(2,floor(pi[2,2]*n_vec[2])),
rep(3,floor(pi[2,3]*n_vec[2]))))
Z[[3]] <- as.integer(c(rep(1,floor(pi[3,1]*n_vec[3])),rep(2,floor(pi[3,2]*n_vec[3])),
rep(3,floor(pi[3,3]*n_vec[3]))))
for(b in 1:B){ #generate data
num <- n_vec[b]
example_Data[[b]] <- sapply(1:num, function(j){
tmp <- alpha + mu[ ,Z[[b]][j]] + gamma[b, ] +
rnorm(G, sd = sqrt(sigma_square[b, ]))
tmp
})
}
###############################################################################
#Apply the BUSgibbs Function
###############################################################################
set.seed(123)
BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE)
summary(BUSfits)
## Not run:
#estimated subtypes
est_subtypes <- Subtypes(BUSfits)
#estimated subtype effects
est_subtype_effects <- subtype_effects(BUSfits)
#estimated location batch effects
est_location_batch_effects <- location_batch_effects(BUSfits)
#BIC
BIC_val <- BIC(BUSfits)
#adjusted expression values
adjusted_data <- adjusted_values(BUSfits, Data)
#estimate intrinsic gene indicators by using the default postprob_DE_threshold = 0.5
est_L <- estimate_IG_indicators(BUSfits, postprob_DE_threshold = 0.5)
#obtain the intrinsic gene indicators
intrinsic_gene_indices <- IG_index(est_L)
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.