BUSgibbs: Batch Effects Correction and Subtype Discovery using Gibbs...

Description Usage Arguments Details Value Author(s) References Examples

View source: R/BUS_Rfunctions.R

Description

The function "BUSgibbs" stands for fitting the Batch effects correction with Unknown Subtypes model (BUS) with the Gibbs Sampler. BUS is capable of (a) correcting batch effects explicitly, (b) grouping samples that share similar characteristics into subtypes, (c) identifying features that distinguish subtypes, and (d) enjoying a linear-order computation complexity. After correcting the batch effects with BUS, the corrected value can be used for other analysis as if all samples are measured in a single batch. We adopt the Bayesian framework and use the Gibbs sampler to conduct posterior inference for the BUS model.

Usage

1
2
3
BUSgibbs(Data, n.subtypes, n.iterations = 500, n.records = floor(n.iterations/2), 
		 hyperparameters = c(1, sqrt(5), sqrt(5), 2, 2, 1, 2, 0.005, 1, 3, 10), 
		 showIteration = TRUE)

Arguments

Data

Data is either an R list or a SummarizedExperiment object. If Data is an R list, it has the length equal to the batch number. The bth element of Data is the gene expression matrix in batch b, where the rows correspond to genes and the columns represent samples. If Data is a SummarizedExperiment object, assays(Data) must contain a gene expression matrix named "GE_matr", where one row represents a gene and one column corresponds to a sample. colData(Data) must include a vector named "Batch", which indicates the batch information for each sample.

n.subtypes

n.subtypes is the subtype number, which needs to be specified by the user.

n.iterations

n.iterations is the iteration number used in the Gibbs sampler. The default is 500.

n.records

The posterior samples in the last n.records iterations are used to conduct posterior inference. The default is one half of n.iterations.

hyperparameters

hyperparameters is a hyper-parameter vector with 11 elements used in the Gibbs sampler. The first element to the last element of hyperparameters are as follows. eta_alpha: the mean of the normal prior for alpha_g; tau_alpha: the standard deviation of the normal prior for alpha_g; tau_gamma: the standard deviation of the normal prior for gamma; alpha_par: the parameter of the Dirichlet prior for subtype proportions; a_inv_gamma: the shape of the gamma prior for 1/sigma^2_bg; b_inv_gamma: the rate of the gamma prior for 1/sigma^2_bg; a_tau0: the shape of the gamma prior for 1/tau^2_mu 0; b_tau0: the rate of the gamma prior for 1/tau^2_mu 0; (a_p, b_p): parameters in the beta prior for p; tau_{mu1}: the standard deviation of the normal prior of mu_gk (k >= 2) when the gene expression level in subtype k is different from that in subtype one.

showIteration

If TRUE, the iteration number will be displayed when conducting Gibbs sampler. The default is TRUE.

Details

Notice that Data, the input original gene expression values, are organized in the format of an R list with length equal to the batch number. Its bth element Data[[b]] is a G by n_b matrix, where G is the gene number and n_b is the sampler size of batch b.

Value

L_PosterSamp

The posterior samples of the intrinsic gene indicators. The return is a G by K-1 by n.records array, where G is the gene number, K is the subtype number, and n.records is the number for recorded iterations.

Subtypes

The estimated subtypes, an R list with length equal to the batch number B, in which Subtypes[[b]] is an integer vector showing the subtype indicators of samples in batch b.

tau_mu_zero

The estimated tau_mu 0, which is the prior normal distribution's standard deviation of the subtype effects when there is no differential expression.

p

The estimated proportion of intrinsic genes.

pi

The estimated subtype proportions across batches, a B by K matrix, whose [b,k] element is the estimated proportion of subtype k in the batch b.

alpha

The estimated baseline expression levels, a G-dimension vector, whose gth element is the estimated mean gene expression level of gene g in subtype one.

gamma_PosterSamp

The posterior samples of location batch effects, a G by B by n.records array.

gamma

The estimated location batch effects, a G by B matrix, where gamma_gb is the “location” batch effect on gene g in the batch b. Note that the first column is zero as the first batch is taken as the reference batch without batch effects.

sigma_sq_PosterSamp

The posterior samples of variances, a G by B by n.records array.

sigma_sq

The estimated variance, a G by B matrix, whose [g,b] element is the variance of gene g's expression in the batch b.

mu_PosterSamp

The posterior samples of subtype effects, a G by K by n.records array.

mu

The estimated subtype effects, a G by K matrix, whose [g,k] element is the subtype k effect on gene g. Note that the first column is zero as the fist subtype is taken as the baseline subtype.

BIC

the BIC value when K = n.subtypes, which is used to determine the subtype number by varying the value of K.

Author(s)

Xiangyu Luo

References

Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
###############################################################################
#Generate Simulation Data
###############################################################################
rm(list = ls(all = TRUE))  
set.seed(123)

B <- 3					
#total number of batches

K <- 3					
#total number of subtypes

G <- 3000					
#total number of genes

pi <- matrix(NA, B, K)			
# pi[b,k] stands for the proportion of kth subtype in bth batch

pi[1, ] <- c(0.2, 0.3, 0.5)
pi[2, ] <- c(0.4, 0.2, 0.4)
pi[3, ] <- c(0.3, 0.4, 0.3)

	
#total number of samples in each bacth.
n_vec <- rep(NA, B) 		
	
#n_vec[b] represents the total number of samples in batch b.
n_vec <- c(100, 110, 120)

#Data list
example_Data <- list()

#baseline expression level
alpha <- rep(2, G)



#subtype effect
mu <- matrix(NA, G, K)			
#subtype effect, mu[g,k] stands for the kth-subtype effect of gene g 

mu[ ,1] <- 0				
#the first subtype is taken as the baseline subtype			
#the subtype effect of subtype 1 is set to zero

mu[ ,2] <- c(rep(2,G/20), rep(0,G/20),rep(0, G-G/20-G/20))
mu[ ,3] <- c(rep(0,G/20), rep(2,G/20),rep(0, G-G/20-G/20)) 

#batch effect
gamma <- matrix(NA, B, G)		
#'location' batch effect of gene g in batch b 
gamma[1, ] <- 0				
#the first batch is taken as the reference batch without batch effects	
#the batch effect of batch 1 is set to zero
gamma[2, ] <- c(rep(3,G/5),rep(2,G/5),rep(1,G/5),
										rep(2,G/5),rep(3,G/5))
gamma[3, ] <- c(rep(1,G/5),rep(2,G/5),rep(3,G/5),
										rep(2,G/5),rep(1,G/5))

sigma_square <- matrix(NA, B,G)	
#sigma_square[b,g] denotes the error variance of gene g in batch b.

sigma_square[1,] <- rep(0.1, G)
sigma_square[2,] <- rep(0.2, G)
sigma_square[3,] <- rep(0.15, G)


Z <- list()			
#subtype indicator. Z[b,j] represents the subtype of sample j in batch b

Z[[1]] <- as.integer(c(rep(1,floor(pi[1,1]*n_vec[1])),rep(2,floor(pi[1,2]*n_vec[1])),
 rep(3,floor(pi[1,3]*n_vec[1]))))	
 
Z[[2]] <- as.integer(c(rep(1,floor(pi[2,1]*n_vec[2])),rep(2,floor(pi[2,2]*n_vec[2])), 
rep(3,floor(pi[2,3]*n_vec[2]))))	

Z[[3]] <- as.integer(c(rep(1,floor(pi[3,1]*n_vec[3])),rep(2,floor(pi[3,2]*n_vec[3])), 
rep(3,floor(pi[3,3]*n_vec[3]))))



for(b in 1:B){				#generate data 
	num <- n_vec[b]
	example_Data[[b]] <-  sapply(1:num, function(j){
						tmp <- alpha + mu[ ,Z[[b]][j]] + gamma[b, ] + 
							rnorm(G, sd = sqrt(sigma_square[b, ])) 
									
						tmp
					}) 

}

###############################################################################
#Apply the BUSgibbs Function
###############################################################################
set.seed(123)
BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE)

BUScorrect documentation built on Nov. 8, 2020, 8:06 p.m.