thin_2group  R Documentation 
Given a matrix of real RNAseq counts, this function will
randomly assign samples to one of two groups, draw
the log2fold change in expression between two groups for each gene,
and add this signal to the RNAseq counts matrix. The user may specify
the proportion of samples in each group, the proportion of null genes
(where the log2fold change is 0),
and the signal function. This is a specific application of the
general binomial thinning approach implemented in thin_diff
.
thin_2group(
mat,
prop_null = 1,
signal_fun = stats::rnorm,
signal_params = list(mean = 0, sd = 1),
group_prop = 0.5,
corvec = NULL,
alpha = 0,
permute_method = c("hungarian", "marriage"),
type = c("thin", "mult")
)
mat 
A numeric matrix of RNAseq counts. The rows index the genes and the columns index the samples. 
prop_null 
The proportion of genes that are null. 
signal_fun 
A function that returns the signal. This should take as
input 
signal_params 
A list of additional arguments to pass to

group_prop 
The proportion of individuals that are in group 1. 
corvec 
A vector of target correlations. 
alpha 
The scaling factor for the signal distribution from
Stephens (2016). If 
permute_method 
Should we use the GaleShapley algorithm
for stable marriages ( 
type 
Should we apply binomial thinning ( 
The specific application of binomial thinning to the twogroup model was used in Gerard and Stephens (2018) and Gerard and Stephens (2021). This is a specific case of the general method described in Gerard (2020).
A listlike S3 object of class ThinData
.
Components include some or all of the following:
mat
The modified matrix of counts.
designmat
The design matrix of variables used to simulate
signal. This is made by columnbinding design_fixed
and the
permuted version of design_perm
.
coefmat
A matrix of coefficients corresponding to
designmat
.
design_obs
Additional variables that should be included in
your design matrix in downstream fittings. This is made by
columnbinding the vector of 1's with design_obs
.
sv
A matrix of estimated surrogate variables. In simulation studies you would probably leave this out and estimate your own surrogate variables.
cormat
A matrix of target correlations between the
surrogate variables and the permuted variables in the design matrix.
This might be different from the target_cor
you input because
we pass it through fix_cor
to ensure
positive semidefiniteness of the resulting covariance matrix.
matching_var
A matrix of simulated variables used to
permute design_perm
if the target_cor
is not
NULL
.
David Gerard
Gale, David, and Lloyd S. Shapley. "College admissions and the stability of marriage." The American Mathematical Monthly 69, no. 1 (1962): 915. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1080/00029890.1962.11989827")}.
Gerard, D., and Stephens, M. (2021). "Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls." Statistica Sinica, 31(3), 11451166 \Sexpr[results=rd]{tools:::Rd_expr_doi("10.5705/ss.202018.0345")}.
David Gerard and Matthew Stephens (2018). "Empirical Bayes shrinkage and false discovery rate estimation, allowing for unwanted variation." Biostatistics, \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1093/biostatistics/kxy029")}.
Gerard, D (2020). "Databased RNAseq simulations by binomial thinning." BMC Bioinformatics. 21(1), 206. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1186/s1285902034509")}.
Hornik K (2005). "A CLUE for CLUster Ensembles." Journal of Statistical Software, 14(12). \Sexpr[results=rd]{tools:::Rd_expr_doi("10.18637/jss.v014.i12")}. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.18637/jss.v014.i12")}.
C. Papadimitriou and K. Steiglitz (1982), Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs: Prentice Hall.
Stephens, Matthew. "False discovery rates: a new deal." Biostatistics 18, no. 2 (2016): 275294. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1093/biostatistics/kxw041")}.
Wakefield, Jon. "Bayes factors for genomewide association studies: comparison with Pvalues." Genetic epidemiology 33, no. 1 (2009): 7986. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1002/gepi.20359")}.
select_counts
For subsampling the rows and columns of your real RNAseq count matrix prior to applying binomial thinning.
thin_diff
For the more general thinning approach.
ThinDataToSummarizedExperiment
For converting a ThinData object to a SummarizedExperiment object.
ThinDataToDESeqDataSet
For converting a ThinData object to a DESeqDataSet object.
## Simulate data from given matrix of counts
## In practice, you would obtain Y from a real dataset, not simulate it.
set.seed(1)
nsamp < 10
ngene < 1000
Y < matrix(stats::rpois(nsamp * ngene, lambda = 50), nrow = ngene)
thinout < thin_2group(mat = Y,
prop_null = 0.9,
signal_fun = stats::rexp,
signal_params = list(rate = 0.5))
## 90 percent of genes are null
mean(abs(thinout$coef) < 10^6)
## Check the estimates of the log2fold change
Ynew < log2(t(thinout$mat + 0.5))
X < thinout$designmat
Bhat < coef(lm(Ynew ~ X))["X", ]
plot(thinout$coefmat,
Bhat,
xlab = "log2fold change",
ylab = "Estimated log2fold change")
abline(0, 1, col = 2, lwd = 2)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.