thin_diff  R Documentation 
Given a matrix of real RNAseq counts, this function will add a known amount of signal to the count matrix. This signal is given in the form of a Poisson / negative binomial / mixture of negative binomials generalized linear model with a log (base 2) link. The user may specify any arbitrary design matrix and coefficient matrix. The user may also control for the amount of correlation between the observed covariates and any unobserved surrogate variables. The method is described in detail in Gerard (2020).
thin_diff( mat, design_fixed = NULL, coef_fixed = NULL, design_perm = NULL, coef_perm = NULL, target_cor = NULL, use_sva = FALSE, design_obs = NULL, relative = TRUE, change_colnames = TRUE, permute_method = c("hungarian", "marriage"), type = c("thin", "mult") )
mat 
A numeric matrix of RNAseq counts. The rows index the genes and the columns index the samples. 
design_fixed 
A numeric design matrix whose rows are fixed and not to be permuted. The rows index the samples and the columns index the variables. The intercept should not be included (though see Section "Unestimable Components"). 
coef_fixed 
A numeric matrix. The coefficients corresponding to

design_perm 
A numeric design matrix whose rows are to be permuted (thus controlling the amount by which they are correlated with the surrogate variables). The rows index the samples and the columns index the variables. The intercept should not be included (though see Section "Unestimable Components"). 
coef_perm 
A numeric matrix. The coefficients corresponding to

target_cor 
A numeric matrix of target correlations between the
variables in 
use_sva 
A logical. Should we use surrogate variable analysis
(Leek and Storey, 2008) using 
design_obs 
A numeric matrix of observed covariates that are NOT to
be a part of the signal generating process. Only used in estimating the
surrogate variables (if 
relative 
A logical. Should we apply relative thinning ( 
change_colnames 
A logical. Should we change the columnnames
of the design matrices ( 
permute_method 
Should we use the GaleShapley algorithm
for stable marriages ( 
type 
Should we apply binomial thinning ( 
A listlike S3 object of class ThinData
.
Components include some or all of the following:
mat
The modified matrix of counts.
designmat
The design matrix of variables used to simulate
signal. This is made by columnbinding design_fixed
and the
permuted version of design_perm
.
coefmat
A matrix of coefficients corresponding to
designmat
.
design_obs
Additional variables that should be included in
your design matrix in downstream fittings. This is made by
columnbinding the vector of 1's with design_obs
.
sv
A matrix of estimated surrogate variables. In simulation studies you would probably leave this out and estimate your own surrogate variables.
cormat
A matrix of target correlations between the
surrogate variables and the permuted variables in the design matrix.
This might be different from the target_cor
you input because
we pass it through fix_cor
to ensure
positive semidefiniteness of the resulting covariance matrix.
matching_var
A matrix of simulated variables used to
permute design_perm
if the target_cor
is not
NULL
.
Let
Be the number of samples.
Be the number of genes.
Be an G by N matrix of real RNAseq counts.
This is mat
.
Be an N by P_1 userprovided design matrix.
This is design_fixed
.
Be an N by P_2 userprovided design matrix.
This is design_perm
.
Be an N by P_3 matrix of known covariates.
This is design_obs
.
Be an N by K matrix of unobserved surrogate
variables. This is estimated when target_cor
is not
NULL
.
Be a G by N of additional (unknown) unwanted variation.
We assume that Y is Poisson distributed given X_3 and Z such that
\log_2(EY) = μ 1_N' + B_3X_3' + AZ' + M.
thin_diff()
will take as input X_1, X_2, B_1,
B_2, and will output a \tilde{Y} and W such that
\tilde{Y} is Poisson distributed given X_1, X_2, X_3,
W, Z, and M such that
\log_2(E\tilde{Y}) \approx \tilde{μ}1_N' + B_1X_1' + B_2X_2'W' + B_3X_3' + AZ' + M,
where W is an N by N permutation matrix. W is randomly drawn so that WX_2 and Z are correlated approximately according to the target correlation matrix.
The Poisson assumption may be generalized to a mixture of negative binomials.
It is possible to include an intercept term or a column from
design_obs
into either design_fixed
or design_perm
.
This will not produce an error and the specified thinning will be applied.
However, If any column of design_fixed
or
design_perm
is a vector of ones or contains a column from
design_obs
, then the corresponding columns in coef_fixed
or coef_perm
cannot be estimated by any method. This is
represented in the output by having duplicate columns in
designmat
and design_obs
.
Including duplicate columns in design_fixed
and design_perm
is also allowed but, again, will produce unestimable coefficients.
Including an intercept term in design_obs
will produce an error if
you are specifying correlated surrogate variables.
David Gerard
Gale, David, and Lloyd S. Shapley. "College admissions and the stability of marriage." The American Mathematical Monthly 69, no. 1 (1962): 915. doi: 10.1080/00029890.1962.11989827.
Gerard, D (2020). "Databased RNAseq simulations by binomial thinning." BMC Bioinformatics. 21(1), 206. doi: 10.1186/s1285902034509.
Hornik K (2005). "A CLUE for CLUster Ensembles." Journal of Statistical Software, 14(12). doi: 10.18637/jss.v014.i12.
Leek, Jeffrey T., and John D. Storey. "A general framework for multiple testing dependence." Proceedings of the National Academy of Sciences 105, no. 48 (2008): 1871818723. doi: 10.1073/pnas.0808709105.
C. Papadimitriou and K. Steiglitz (1982), Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs: Prentice Hall.
select_counts
For subsampling the rows and columns of your real RNAseq count matrix prior to applying binomial thinning.
thin_2group
For the specific application of
thin_diff
to the twogroup model.
thin_lib
For the specific application of
thin_diff
to library size thinning.
thin_gene
For the specific application of
thin_diff
to total gene expression thinning.
thin_all
For the specific application of
thin_diff
to thinning all counts uniformly.
thin_base
For the underlying thinning function
used in thin_diff
.
sva
For the implementation of surrogate variable analysis.
ThinDataToSummarizedExperiment
For converting a ThinData object to a SummarizedExperiment object.
ThinDataToDESeqDataSet
For converting a ThinData object to a DESeqDataSet object.
## Generate simulated data with surrogate variables ## In practice, you would obtain mat from a real dataset, not simulate it. set.seed(1) n < 10 p < 1000 Z < matrix(abs(rnorm(n, sd = 4))) alpha < matrix(abs(rnorm(p, sd = 1))) mat < round(2^(alpha %*% t(Z) + abs(matrix(rnorm(n * p, sd = 5), nrow = p, ncol = n)))) ## Choose simulation parameters design_perm < cbind(rep(c(0, 1), length.out = n), runif(n)) coef_perm < matrix(rnorm(p * ncol(design_perm), sd = 6), nrow = p) ## Specify one surrogate variable (number of columns in taget_cor), ## highly correlated with first observed covariate and uncorrelated ## with second observed covariate target_cor < matrix(c(0.9, 0)) ## Thin thout < thin_diff(mat = mat, design_perm = design_perm, coef_perm = coef_perm, target_cor = target_cor) ## target_cor approximates correlation between estimated surrogate variable ## and matching variable. cor(thout$matching_var, thout$sv) ## Estimated surrogate variable is associated with true surrogate variable ## (because the signal is strong in this case) plot(Z, thout$sv, xlab = "True SV", ylab = "Estimated SV") ## So target_cor approximates correlation between surrogate variable and ## matching variables cor(thout$matching_var, Z) ## Correlation between permuted covariates and surrogate variables are less ## close to target_cor cor(thout$designmat, Z) ## Estimated signal is correlated to true single. First variable is slightly ## biased because the surrogate variable is not included. Ynew < log2(t(thout$mat) + 0.5) X < thout$designmat coef_est < t(coef(lm(Ynew ~ X))[2:3, ]) plot(thout$coefmat[, 1], coef_est[, 1], main = "First Variable", xlab = "Coefficient", ylab = "Estimated Coefficient") abline(0, 1, col = 2, lwd = 2) plot(thout$coefmat[, 2], coef_est[, 2], main = "Second Variable", xlab = "Coefficient", ylab = "Estimated Coefficient") abline(0, 1, col = 2, lwd = 2) ## But estimated coefficient of the first variable is slightly closer when ## the surrogate variable is included. Ynew < log2(t(thout$mat) + 0.5) X < cbind(thout$designmat, thout$sv) coef_est < t(coef(lm(Ynew ~ X))[2:3, ]) plot(thout$coefmat[, 1], coef_est[, 1], main = "First Variable", xlab = "Coefficient", ylab = "Estimated Coefficient") abline(0, 1, col = 2, lwd = 2) plot(thout$coefmat[, 2], coef_est[, 2], main = "Second Variable", xlab = "Coefficient", ylab = "Estimated Coefficient") abline(0, 1, col = 2, lwd = 2)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.