# Data generation

This section outlines how the simulated data is generated. First, a 'base' expression matrix is generated, with 1,000 variables (rows) and the number of samples equal to the sum of the four group sizes indicated by the user (by default, 5 per group/batch combination, giving a total sample size of 20). The base expression values are sampled independently from a normal distribution with mean 10 and standard deviation 2. The random seed used for the simulation can be changed by the user to generate a different matrix.

Next, the variables affected by the condition, batch or 'unknown' effects (denoted here by $C$, $B$ and %$U$, respectively) are sampled randomly among the full set of variables. The number of variables affected by each type of effect is determined by the user.

For each variable in $C$, the base expression for the samples in condition c2 is either increased or decreased by the indicated condition effect size, determined by the user. The sign of the effect is randomly determined. Similarly, for each variable in $B$, the base expression for the samples in batch b2 is either increased or decreased by the indicated batch effect size. For each variable in $U$, if the unknown effect is designed to be categorical, samples are randomly split into two groups and the base expression for the samples in the second group is either increased or decreased by the indicated unknown effect size. If the unknown effect is designed to be continuous, each sample is randomly assigned a covariate value between 0 and 1, and the base expression is either increased or decreased by this covariate value times the indicated unknown effect size.

More formally (here excluding the unknown effect for simplicity), let $y_{ij}^0\in N(10,2)$ denote the base expression level of variable $i$ in sample $j$. Furthermore, let $S_{b2}$ denote the set of samples from batch $b2$, and similarly let $S_{c2}$ denote the set of samples from condition $c2$. Finally, let $C$ denote the set of variables affected by the condition, and $B$ the set of variables affected by the batch effect. Then, the simulated expression value for variable $i$ in sample $j$ is obtained by $$y_{ij}=y_{ij}^0 + (-1)^{\beta_C}\cdot\ell_C\cdot I(j\in S_{c2})\cdot I(i\in C) + (-1)^{\beta_B}\cdot\ell_B\cdot I(j\in S_{b2})\cdot I(i\in B).$$ Here, $\ell_C$ and $\ell_B$ denote the condition and batch effect sizes, respectively, $I$ denotes the indicator function, and $\beta_C$ and $\beta_B$ are random variables drawn from a Bernoulli(0.5) distribution.

# Statistical testing

The data matrix generated as described above is next subjected to row-wise (independent) statistical tests. Depending on the analysis approach chosen by the user, a different model is fit. The returned p-values, however, always test the null hypothesis that there is no difference in the mean expression value between the two conditions.

If the analysis approach is Don't account for batch effect, a linear model is fit for each row independently, with the condition as the sole predictor. With the Include batch effect in model analysis approach, a linear model is fit with both the batch and condition annotations as predictors.

With the Remove batch effect in advance analysis approach, the batch effect is first removed using the removeBatchEffect function from the limma package. The batch corrected values are then used as the response in a linear model with the condition as the sole predictor. Similarly, if the analysis approach is chosen to be Remove batch effect in advance, accounting for condition, the removeBatchEffect function is applied to remove the batch effect, but retain the condition effect. The batch corrected values are then used as the response in a linear model with the condition as the sole predictor. Note that these approaches (removing the batch effect in advance before fitting the linear model of interest) are not recommended practice, and just included for illustration.

The model fitting returns a p-value for each variable. These are adjusted for multiple testing using the Benjamini-Hochberg method.

csoneson/ConfoundingExplorer documentation built on May 21, 2024, 12:05 p.m.