View source: R/normalizeBatch.R
normalizeBatch | R Documentation |
Perform normalization to correct intensities across batches with at least one common level.
normalizeBatch(batch.x, batch.comp, mode="range", p=0.01,
fix.zero=FALSE, target=NULL, markers=NULL, ...)
batch.x |
A list of length equal to the number of batches.
Each element of the list should be of the same type as |
batch.comp |
A list of length equal to the number of batches.
Each element should be a factor (or coercible to a factor) specifying the composition of each batch,
i.e., which samples belong to which groups.
Also can be |
mode |
A string or character vector of length equal to the number of markers, specifying whether range-based or warping normalization should be performed for each marker.
This can take values of |
p |
A numeric scalar between 0 and 0.5, specifying the percentile used to define the range of the distribution for range-based normalization. |
fix.zero |
A logical scalar indicating whether zero intensities should remain at zero when |
target |
An integer scalar indicating the reference batch. |
markers |
A character vector specifying the markers to be normalized and returned. |
... |
Additional arguments to be passed to |
Consider an experiment containing several batches of barcoded samples, in which the barcoding was performed within but not between batches. This function normalizes the intensities for each marker such that they are comparable between samples in different batches. The process for each marker is as follows:
Weight each batch by composition and number of cells.
Compute a transformation function for each batch to a reference intensity distribution.
Apply the batch-specific functions to that batch's samples to obtain normalized intensities.
Each element of batch.x
should contain data from all samples of a single batch.
This can be a ncdfFlowSet constructed from all samples in that batch,
or a list of intensity matrices from all samples.
In short, each element should be equivalent to the x
argument in ?prepareCellData
.
Each element of batch.comp
should correspond to the same batch as the entry at the same position of batch.x
.
Each element should be a factor of length equal to the number of samples in the associated batch.
Each value of the factor specifies the group identity of a sample in the batch,
and should follow the ordering of samples within the corresponding element of batch.x
.
All markers are used by default when markers=NULL
.
If markers
is specified, only the specified markers will be normalized and returned in the output expression matrices.
This is usually more convenient than subsetting the inputs or outputs manually.
To convert the output into a format appropriate for prepareCellData
, apply unlist
with recursive=FALSE
.
This will generate a list of intensity matrices for all samples in all batches, rather than a list of list of matrices.
Note that a batch effect should still be included in the design matrix when modelling abundances, as only the intensities are corrected here.
A list of lists, where each internal list corresponds to a batch and contains intensity matrices corresponding to all samples in that batch.
This matches the format of batch.x
.
Weighting is performed to downweight the contribution of larger samples within each batch, as well as to match the composition of samples across different batches.
The composition of each batch can be specified by batch.comp
, see below for more details.
The weighted intensities for each batch represents the pooled distribution of intensities from all samples in that batch.
Groupings can be specified as batch-specific factors in batch.comp
, with at least one common group required across all batches.
This composition is used to weight the contribution of each sample to the reference distribution.
For example, a batch with more samples in group A and fewer samples in group B would get lower weights assigned to the former and larger weights to the latter.
Ideally, all batches would contain samples from all groups, with similar total numbers of cells across batches for each group. This maximizes the number of samples that are used to construct the reference distribution (and thus the stability of the reference). Samples in groups that are not present in all batches will be ignored as no weighting can be applied to an absent group.
If the composition of each batch is the same, batch.comp
can be set to NULL
rather than being manually specified.
If mode="range"
, a quantile function is constructed for the pooled distribution of each batch.
These batch-specific functions are used to contruct a reference quantile function, representing a reference distribution.
A batch-specific scaling function is defined to equalize the range of the weighted distribution of intensities from each batch to the range of this reference distribution.
The range of each distribution is computed at percentiles p
and 1-p
to avoid distortions due to outliers.
If fix.zero=TRUE
, the lower bound of the range is always set to zero to avoid turning a zero intensity into a non-zero value after normalization.
If mode="quantile"
, a quantile function is constructed as described above, along with a reference quantile function.
A batch-specific transformation function is defined to convert the quantiles for each batch to the corresponding quantiles of the reference.
This is equivalent to coercing all batch-specific intensity distributions to the reference distribution.
If mode="warp"
, a mock set of intensities is generated is generated that accounts for the differential weighting of events in each batch.
This is used to construct a flowSet for use in warping normalization - see ?normalization
and ?warpSet
for details.
A warping function is computed for each batch that equalizes the locations of landmarks (i.e., peaks) in both the batch-specific and reference intensity distributions.
If target=NULL
, the reference distribution for each marker is defined as an average of the relevant statistic across batches.
For mode="range"
or mode="quantile"
, this means that the reference quantile function is defined as the average of the functions across all batches.
For mode="warp"
, the average location of each landmark across all batches is used.
If target
is not NULL
, the specified batch will be used as the reference distribution.
For mode="range"
or mode="quantile"
, this means that the reference quantile function will be defined as the quantile function of the chosen batch.
Similarly, if mode="warp"
, warpSet
will align all other batches to the locations of the peaks in target
.
The transformation function is applied to the intensities of all samples in that batch, yielding corrected intensities for direct comparisons between samples. This is possible provided that there is at least one group that is present across all batches, in order to construct a common reference distribution. The assumption is that the batch effect in those shared groups is the same as that in the batch-specific groups. However, note that the adjustment may not be accurate if the to-be-corrected intensities lie outside the range of values used to construct the function.
Warping normalization can be more powerful than range-based normalization, as the former can eliminate non-linear changes to the intensities whereas the latter cannot.
However, it requires that landmarks in the intensity distribution (i.e., peaks) be easily identifiable and consistent across batches.
Large differences (e.g., a peak present in one batch and absent in another) may lead to incorrect adjustments.
Such differences may be present when batches are confounded with uninteresting biological factors (e.g., individual, mouse of origin) that affect cell abundance.
In such cases, range-based normalization with mode="range"
is recommended as it is more constrained in how the intensities are adjusted.
This reduces the risk of distorting the intensities, albeit at the cost of “under-normalizing” the data.
Quantile normalization is the most powerful of these methods but also requires the strongest assumptions.
In particular, it requires that the composition of samples in shared groups is exactly the same across all batches.
This is only practically applicable to experimental designs where the same control sample has been used in multiple batches.
In such cases, users should set all control samples to the same “group” in batch.comp
,
while all other samples should be set to batch-specific groups (and are thus ignored during calculation of the transformation functions).
Note that this approach also requires the control samples to cover the range of intensities in the other samples,
otherwise the transformation function will need to extrapolate - often badly.
It is advisable to inspect the intensity distributions before and after normalization, to ensure that the methods have behaved appropriately.
This can be done by constructing histograms for each marker with multiIntHist
.
Aaron Lun
prepareCellData
,
multiIntHist
,
normalization
,
warpSet
### Mocking up some data: ###
nmarkers <- 10
marker.names <- paste0("X", seq_len(nmarkers))
all.x <- list()
for (b in paste0("Batch", 1:3)) { # 3 batches
nsamples <- 10
sample.names <- paste0("Y", seq_len(nsamples))
trans.shift <- runif(nmarkers, 0, 1)
trans.grad <- runif(nmarkers, 1, 2)
x <- list()
for (i in sample.names) {
ex <- matrix(rgamma(nmarkers*1000, 2, 2), nrow=nmarkers)
ex <- t(ex*trans.grad + trans.shift)
colnames(ex) <- marker.names
x[[i]] <- ex
}
all.x[[b]] <- x
}
batch.comp <- list( # Each batch contains different composition/ordering of groups
factor(rep(1:2, c(3,7))),
factor(rep(1:2, c(7,3))),
factor(rep(1:2, 5))
)
### Running the function: ###
corrected <- normalizeBatch(all.x, batch.comp, mode="range")
par(mfrow=c(1,2))
plot(ecdf(all.x[[1]][[3]][,1]), col="blue", main="Before")
plot(ecdf(all.x[[2]][[3]][,1]), add=TRUE, col="red")
plot(ecdf(corrected[[1]][[3]][,1]), col="blue", main="After")
plot(ecdf(corrected[[2]][[3]][,1]), add=TRUE, col="red")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.