samesum_log2_norm | R Documentation |
This is a method of computing log counts while dealing sensibly with zeros and differing library sizes. It should cope well with very sparse data. The input is transformed using log2(x/scale+1)
with a different scale for each sample. The scale for each sample is chosen so all of the samples add up to the same total. Finding these scale values is done efficiently using Newton's method.
samesum_log2_norm(x, scale = 2, sum = NA, tol = 1e-09, verbose = FALSE)
samesum_log2_cpm(x, scale = 2, sum = NA, tol = 1e-09, verbose = FALSE)
x |
A matrix or a sparse matrix. Usually these will be counts, but any matrix of non-negative values could potentially be used. We regard columns as samples. |
scale |
Target scale. Aim to give each sample close to this value of scale. The default value of 2 should be reasonable for count data, but see discussion below. |
sum |
Target sum. Aim to give each sample exactly this sum. |
tol |
Tolerance when optimizing scale values. 1/scale is increased from zero using Newton's method until the sum is at least |
verbose |
If TRUE, output some debugging messages. |
You can either specify the argument scale
, and it will try to give individual samples scale values close to this value, or directly specify the target sum with sum
.
samesum_log2_cpm
adds a constant to the result to produce values that can be treated as log2 Counts Per Million (CPM). The result will have the property that mean(colSums(2^lcpm-2^attr(lcpm,"zero")))
equals one million.
Unlike using library size normalization, the samesum method is robust to variations in high abundance features, because it works on the log transformed values. It is also robust to noise from low abundance features because of the +1
in the transformation.
This method has some similarity to Centered Log Ratio (CLR) transformation. Where the CLR ensures each sample adds to zero, the samesum transformation ensures each sample adds to the same non-zero value. Like the CLR, distances between samesum transformed data are a good way to measure sample similarity. Transformed values could be used with PCA or with non-linear dimension reduction methods.
Samesum transformed data can also be used in differential abundance analysis using the limma-trend or limma-vooma methods. As usual, the assumption here is that there is a background of unchanging features that largely determine the scaling of each sample. This is usually true for RNA-Seq differential expression analysis, but seems to be questioned more often in metagenomics.
The default value of 2 for scale should be reasonable for count data, but is not variance stabilizing. limma-trend or limma-vooma will take this into account. For visualization and exploratory methods you could try a larger scale value, which will damp down variation in low abundance rows and provide better variance stabilization. The scale parameter acts similarly to a pseudocount in log transformation methods with a pseudocount parameter (such as edgeR's cpm
function).
When different samples have different library sizes, some of them may be given a scale much smaller than the target scale. When this happens, these samples may appear as outliers in data visualization, purely as an artifact of the transformation. A warning is produced if any samples are given a scale less than 0.5. Consider excluding these samples from your analysis. If all the samples need to be used, you can increase the target scale to avoid this problem. This will allow you to compare all samples, but at the cost of losing some ability to resolve fine differences between samples.
A matrix or a sparse matrix, depending on the input.
The return value will have some attributes accessible using attr()
.
"samples" is a data frame with per-sample information. The most important value for the transformation is the "scale" column. There is also "true_library_size", which is the sum of the input column, and "adjusted_library_size" if you want to use this method just as a library size adjustment method.
"zero" is the value that zero is transformed to. This will be zero for samesum_log2_norm
.
"sum" is the value each column of the output sums to (but not including the CPM adjustment if using samesum_log2_cpm
).
set.seed(563)
x <- matrix(rpois(15, 10), ncol=3)
samesum_log2_norm(x)
samesum_log2_cpm(x)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.