normalize: Normalization of ChIP-seq and other count data


Description

This function implements several methods for between-sample normalization of count data. Although these methods were developed for RNA-seq data, they are also useful for normalizing ChIP-seq data once reads have been counted within regions or bins. Some methods may also be applied to count data that has already undergone within-sample normalization (e.g. TPM or RPKM values).

Usage

## S4 method for signature 'ChIPseqSet'
normalize(object, method, isLogScale = FALSE, trim = 0.3, totalCounts)
## S4 method for signature 'ExpressionSet'
normalize(object, method, isLogScale = FALSE, trim = 0.3, totalCounts)

Arguments

object

An object of class ChIPseqSet or ExpressionSet that contains the raw data.

method

Normalization method, either "scale", "scaleMedianRegion", "quantile" or "tmm".

isLogScale

Indicates whether the raw data in object is already logarithmized. Default value is FALSE. Logarithmized data is returned on the log scale; non-logarithmized data remains on its original scale.

trim

Only used if method is "tmm". Indicates the fraction of data points that should be trimmed before calculating the mean. Default value is 0.3.

totalCounts

Only used if method is "scale". A vector giving the total number of reads for each sample. The vector's length must equal the number of samples in object. Default values are the sums over all features for each sample (i.e. the column sums of object).

Details

The following normalization methods are implemented:

  1. scale: Samples are scaled by a factor such that all samples have the same number N of reads after normalization, where N is the median number of reads observed across all samples. If the argument totalCounts is missing, the total numbers of reads are calculated from the given data. Otherwise, the values in totalCounts are used.

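  2. scaleMedianRegion: The scaling factor s_j for the j-th sample is defined as

    s_j = median_i \frac{k_{ij}}{\left(\prod_{v=1}^m k_{iv}\right)^{1/m}},

    where k_{ij} is the value of region i in sample j and m is the number of samples, so that the denominator is the geometric mean of region i across all samples. See Anders and Huber (2010) for details.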

  3. quantile: Quantile normalization is applied to the ChIP-seq values such that each sample has the same empirical cumulative distribution function (cdf) after normalization.

  4. tmm: The trimmed mean of M-values (tmm) normalization was proposed by Robinson and Oshlack (2010). Here, the logarithm of the scaling factor for sample i is calculated as the trimmed mean of

    \log(k_{i,j}/m_{j}),

    where k_{i,j} is the value of region j in sample i and m_{j} denotes the geometric mean of region j. The argument trim defaults to 0.3, so that the smallest 15% and the largest 15% of the log ratios are trimmed before calculating the mean.
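
The four methods described above can be illustrated with a minimal base R sketch on a plain count matrix. This is not the package's implementation: the function names scaleSketch, medianRegionSketch, quantileSketch and tmmSketch and the toy matrix are hypothetical, the code assumes strictly positive counts on the original (non-log) scale, and the tmm sketch passes trim / 2 to mean() only so that it matches the description of trimming 15% at each end for trim = 0.3.

```r
# Toy count matrix (made-up values): rows are regions, columns are samples.
counts <- matrix(c(10, 20, 30, 40,
                   20, 40, 60, 80,
                    5, 10, 15, 20),
                 nrow = 4,
                 dimnames = list(paste0("region", 1:4), paste0("sample", 1:3)))

# "scale": rescale each sample so that its total equals the median total N.
scaleSketch <- function(x, totalCounts = colSums(x)) {
  sweep(x, 2, median(totalCounts) / totalCounts, "*")
}

# "scaleMedianRegion": divide sample j by s_j, the median over regions of
# k_ij divided by the geometric mean of region i across samples.
medianRegionSketch <- function(x) {
  geoMean <- exp(rowMeans(log(x)))     # one geometric mean per region
  s <- apply(x / geoMean, 2, median)   # x / geoMean divides row i by geoMean[i]
  sweep(x, 2, s, "/")
}

# "quantile": replace each value by the mean, across samples, of the values
# that share its within-sample rank.
quantileSketch <- function(x) {
  target <- rowMeans(apply(x, 2, sort))
  out <- apply(x, 2, function(col) target[rank(col, ties.method = "first")])
  dimnames(out) <- dimnames(x)
  out
}

# "tmm" as described above: the log scaling factor of a sample is the trimmed
# mean of its log ratios to the regions' geometric means.
tmmSketch <- function(x, trim = 0.3) {
  logRatios <- log(x / exp(rowMeans(log(x))))
  # Assumption: mean(trim = ) removes that fraction from EACH end, so trim / 2
  # drops trim / 2 of the smallest and trim / 2 of the largest log ratios.
  logFactor <- apply(logRatios, 2, mean, trim = trim / 2)
  sweep(x, 2, exp(logFactor), "/")
}

colSums(scaleSketch(counts))  # every sample now sums to the median total
```

Because the toy samples differ only by a constant factor, all four sketches make the three columns identical, which makes the effect of each method easy to inspect by eye.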

Value

An object of the same class as the input object with the normalized data.

Author(s)

Hans-Ulrich Klein (hklein@broadinstitute.org)

References

Anders and Huber. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106.

Robinson and Oshlack. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25.

Examples

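  library(epigenomix)  # provides ChIPseqSet(), chipVals() and normalize()

  # Simulate counts for 20 features; sample2 has twice the expected depth
  set.seed(1234)
  chip <- matrix(c(rpois(20, lambda=10), rpois(20, lambda=20)), nrow=20,
                 dimnames=list(paste("feature", 1:20, sep=""), c("sample1", "sample2")))
  rowRanges <- GRanges(IRanges(start=1:20, end=1:20),
                       seqnames=rep("1", 20))
  names(rowRanges) <- rownames(chip)
  cSet <- ChIPseqSet(chipVals=chip, rowRanges=rowRanges)

  # After tmm normalization the trimmed means of the log values should agree;
  # abs() makes the check fail for large differences in either direction
  tmmSet <- normalize(cSet, method="tmm", trim=0.3)
  abs(mean(log(chipVals(tmmSet))[, 1], trim=0.3) -
      mean(log(chipVals(tmmSet))[, 2], trim=0.3)) < 0.01

  # After quantile normalization both samples should share the same quantiles
  quantSet <- normalize(cSet, method="quantile")
  all(quantile(chipVals(quantSet)[, 1]) == quantile(chipVals(quantSet)[, 2]))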

epigenomix documentation built on Nov. 8, 2020, 5:24 p.m.