uniqueGRfilterByCov: Unique 'GRanges-class' of methylation read counts filtered by...
In genomaths/MethylIT: Methylation Analysis Based on Signal Detection

uniqueGRfilterByCov

R Documentation

Unique `GRanges-class` of methylation read counts filtered by coverage.

Description

Given two GRanges-class objects, samples '1' and '2', carrying the counts of methylated (mC) and unmethylated (uC) cytosines in their metacolumns, this function will filter by coverage each cytosine site from each GRanges-class object.

Usage

uniqueGRfilterByCov(
  x,
  y = NULL,
  min.coverage = 4,
  and.min.cov = TRUE,
  min.meth = 0,
  min.umeth = 0,
  min.sitecov = 4,
  percentile = 0.9999,
  min.percentile = TRUE,
  high.coverage = NULL,
  columns = c(mC = 1, uC = 2),
  ignore.strand = FALSE,
  y.centroid = NULL,
  num.cores = 1L,
  tasks = 0L,
  verbose = TRUE,
  ...
)

Arguments

`x`	An object from the classes 'GRanges', 'InfDiv', or 'pDMP' with methylated and unmethylated counts in its meta-column. If the argument 'y' is not given, then it is assumed that the first four columns of the `GRanges-class` metadata 'x' are counts: methylated and unmethylated counts for samples '1' and '2'. If `x` is a 'InfDiv', or 'pDMP', then `'y'` is not needed, since samples '1' and '2' are the first four columns of these objects. In the context of MethylIT methylation analysis, `x` is considered a control sample or, more appropriated (if available) the centroide of control group.
`y`	A `GRanges-class` or a `GRangesList-class` object with methylated and unmethylated counts in its meta-column. Default is NULL. If 'y' is a `GRangesList-class` object, then the first two metacolumns from 'x' are used in the pair-wise filtering toguether with the first two metacolumns from each `GRanges-class` element from 'y'.
`min.coverage`	An integer or an integer vector of length 2. If 'min.coverage' is an integer vector, then the corresponding min coverage is applied to each sample. Default is 4, i.e., `min.coverage = c(4,4)`.
`and.min.cov`	Logical. Whether to apply the logical AND to select the cytosine sites based on `min.coverage`. If FALSE, then a logical OR is applied, and cytosine sites where at least one sample hold the `min.coverage` are preserved. Default is TRUE.
`min.meth`	An integer or an integer vector of length 2. Cytosine sites where the numbers of read counts of methylated cytosine in both samples, '1' and '2', are less than 'min.meth' are discarded. If 'min.meth' is an integer vector, then the corresponding min number of reads is applied to each sample. That is, `min.meth` implement a logical OR.
`min.umeth`	An integer or an integer vector of length 2. Minimum number of reads to consider cytosine position. Specifically cytosine positions where `uC \leq min.umeth` are discarded. Default is `min.umeth = 0`.
`min.sitecov`	An integer. The minimum total coverage. Only sites where the total coverage `cov1 + cov2` is greater than 'min.sitecov' are considered for downstream analysis, where cov1 and cov2 are the coverages for samples 1 and 2, respectively.
`percentile`	Threshold to remove the outliers (PCR bias) from each file and all files stacked. If 'high.coverage = NULL', then the threshold `q` will be computed as: `q1 = quantile(cov1, probs=percentile)` `q2 = quantile(cov2, probs=percentile)` . where `cov1` and `cov2` are the coverage vectors from samples 1 and 2, respectively.
`min.percentile`	Logical. Each sample yield a percentile value. The user must decide whether to use the minimum or the maximum value from these percentile values. Default is TRUE. Hence, `min.percentile = FALSE` implies to use the maximum value.
`high.coverage`	An integer for read counts. Cytosine sites having higher coverage than this are discarded. Default is NULL.
`columns`	Vector of integer numbers of the columns (from each GRanges meta-column) where the methylated and unmethylated counts are provided. If not provided, then the methylated and unmethylated counts are assumed to be at columns 1 and 2, respectively.
`ignore.strand`	When set to TRUE, the strand information is ignored in the overlap of `GRanges-class` objects. This is a parameter passed to `uniqueGRanges` function. Default value: FALSE.
`y.centroid`	Optional. A `GRanges-class` object corresponding to the treatment/individual centroide. This information is applied to reduce the bias originated by missing cytosine sites. The centroide can be compute applying function `poolFromGRlist`.
`num.cores, tasks`	Parameters for parallel computation using package `BiocParallel-package`: the number of cores to use, i.e. at most how many child processes will be run simultaneously (see `bplapply` and the number of tasks per job (only for Linux OS).
`verbose`	if TRUE, prints the function log to stdout
`...`	Additional parameters for `uniqueGRanges` function.

Details

Cytosine sites with 'coverage' > 'min.coverage' in at least one of the samples are preserved. Positions with 'coverage' < 'min.coverage' in both samples, 'x' and 'y', are removed. Positions with 'coverage' > 'percentile' (e.g., 99.9 percentile) are removed as well. It is expected that the columns of methylated and unmethylated counts are given.

This function is addressed to create pair-wise GRanges-class object with four metacolumns of count: samples '1' and '2', carrying the counts of methylated (mC) and unmethylated (uC) cytosines in their metacolumns, respectively. Counts from sample 1 are typically used as reference counts in computing information divergences in the downtstream analysis.

The cut-off value to remove PCR bias is computed as:

If high.coverage is NULL, then the cut-off point is: q = min(q1, q2) (if min.percentile = TRUE) or q = max(q1, q2). If high.coverage is not NULL, then

q = max(q, high.coverage)

Another source of bias is originated by missing cytosine sites. Missing data are frequently found in experimental data sets and, in particular, in bisulfite genomic sequencing data. Typically, in statistical analyses, the bias originated by missing data (for given variable) is mitigated by using the mean of the known values for the corresponding variable. That is, in present case, if the reads for some cytosine site are missed in a sample from a set of, e.g., three individuals, then the means of reads (methylated and unmethylated) for such site are applied as an estimation of the best expected ("guessed") value of missed reads. Obviously, if the reads are missed in all the samples, then the site is discarded (see examples).

The treatment centroide can be compute applying function poolFromGRlist. Also notice that, since the centroide correction is only available for the treat group, it is assumed that sample x carries reads for each (or almost all) cytosine sites are provide.

Value

if 'x' and 'y' are GRanges-class object, then a GRanges-class with the columns of methylated and unmethylated counts filtered for each cytosine position.

A GRangesList-class object will returned, if 'y' is a GRangesList-class object of same length as length(y) and named as names(y).

Examples

## Create new data. It is assumed that sample 'x' carries reads
## for each cytosine sites are provide.
strands <- c("+","-","+","-", "+","-","+","+","+","+","+")
pos <- c(10,11,11,12,13,13,14,15,16,17,18)

x <- data.frame(chr = 'chr1', start = pos, end = pos,
                mC = c(2,3,2,5,10,7,9,11,4,10,7),
                uC = c(2,30,20,4,8,0,10,3,0,8,1),
                strand = strands)

x <- makeGRangesFromDataFrame(x, keep.extra.columns = TRUE)
x

## sample y
y <- data.frame(chr = 'chr1', start = 11:18, end = 11:18,
                mC2 = c(4,1,2,1,4,5:7), uC2 = c(0,0,2:7),
                strand = c("+","-","-","+","+","+","+","+"))

y <- makeGRangesFromDataFrame(y, keep.extra.columns = TRUE)
y

## The default settings. Sites where one of the samples has zero methylation
## calling or min.coverage is lesser than 4 in at least one sample are
## discarded. This setting implies a drastic decision and the amount of
## cytosine removed can be lead to strong biased conclusions in the
## downstream analysis.
uniqueGRfilterByCov(x = x, y = y,
                    percentile = 1,
                    ignore.strand = FALSE)

## Setting 'and.min.cov = FALSE' undesired cytosine sites are preseved. For
## example, meaningless situations with  methylation levels
## p = 1/(1 + 0) = 1.
uniqueGRfilterByCov(x = x, y = y,
                    and.min.cov = FALSE,
                    ignore.strand = FALSE,
                    percentile = 1,
                    verbose = FALSE)

## Setting 'min.coverage = 8' does not solves the previous issue, but still
## it preserves cytosine sites with one of the samples with zero or small
## coverage:
uniqueGRfilterByCov(x = x, y = y,
                    and.min.cov = FALSE,
                    min.coverage = 8,
                    percentile = 1,
                    ignore.strand = FALSE,
                    verbose = FALSE)

## A centroid, a vector of means of methylation read for each cytosine site
## from the treatment group can be used as the best estimation to replace
## missing data: 'mC=0' and 'uC=0' or low coverage sites 'mC=1' and 'uC=0'.
y_centroid <- data.frame(chr = 'chr1',
                start = pos, end = pos,
                mC2 = c(8,7,6,7,5,8,1:5), uC2 = 0:10,
                strand = c("+","-","+","-", "+","-","+","+","+","+","+"))
y_centroid <- makeGRangesFromDataFrame(y_centroid,
                                      keep.extra.columns = TRUE)
y_centroid

## The cytosine sites with missing data or low coverage will be still,
## included, using the centroid of the centroinde of th sample group to
## which 'y' belong to.
uniqueGRfilterByCov(x = x, y = y,
                    and.min.cov = FALSE,
                    min.coverage = c(1, 8),
                    ignore.strand = FALSE,
                    y.centroid = y_centroid,
                    min.percentile = FALSE,
                    percentile = 1,
                    verbose = FALSE)

genomaths/MethylIT documentation built on Feb. 3, 2024, 1:24 a.m.