divTest: Group Comparisons of Information Divergences Based on...

View source: R/divTest.R

divTestR Documentation

Group Comparisons of Information Divergences Based on Generalized Linear Model

Description

Generalized Linear Model for group comparison of information divergence variables yielded by function estimateDivergence from MethylIT R package output. Basically, this a wrapping function to perform the fitting of generalized linear models with glm from 'stats' package to any variable of interest given in GRanges objects of estimateDivergence output.

Usage

divTest(
  GR,
  control.names,
  treatment.names,
  glm.family = Gamma(link = "log"),
  var.weights = FALSE,
  weights = NULL,
  varFilter = 0,
  meanFilter = 0,
  FilterLog2FC = TRUE,
  Minlog2FC = 1,
  divPerBp = 0.001,
  minInd = 2,
  pAdjustMethod = NULL,
  scaling = 1L,
  pvalCutOff = 0.05,
  saveAll = FALSE,
  num.cores = 1,
  tasks = 0L,
  verbose = TRUE,
  ...
)

Arguments

GR

GRanges objects including control and treatment samples containing an information divergence of methylation levels. The names for each column must coincide with the names given for parameters: 'control.names' and 'treatment.names'.

control.names

Names/IDs of the control samples, which must be included in the variable GR in a metacolumn.

treatment.names

Names/IDs of the treatment samples, which must be included in the variable GR in a metacolumn.

glm.family, link

Parameter to be passed to function glm. A description of the error distribution and link function to be used in the model. For glm this can be a character string naming a family function, or the result of a call to a family function. For glm.fit only the third option is supported. (Seefamily function). Default: glm.family=Gamma(link ='log').

var.weights

Logical (default: FALSE). Whether to use group variances as weights.

weights

An optional list of two numeric vectors of ‘prior weights’ to be used in the fitting process. One vector of weights for the control and one for the treatment. Each vector with length equal to length(GR) (default: NULL). Non-NULL weights can be used to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions).

varFilter

Numeric (default: 0). GLM will be performed only for those rows (ranges denoting genomic regions) where the group variance is greater the number specified by varFilter.

meanFilter

Numeric (default: 0). GLM will be performed only for those rows (ranges denoting genomic regions) where the absolute difference of group means is greater the number specified by meanFilter.

FilterLog2FC

if TRUE, the results are filtered using the minimun absolute value of log2FoldChanges observed to accept that a gene in the treatment is differentially expressed in respect to the control.

Minlog2FC

minimum logarithm base 2 of fold changes

divPerBp

At least for one group the mean divergence per bp must be equal to or greater than 'divPerBp' (default divPerBp = 0.001).

minInd

Integer (Default: 2). At least one group must have 'minInd' individuals with a divergence value greater than zero.

pAdjustMethod

Method used to adjust the results; default: 'NULL' (see p.adjust.methods). The p-value adjustment is performed using function p.adjust.

scaling

integer (default 1). Scaling factor estimate the signal density as: scaling * 'DIMP-Count-Per-Bp'. For example, if scaling = 1000, then signal density denotes the number of DIMPs in 1000 bp.

pvalCutOff

cutoff used then a p-value adjustment is performed

saveAll

if TRUE all the temporal results that passed filters 'varFilter' and are 'meanFilter' returned. If FALSE, only the comparisons that passed filters 'varFilter', 'meanFilter', and pvalue < pvalCutOff or adj.pvalue < pvalCutOff (if pAdjustMethod is not NULL) are returned.

num.cores

The number of cores to use, i.e. at most how many child processes will be run simultaneously (see bplapply function from BiocParallel).

tasks

integer(1). The number of tasks per job. Value must be a scalar integer >= 0L. In this documentation a job is defined as a single call to a function, such as bplapply, bpmapply etc. A task is the division of the X argument into chunks. When tasks == 0 (default), X is divided as evenly as possible over the number of workers (see MulticoreParam-class from BiocParallel package).

verbose

if TRUE, prints the function log to stdout

...

Additional parameters passed to glm function.

Details

The default parameter setting glm.family = Gamma(link = 'log') is thought to perform the group comparison of the sums of absolute differences of methylation levels (total variation distance (TVD) at gene-body DIMPs on DMGs). The sums of Hellinger divergence (HD, at gene-body DIMPs on DMGs) can be tested with this setting as well. Both TVD and HD follow asymptotic Chi-square distribution and, consequently, so do the sum of TVD and the sum of HD. The Chi-square distribution is a particular case of Gamma distribution:

f(x|a,s) = 1/(s^a Gamma(a)) x^(a-1) e^-(x/s)

Chi-square density is derived after replacing a = n/2 and s = 2:

f(x|n) = 1/(2^(n/2) Gamma(n/2)) x^(n/2-1) e^(-x/2)

Value

The original GRanges object with the columns 'beta', 'log2FC', 'pvalue', 'adj.pval' (if pAdjustMethod requested), 'CT.divPerBp' and 'TT.divPerBp' (divergence per base pairs), and 'divPerBpVariation added.

Examples

## Gene annotation
genes <- GRanges(seqnames = '1',
                 ranges = IRanges(start = c(3631, 6788, 11649),
                                  end = c(5899, 9130, 13714)),
                 strand = c('+', '-', '-'))
mcols(genes) <- data.frame(gene_id = c('AT1G01010', 'AT1G01020',
                                       'AT1G01030'))
# === The number of cytosine sites to generate ===
sites = 11001
# == Set a seed for pseudo-random number generation ===
set.seed(123)
alpha.ct <- 0.09
alpha.tt <- 0.2
# === Simulate samples ===
ref = simulateCounts(num.samples = 2, sites = sites, alpha = alpha.ct,
                   beta = 0.5, size = 50, theta = 4.5, sample.ids = 'R1')

# Control group
ctrl = simulateCounts(num.samples = 2, sites = sites, alpha = alpha.ct,
                       beta = 0.5, size = 50, theta = 4.5,
                       sample.ids = c('C1', 'C2'))
# Treatment group
treat = simulateCounts(num.samples = 2, sites = sites, alpha = alpha.tt,
                        beta = 0.5, size = 50, theta = 4.5,
                        sample.ids = c('T1', 'T2'))

#  === Estime Divergences ===
HD = estimateDivergence(ref = ref$R1, indiv = c(ctrl, treat),
                        Bayesian = TRUE, num.cores = 1L, percentile = 1,
                        verbose = FALSE)

nlms <- nonlinearFitDist(HD, column = 4, verbose = FALSE)

## Next, the potential signal can be estimated
PS <- getPotentialDIMP(LR = HD, nlms = nlms, div.col = 4, alpha = 0.05)

## The cutpoint estimation used to discriminate the signal from the noise
cutpoints <- estimateCutPoint(PS, control.names = c('C1', 'C2'),
                              treatment.names = c('T1', 'T2'),
                              div.col = 4, verbose = FALSE)
## DIMPs are selected using the cupoints
DIMPs <- selectDIMP(PS, div.col = 9, cutpoint = min(cutpoints$cutpoint))

## Finally DIMPs statistics genes
tv_DIMPs = getGRegionsStat(GR = DIMPs, grfeatures = genes, stat = 'sum',
                           absolute = TRUE, column = 7L)

GR_tv_DIMP = uniqueGRanges(tv_DIMPs, type = 'equal', chromosomes = '1')
colnames(mcols(GR_tv_DIMP)) <-  c('C1', 'C2', 'T1', 'T2')

res <- divTest(GR=GR_tv_DIMP, control.names =  c('C1', 'C2'),
               treatment.names = c('T1', 'T2'))

genomaths/MethylIT.utils documentation built on July 4, 2023, 12:05 a.m.