divTest: Group Comparisons of Information Divergences Based on...
In genomaths/MethylIT.utils: MethylIT utility

divTest

R Documentation

Group Comparisons of Information Divergences Based on Generalized Linear Model

Description

Generalized Linear Model for group comparison of information divergence variables yielded by function estimateDivergence from MethylIT R package output. Basically, this a wrapping function to perform the fitting of generalized linear models with glm from 'stats' package to any variable of interest given in GRanges objects of estimateDivergence output.

Usage

divTest(
  GR,
  control.names,
  treatment.names,
  glm.family = Gamma(link = "log"),
  var.weights = FALSE,
  weights = NULL,
  varFilter = 0,
  meanFilter = 0,
  FilterLog2FC = TRUE,
  Minlog2FC = 1,
  divPerBp = 0.001,
  minInd = 2,
  pAdjustMethod = NULL,
  scaling = 1L,
  pvalCutOff = 0.05,
  saveAll = FALSE,
  num.cores = 1,
  tasks = 0L,
  verbose = TRUE,
  ...
)

Arguments

`GR`	GRanges objects including control and treatment samples containing an information divergence of methylation levels. The names for each column must coincide with the names given for parameters: 'control.names' and 'treatment.names'.
`control.names`	Names/IDs of the control samples, which must be included in the variable GR in a metacolumn.
`treatment.names`	Names/IDs of the treatment samples, which must be included in the variable GR in a metacolumn.
`glm.family, link`	Parameter to be passed to function `glm`. A description of the error distribution and link function to be used in the model. For `glm` this can be a character string naming a family function, or the result of a call to a family function. For `glm`.fit only the third option is supported. (See`family` function). Default: glm.family=Gamma(link ='log').
`var.weights`	Logical (default: FALSE). Whether to use group variances as weights.
`weights`	An optional list of two numeric vectors of ‘prior weights’ to be used in the fitting process. One vector of weights for the control and one for the treatment. Each vector with length equal to length(GR) (default: NULL). Non-NULL weights can be used to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions).
`varFilter`	Numeric (default: 0). GLM will be performed only for those rows (ranges denoting genomic regions) where the group variance is greater the number specified by varFilter.
`meanFilter`	Numeric (default: 0). GLM will be performed only for those rows (ranges denoting genomic regions) where the absolute difference of group means is greater the number specified by meanFilter.
`FilterLog2FC`	if TRUE, the results are filtered using the minimun absolute value of log2FoldChanges observed to accept that a gene in the treatment is differentially expressed in respect to the control.
`Minlog2FC`	minimum logarithm base 2 of fold changes
`divPerBp`	At least for one group the mean divergence per bp must be equal to or greater than 'divPerBp' (default divPerBp = 0.001).
`minInd`	Integer (Default: 2). At least one group must have 'minInd' individuals with a divergence value greater than zero.
`pAdjustMethod`	Method used to adjust the results; default: 'NULL' (see `p.adjust`.methods). The p-value adjustment is performed using function `p.adjust`.
`scaling`	integer (default 1). Scaling factor estimate the signal density as: scaling * 'DIMP-Count-Per-Bp'. For example, if scaling = 1000, then signal density denotes the number of DIMPs in 1000 bp.
`pvalCutOff`	cutoff used then a p-value adjustment is performed
`saveAll`	if TRUE all the temporal results that passed filters 'varFilter' and are 'meanFilter' returned. If FALSE, only the comparisons that passed filters 'varFilter', 'meanFilter', and pvalue < pvalCutOff or adj.pvalue < pvalCutOff (if pAdjustMethod is not NULL) are returned.
`num.cores`	The number of cores to use, i.e. at most how many child processes will be run simultaneously (see `bplapply` function from BiocParallel).
`tasks`	integer(1). The number of tasks per job. Value must be a scalar integer >= 0L. In this documentation a job is defined as a single call to a function, such as bplapply, bpmapply etc. A task is the division of the X argument into chunks. When tasks == 0 (default), X is divided as evenly as possible over the number of workers (see `MulticoreParam-class` from BiocParallel package).
`verbose`	if TRUE, prints the function log to stdout
`...`	Additional parameters passed to `glm` function.

Details

The default parameter setting glm.family = Gamma(link = 'log') is thought to perform the group comparison of the sums of absolute differences of methylation levels (total variation distance (TVD) at gene-body DIMPs on DMGs). The sums of Hellinger divergence (HD, at gene-body DIMPs on DMGs) can be tested with this setting as well. Both TVD and HD follow asymptotic Chi-square distribution and, consequently, so do the sum of TVD and the sum of HD. The Chi-square distribution is a particular case of Gamma distribution:

f(x|a,s) = 1/(s^a Gamma(a)) x^(a-1) e^-(x/s)

Chi-square density is derived after replacing a = n/2 and s = 2:

f(x|n) = 1/(2^(n/2) Gamma(n/2)) x^(n/2-1) e^(-x/2)

Value

The original GRanges object with the columns 'beta', 'log2FC', 'pvalue', 'adj.pval' (if pAdjustMethod requested), 'CT.divPerBp' and 'TT.divPerBp' (divergence per base pairs), and 'divPerBpVariation added.

Examples

## Gene annotation
genes <- GRanges(seqnames = '1',
                 ranges = IRanges(start = c(3631, 6788, 11649),
                                  end = c(5899, 9130, 13714)),
                 strand = c('+', '-', '-'))
mcols(genes) <- data.frame(gene_id = c('AT1G01010', 'AT1G01020',
                                       'AT1G01030'))
# === The number of cytosine sites to generate ===
sites = 11001
# == Set a seed for pseudo-random number generation ===
set.seed(123)
alpha.ct <- 0.09
alpha.tt <- 0.2
# === Simulate samples ===
ref = simulateCounts(num.samples = 2, sites = sites, alpha = alpha.ct,
                   beta = 0.5, size = 50, theta = 4.5, sample.ids = 'R1')

# Control group
ctrl = simulateCounts(num.samples = 2, sites = sites, alpha = alpha.ct,
                       beta = 0.5, size = 50, theta = 4.5,
                       sample.ids = c('C1', 'C2'))
# Treatment group
treat = simulateCounts(num.samples = 2, sites = sites, alpha = alpha.tt,
                        beta = 0.5, size = 50, theta = 4.5,
                        sample.ids = c('T1', 'T2'))

#  === Estime Divergences ===
HD = estimateDivergence(ref = ref$R1, indiv = c(ctrl, treat),
                        Bayesian = TRUE, num.cores = 1L, percentile = 1,
                        verbose = FALSE)

nlms <- nonlinearFitDist(HD, column = 4, verbose = FALSE)

## Next, the potential signal can be estimated
PS <- getPotentialDIMP(LR = HD, nlms = nlms, div.col = 4, alpha = 0.05)

## The cutpoint estimation used to discriminate the signal from the noise
cutpoints <- estimateCutPoint(PS, control.names = c('C1', 'C2'),
                              treatment.names = c('T1', 'T2'),
                              div.col = 4, verbose = FALSE)
## DIMPs are selected using the cupoints
DIMPs <- selectDIMP(PS, div.col = 9, cutpoint = min(cutpoints$cutpoint))

## Finally DIMPs statistics genes
tv_DIMPs = getGRegionsStat(GR = DIMPs, grfeatures = genes, stat = 'sum',
                           absolute = TRUE, column = 7L)

GR_tv_DIMP = uniqueGRanges(tv_DIMPs, type = 'equal', chromosomes = '1')
colnames(mcols(GR_tv_DIMP)) <-  c('C1', 'C2', 'T1', 'T2')

res <- divTest(GR=GR_tv_DIMP, control.names =  c('C1', 'C2'),
               treatment.names = c('T1', 'T2'))

genomaths/MethylIT.utils documentation built on July 4, 2023, 12:05 a.m.