runSOMNiBUS: Wrapper function running the smoothed-EM algorithm to...

View source: R/runSOMNiBUS.R

runSOMNiBUSR Documentation

Wrapper function running the smoothed-EM algorithm to estimate covariate effects and test regional association in Bisulfite Sequencing-derived methylation data

Description

This function splits the methylation data into regions (according to different approaches) and, for each region, fits a (dispersion-adjusted) binomial regression model to regional methylation data, and reports the estimated smooth covariate effects and regional p-values for the test of DMRs (differentially methylation regions). Over or under dispersion across loci is accounted for in the model by the combination of a multiplicative dispersion parameter (or scale parameter) and a sample-specific random effect.

This method can deal with outcomes, i.e. the number of methylated reads in a region, that are contaminated by known false methylation calling rate (p0) and false non-methylation calling rate (1-p1).

The covariate effects are assumed to smoothly vary across genomic regions. In order to estimate them, the algorithm first represents the functional parameters by a linear combination of a set of restricted cubic splines (with dimension n.k), and a smoothness penalization term which depends on the smoothing parameters lambdas is also added to control smoothness. The estimation is performed by an iterated EM algorithm. Each M step constitutes an outer Newton's iteration to estimate smoothing parameters lambdas and an inner P-IRLS iteration to estimate spline coefficients alpha for the covariate effects. Currently, the computation in the M step depends the implementation of gam() in package mgcv.

Usage

runSOMNiBUS(
  dat,
  split = list(approach = "region"),
  min.cpgs = 50,
  max.cpgs = 2000,
  n.k,
  p0 = 0.003,
  p1 = 0.9,
  Quasi = TRUE,
  epsilon = 10^(-6),
  epsilon.lambda = 10^(-3),
  maxStep = 200,
  binom.link = "logit",
  method = "REML",
  covs = NULL,
  RanEff = TRUE,
  reml.scale = FALSE,
  scale = -2,
  verbose = TRUE
)

Arguments

dat

a data frame with rows as individual CpGs appearing in all the samples. The first 4 columns should contain the information of Meth_Counts (methylated counts), Total_Counts (read depths), Position (Genomic position for the CpG site) and ID (sample ID). The covariate information, such as disease status or cell type composition, are listed in column 5 and onwards.

split

this list must contain at least the element approach which corresponds to the partitioning approach used to split the data into independent regions. The partitioning methods available are:

  • "region" (partitioning based on the spacing of CpGs),

  • "density" (partitioning based on CpG density),

  • "chromatin" (partitioning based on chromatin states),

  • "gene" (partitioning based on gene regions),

  • "granges" (partitioning based on user-specific annotations provided as a GenomicRanges object),

  • "bed" (partitioning based on user-specific annotations provided in a BED file).

This list should also contain additional parameters specific to each partitioning approach (see the documentation of each approach for details).

min.cpgs

positive integer defining the minimum number of CpGs within a region for the algorithm to perform optimally. The default value is 50.

max.cpgs

positive integer defining the maximum number of CpGs within a region for the algorithm to perform optimally. The default value is 2000.

n.k

a vector of basis dimensions for the intercept and individual covariates. n.k specifies an upper limit of the degrees of each functional parameters. The length of n.k should equal to the number of covariates plus 1 (for the intercept)). We recommend basis dimensions n.k, approximately equal to the number of unique CpGs in the region divided by 20. This parameter will be computed automatically, when several regions are generated by the partitioning function.

p0

the probability of observing a methylated read when the underlying true status is unmethylated. p0 is the rate of false methylation calls, i.e. false positive rate.

p1

the probability of observing a methylated read when the underlying true status is methylated. 1-p1 is the rate of false non-methylation calls, i.e. false negative rate.

Quasi

whether a Quasi-likelihood estimation approach will be used; in other words, whether a multiplicative dispersion is added in the model or not.

epsilon

numeric; stopping criterion for the closeness of estimates of spline coefficients from two consecutive iterations.

epsilon.lambda

numeric; stopping criterion for the closeness of estimates of smoothing parameter lambda from two consecutive iterations.

maxStep

the algorithm will step if the iteration steps exceed maxStep.

binom.link

the link function used in the binomial regression model; the default is the logit link.

method

the method used to estimate the smoothing parameters. The default is the 'REML' method which is generally better than prediction based criterion GCV.cp.

covs

a vector of covariate names. The covariates with names in covs will be included in the model and their covariate effects will be estimated. The default is to fit all covariates in dat

RanEff

whether sample-level random effects are added or not

reml.scale

whether a REML-based scale (dispersion) estimator is used. The default is Fletcher-based estimator.

scale

negative values mean scale parameter should be estimated; if a positive value is provided, a fixed scale will be used.

verbose

logical indicates if the algorithm should provide progress report information. The default value is TRUE.

Value

This function returns a list of models (one by independent region) including objects:

  • est: estimates of the spline basis coefficients alpha

  • lambda: estimates of the smoothing parameters for each functional parameters

  • est.pi: predicted methylation levels for each row in the input data

  • ite.points: estimates of est, lambda at each EM iteration

  • cov1: estimated variance-covariance matrix of the basis coefficients alphas

  • reg.out: regional testing output obtained using Fletcher-based dispersion estimate; an additional 'ID' row would appear if RanEff is TRUE

  • reg.out.reml.scale: regional testing output obtained using REML-based dispersion estimate;

  • reg.out.gam: regional testing output obtained using (Fletcher-based) dispersion estimate from mgcv package;

  • phi_fletcher: Fletcher-based estimate of the (multiplicative) dispersion parameter;

  • phi_reml: REML-based estimate of the (multiplicative) dispersion parameter;

  • phi_gam: Estimated dispersion parameter reported by mgcv;

  • SE.out: a matrix of the estimated pointwise Standard Errors (SE); number of rows are the number of unique CpG sites in the input data and the number of columns equal to the total number of covariates fitted in the model (the first one is the intercept);

  • SE.out.REML.scale: a matrix of the estimated pointwise Standard Errors (SE); the SE calculated from the REML-based dispersion estimates

  • uni.pos: the genomic postions for each row of CpG sites in the matrix SE.out;

  • Beta.out: a matrix of the estimated covariate effects beta(t), where t denotes the genomic positions;

  • ncovs: number of functional paramters in the model (including the intercept);

  • sigma00: estimated variance for the random effect if RanEff is TRUE; NA if RanEff is FALSE.

Author(s)

Audrey Lemaçon

Examples

#------------------------------------------------------------#
data(RAdat)
RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ])
outs <- runSOMNiBUS(
  dat=RAdat.f, split = list(approach = "region", gap = 1e6), min.cpgs = 5,
  n.k = rep(5,3), p0 = 0.003, p1 = 0.9
)


kaiqiong/SOMNiBUS documentation built on Feb. 24, 2023, 5:38 a.m.