remp: Repetitive element methylation prediction

remp R Documentation

Repetitive element methylation prediction

Description

remp predicts genome-wide methylation levels of locus-specific repetitive elements (RE). Two major RE types in human, Alu element (Alu) and LINE-1 (L1), are available.

Usage

remp(
  methyDat = NULL,
  REtype = c("Alu", "L1", "ERV"),
  Seq.GR = NULL,
  parcel = NULL,
  work.dir = tempdir(),
  win = 1000,
  method = c("rf", "xgbTree", "svmLinear", "svmRadial", "naive"),
  autoTune = TRUE,
  param = NULL,
  seed = NULL,
  ncore = NULL,
  BPPARAM = NULL,
  verbose = FALSE
)

Arguments

methyDat

A RatioSet, GenomicRatioSet, DataFrame, data.table, data.frame, or matrix of Illumina BeadChip methylation data (450k or EPIC array) or Illumina methylation percentage estimates by sequencing. See Details. Alternatively, the user can supply a pre-built data template (see rempTemplate) for remp to carry out the prediction. With a template supplied as methyDat, REtype, parcel, and work.dir can be skipped.

REtype

Type of RE. Currently "Alu", "L1", and "ERV" are supported. If NULL, the type of RE will be extracted from parcel.

Seq.GR

A GRanges object containing genomic locations of the CpGs profiled by sequencing platforms. This parameter should not be NULL if the input methylation data methyDat are obtained by sequencing. Note that the genomic location can be in either hg19 or hg38 build. See details in initREMP.

parcel

An REMParcel object containing necessary data to carry out the prediction. If NULL, REtype must specify a type of RE so that the function can search the .rds data file in work.dir exported by initREMP (with export = TRUE) or saveParcel.

work.dir

Path to the directory where the annotation data generated by initREMP are saved. Valid when the argument parcel is missing. If not specified, temporary directory tempdir() will be used. If specified, the directory path has to be the same as the one specified in initREMP or in saveParcel.

win

An integer specifying window size to confine the upstream and downstream flanking region centered on the predicted CpG in RE for prediction. Default = 1000. See Details.

method

Name of model/approach for prediction. Currently "rf" (Random Forest), "xgbTree" (Extreme Gradient Boosting), "svmLinear" (SVM with linear kernel), "svmRadial" (SVM with radial kernel), and "naive" (carrying over methylation values of the closest CpG site) are available. Default = "rf" (Random Forest). See Details.

autoTune

Logical parameter. If TRUE, a 3-time repeated 5-fold cross validation will be performed to determine the best model parameter. If FALSE, the param option (see below) must be specified. Default = TRUE. Auto-tuning is disabled when Random Forest is used. See Details.

param

A list specifying fixed model tuning parameter(s) (not applicable for Random Forest, see Details). For Extreme Gradient Boosting, param list must contain '$nrounds', '$max_depth', '$eta', '$gamma', '$colsample_bytree', '$min_child_weight', and '$subsample'. See xgbTree in package caret. For SVM, param list must contain '$C' (cost) for linear kernel or '$sigma' and '$C' for radial basis function kernel. This parameter is valid only when autoTune = FALSE.

seed

Random seed for Random Forest model for reproducible prediction results. Default is NULL, which generates a seed.

ncore

Number of cores used for parallel computing. By default, the maximum number of cores available on the machine will be used. If ncore = 1, no parallel computing is allowed.

BPPARAM

An optional BiocParallelParam instance determining the parallel back-end to be used during evaluation. If not specified, default back-end in the machine will be used.

verbose

Logical parameter. Should the function be verbose?
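
As a minimal sketch of the param argument described above, the lists below show the element names this page requires for each method when autoTune = FALSE. The numeric values are illustrative assumptions only, not recommended settings, and has_required_param is a hypothetical helper that is not part of REMP.

```r
# Illustrative values only; they are assumptions, not recommended settings.

# Extreme Gradient Boosting: all seven elements named in 'param' are required.
param_xgb <- list(
  nrounds = 100,
  max_depth = 3,
  eta = 0.1,
  gamma = 0,
  colsample_bytree = 0.8,
  min_child_weight = 1,
  subsample = 0.8
)

# SVM with linear kernel: cost only.
param_svm_linear <- list(C = 1)

# SVM with radial basis function kernel: sigma and cost.
param_svm_radial <- list(sigma = 0.01, C = 1)

# Hypothetical helper (not part of REMP): check that a param list carries
# the element names this page lists for the chosen method.
required_param <- list(
  xgbTree = c("nrounds", "max_depth", "eta", "gamma",
              "colsample_bytree", "min_child_weight", "subsample"),
  svmLinear = "C",
  svmRadial = c("sigma", "C")
)
has_required_param <- function(param, method) {
  all(required_param[[method]] %in% names(param))
}

has_required_param(param_xgb, "xgbTree")
has_required_param(param_svm_linear, "svmRadial")

# The list would then be passed together with autoTune = FALSE, e.g.:
# remp(GM12878_450k, REtype = "Alu", parcel = remparcel,
#      method = "xgbTree", autoTune = FALSE, param = param_xgb, ncore = 1)
```

The call to remp is left as a comment because it needs the data objects built in the Examples section.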

Details

Before running remp, make sure the methylation data have gone through proper quality control, background correction, and normalization. Both beta values and M values are allowed. Rows represent probes and columns represent samples. For array data, row names must specify the Illumina probe ID (e.g. cg00000029). For sequencing data, provide the genomic locations of the CpGs in a GRanges object and pass it through the Seq.GR parameter.

win = 1000 is based on previous findings showing that neighboring CpGs are more likely to be co-modified within 1000 bp. A narrower window can slightly improve prediction accuracy at the cost of fewer predicted RE. A window size greater than 1000 is not recommended, as the machine learning models would not learn much useful information for prediction and would instead pick up noise.

The Random Forest model (method = "rf") is recommended, as it offers more accurate prediction and also enables the prediction reliability functionality. Prediction reliability is estimated by the conditional standard deviation using Quantile Regression Forest. Note that if parallel computing is allowed, parallel Random Forest (powered by package ranger) will be used automatically. The performance of the Random Forest model is often relatively insensitive to the choice of mtry. Therefore, auto-tuning is turned off for Random Forest and mtry is set to one third of the total number of predictors. For SVM, if autoTune = TRUE, the preset tuning parameter search grid can be accessed and modified using remp_options.
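
To make the role of win and the "naive" method more concrete, here is a toy base R sketch. It is not REMP's internal code: naive_carry_over is a hypothetical helper, the positions and beta values are made up, and it assumes that win is the maximum distance allowed on either side of the RE CpG and that all positions lie on one chromosome.

```r
# Hypothetical helper (not part of REMP): for each RE CpG position, look for
# profiled CpGs no farther than `win` bp away and carry over the methylation
# value of the closest one. Returns NA when no profiled CpG is in the window.
naive_carry_over <- function(re_pos, probe_pos, probe_beta, win = 1000) {
  vapply(re_pos, function(p) {
    d <- abs(probe_pos - p)          # distance to every profiled CpG
    in_win <- d <= win               # keep only CpGs inside the window
    if (!any(in_win)) return(NA_real_)
    probe_beta[in_win][which.min(d[in_win])]
  }, numeric(1))
}

# Made-up positions (bp) and beta values for three profiled CpGs.
probe_pos  <- c(1000, 1800, 5000)
probe_beta <- c(0.80, 0.60, 0.10)

# Made-up positions of four CpGs located in RE.
re_pos <- c(1200, 1700, 3200, 5900)

wide   <- naive_carry_over(re_pos, probe_pos, probe_beta, win = 1000)
narrow <- naive_carry_over(re_pos, probe_pos, probe_beta, win = 500)

wide    # third RE CpG has no neighbor within 1000 bp
narrow  # a narrower window leaves more RE CpGs without a prediction
```

In this toy setting the narrower window yields fewer predicted RE CpGs, which mirrors the trade-off described in Details.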

Value

An REMProduct object containing predicted RE methylation results.

See Also

See initREMP to prepare necessary annotation database before running remp.

Examples

# Obtain example Illumina data (450k)
if (!exists("GM12878_450k")) 
  GM12878_450k <- getGM12878("450k")

# Make sure you have run 'initREMP' first. See ?initREMP.
if (!exists("remparcel")) {
  data(Alu.hg19.demo)
  remparcel <- initREMP(arrayType = "450k",
                        REtype = "Alu",
                        annotation.source = "AH",
                        genome = "hg19",
                        RE = Alu.hg19.demo,
                        ncore = 1,
                        verbose = TRUE)
}

# With data template pre-built. See ?rempTemplate.
if (!exists("template")) 
  template <- rempTemplate(GM12878_450k, 
                           parcel = remparcel, 
                           win = 1000, 
                           verbose = TRUE)

# Run remp with pre-built template:
remp.res <- remp(template, ncore = 1)

# Or run remp without pre-built template (identical results):
## Not run: 
  remp.res <- remp(GM12878_450k, 
                   REtype = "Alu", 
                   parcel = remparcel, 
                   ncore = 1,
                   verbose = TRUE)

## End(Not run)

remp.res
details(remp.res)
rempB(remp.res) # Methylation data (beta value)

# Extract CpG location information. 
# (This accessor is inherited from class 'RangedSummarizedExperiment')
rowRanges(remp.res)

# RE annotation information
rempAnnot(remp.res)

# Add gene annotation
remp.res <- decodeAnnot(remp.res, type = "symbol")
rempAnnot(remp.res)

# (Recommended) Trim off less reliable prediction
remp.res <- rempTrim(remp.res)

# Obtain RE-level methylation (aggregate by mean)
remp.res <- rempAggregate(remp.res)
rempB(remp.res) # Methylation data (beta value)

# Extract RE location information
rowRanges(remp.res)

# Density plot across predicted RE
remplot(remp.res)


YinanZheng/REMP documentation built on May 14, 2022, 5:58 p.m.