boostme: Function for training and imputing with a boostme model.
In lulizou/boostme: BoostMe - DNA methylation prediction within WGBS

Description

Uses the xgboost framework (C) Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang. Sample average feature requires at least 3 samples in the bsseq object.

Usage

boostme(bs, imputeAndReplace = TRUE, randomCpGs = FALSE,
  trainChr = "chr3", validateChr = "chr22", testChr = "chr4",
  trainSize = 1e+06, validateSize = 1e+06, testSize = 1e+06,
  minCov = 10, sampleAvg = TRUE, neighbMeth = TRUE, neighbDist = TRUE,
  featureBEDs = NULL, threads = 2, save = NULL)
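
As a rough illustration of the signature above, a call might look like the following sketch. The file name `my_bsseq.rds` is a hypothetical placeholder for a saved bsseq object with at least 3 samples, and the guard simply skips the call when the package or the file is missing. This section does not document the return value, so the result is stored without assuming its structure.

```r
# Hypothetical input path (placeholder, not part of the package).
bs_path <- "my_bsseq.rds"

if (requireNamespace("boostme", quietly = TRUE) && file.exists(bs_path)) {
  bs <- readRDS(bs_path)

  # Split by chromosome (randomCpGs = FALSE) and write metrics to a file.
  # imputeAndReplace = FALSE is assumed here to give the "dry run" behaviour
  # described in the Arguments section.
  result <- boostme::boostme(
    bs,
    imputeAndReplace = FALSE,
    randomCpGs = FALSE,
    trainChr = "chr3",
    validateChr = "chr22",
    testChr = "chr4",
    minCov = 10,
    threads = 2,
    save = "results.txt"
  )
} else {
  message("boostme or the example input is not available; skipping the call.")
}
```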

Arguments

`bs`	a bsseq object containing the methylation & coverage values as well as the features loaded into `pData(bs)`. If no features are loaded into `pData(bs)`, the model will simply use neighboring CpGs and the sample average of the other CpGs.
`randomCpGs`	boolean of whether to select a simple random sample of CpGs genome-wide. Default is FALSE. If TRUE, the trainChr, validateChr, and testChr parameters are ignored and CpGs for the training, validation, and test sets are selected at random. The size of each set can be set individually with the trainSize, validateSize, and testSize parameters (default: 1 million CpGs each). NOTE: this takes much longer than dividing by chromosome and achieves similar accuracy.
`trainChr`	which chromosome(s) to use for training. default = chr3 (approximately 1.5 million CpGs). Note that the more CpGs used for training, the more memory required to train and store the model.
`validateChr`	which chromosome(s) to use for validation. default = chr22 (approximately 550,000 CpGs).
`testChr`	which chromosome(s) to use for testing. default = chr4 (approximately 1.4 million CpGs).
`trainSize`	integer of how many CpGs to use for the training set. Default is 1 million. NOTE: only takes effect when randomCpGs = TRUE.
`validateSize`	integer of how many CpGs to use for the validation set. Default is 1 million. NOTE: only takes effect when randomCpGs = TRUE.
`testSize`	integer of how many CpGs to use for the test set. Default is 1 million. NOTE: only takes effect when randomCpGs = TRUE.
`minCov`	the minimum coverage required to consider a methylation value trainable. Default is 10 (i.e. 10 total reads at a CpG). Also used as the cutoff below which a methylation value is imputed and replaced, provided that none of the features used for that CpG are NA. E.g. if a CpG has coverage of 2, but sampleAvg = TRUE and < 2 samples have coverage >= 10 for that CpG, then that CpG's value will not be imputed and replaced.
`sampleAvg`	boolean of whether or not to include the sample average as a feature. Default is TRUE.
`neighbMeth`	boolean of whether or not to include nearest non-missing neighboring CpG methylation values. Default is TRUE.
`neighbDist`	boolean of whether or not to include nearest non-missing neighboring CpG distances. Default is TRUE.
`featureBEDs`	optional vector of paths to BED files to be included as features in the model. All columns past the third column are automatically considered to be features. If a column has multiple factors (i.e. multiple different strings), then each factor is converted to its own binary feature (1 if present, else 0).
`threads`	(optional) number of threads to use for training. default = 2
`save`	(optional) file path to save metrics to (e.g. results.txt)
`imputeAndReplace`	boolean of whether or not to impute CpG methylation values below the minCov. Default is TRUE. Set to FALSE to do a dry run and see the RMSE for each sample.
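
The `minCov` and `featureBEDs` descriptions above can be made concrete with a small base R sketch. It is only one reading of those descriptions on made-up toy data, not boostme's internal code. All variable names are hypothetical, and the "at least 2 samples with coverage >= minCov" condition is taken from the example given for `minCov`.

```r
# Toy coverage matrix (made-up values): rows are CpGs, columns are samples.
coverage <- matrix(
  c(12, 2, 15,
     3, 1, 11,
    20, 0, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("cpg", 1:3), paste0("s", 1:3))
)
minCov <- 10

# Values with coverage at or above minCov are treated as trainable.
trainable <- coverage >= minCov

# Values below minCov are candidates for imputation.
candidate <- coverage < minCov

# Assumed rule when sampleAvg = TRUE: a candidate is only imputed if at
# least 2 samples have coverage >= minCov at that CpG.
n_covered <- rowSums(trainable)
imputable <- candidate & (n_covered >= 2)

# Toy BED-like table: the 4th column holds a string-valued feature.
bed <- data.frame(
  chr   = c("chr1", "chr1", "chr2"),
  start = c(100, 500, 900),
  end   = c(200, 600, 1000),
  state = c("enhancer", "promoter", "enhancer"),
  stringsAsFactors = FALSE
)

# Each distinct string becomes its own binary column (1 if present, else 0).
states <- unique(bed$state)
onehot <- sapply(states, function(s) as.integer(bed$state == s))

print(imputable)
print(onehot)
```

In this toy data only the low-coverage values in sample `s2` at `cpg1` and `cpg3` would qualify, since `cpg2` has a single sample at or above `minCov`.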