boostme: Function for training and imputing with a boostme model.

View source: R/boostme.R

boostme R Documentation

Function for training and imputing with a boostme model.

Description

Uses the xgboost framework (C) Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang. The sample average feature requires at least 3 samples in the bsseq object.

Usage

boostme(bs, imputeAndReplace = TRUE, randomCpGs = FALSE,
  trainChr = "chr3", validateChr = "chr22", testChr = "chr4",
  trainSize = 1e+06, validateSize = 1e+06, testSize = 1e+06,
  minCov = 10, sampleAvg = TRUE, neighbMeth = TRUE, neighbDist = TRUE,
  featureBEDs = NULL, threads = 2, save = NULL)

Arguments

bs

a bsseq object containing the methylation and coverage values as well as the features loaded into pData(bs). If no features are loaded into pData(bs), the model will simply use neighboring CpGs and the sample average of the other CpGs.

randomCpGs

boolean of whether or not to select a simple random sample of CpGs genome-wide. Default is FALSE. If TRUE, the trainChr, validateChr, and testChr parameters are ignored and CpGs for the training, validation, and test sets are selected at random. The size of each set can be set individually with the trainSize, validateSize, and testSize parameters. Defaults are 1 million CpGs each. NOTE: this takes much longer than simply dividing by chromosome, and achieves similar accuracy.

trainChr

which chromosome(s) to use for training. default = chr3 (approximately 1.5 million CpGs). Note that the more CpGs used for training, the more memory required to train and store the model.

validateChr

which chromosome(s) to use for validation. default = chr22 (approximately 550,000 CpGs).

testChr

which chromosome(s) to use for testing. default = chr4 (approximately 1.4 million CpGs).

trainSize

integer of how many CpGs to use for the training set. Default is 1 million. NOTE: only takes effect when randomCpGs = TRUE.

validateSize

integer of how many CpGs to use for the validation set. Default is 1 million. NOTE: only takes effect when randomCpGs = TRUE.

testSize

integer of how many CpGs to use for the test set. Default is 1 million. NOTE: only takes effect when randomCpGs = TRUE.
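The two ways of choosing the training, validation, and test sets can be sketched in base R as follows. This is only an illustration on simulated chromosome labels, not the package's own code; the variable names are hypothetical, and drawing the three random sets without overlap is an assumption.

```r
set.seed(42)

# Simulated chromosome label for each of 5000 CpGs (illustrative data only).
n_cpgs <- 5000L
chr <- sample(c("chr3", "chr4", "chr22"), n_cpgs, replace = TRUE)

# Split by chromosome (randomCpGs = FALSE): one chromosome per set.
by_chr <- list(
  train    = which(chr == "chr3"),
  validate = which(chr == "chr22"),
  test     = which(chr == "chr4")
)

# Simple random sample (randomCpGs = TRUE): set sizes are chosen directly.
# Assumption: the three sets are drawn without overlap.
sizes <- c(train = 1000L, validate = 1000L, test = 1000L)
picked <- sample.int(n_cpgs, sum(sizes))
by_random <- split(picked, rep(names(sizes), times = sizes))

lengths(by_chr)
lengths(by_random)
```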

minCov

the minimum coverage required to consider a methylation value trainable. Default is 10 (i.e. 10 total reads at a CpG). Also used as the cutoff below which a methylation value is imputed and replaced, provided that none of the features used for that CpG are NA. E.g. if a CpG has coverage of 2, but sampleAvg = TRUE and < 2 samples have coverage >= 10 for that CpG, then that CpG's value will not be imputed and replaced.
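The cutoff logic described above can be illustrated with a small base R sketch. It is not taken from the package source: the toy matrices and variable names are hypothetical, and the sample average is assumed to be the mean over the other samples whose coverage is at least minCov, set to NA when fewer than 2 such samples exist.

```r
minCov <- 10

# Toy coverage and methylation matrices: 3 CpGs (rows) x 4 samples (columns).
cov <- matrix(c( 2, 12, 15, 30,
                 3, 11,  4,  1,
                20, 25, 18, 40),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("cpg", 1:3), paste0("s", 1:4)))
meth <- matrix(c(0.5, 0.8, 0.9, 0.7,
                 0.1, 0.2, 0.3, 0.4,
                 0.6, 0.6, 0.5, 0.7),
               nrow = 3, byrow = TRUE, dimnames = dimnames(cov))

# Values with enough coverage are usable for training.
trainable <- cov >= minCov

# Assumed sample average feature: mean methylation of the other samples
# with coverage >= minCov, NA when fewer than 2 such samples are available.
sample_avg <- matrix(NA_real_, nrow(cov), ncol(cov), dimnames = dimnames(cov))
for (i in seq_len(nrow(cov))) {
  for (j in seq_len(ncol(cov))) {
    others <- setdiff(seq_len(ncol(cov)), j)
    ok <- others[trainable[i, others]]
    if (length(ok) >= 2) sample_avg[i, j] <- mean(meth[i, ok])
  }
}

# A value is a candidate for imputation when its coverage is below minCov
# and the feature it depends on is not NA.
imputable <- !trainable & !is.na(sample_avg)
imputable
```

In this toy data, cpg1 in sample s1 has coverage 2 and three well-covered neighbors among the other samples, so it is a candidate; cpg2 in s1 is not, because only one other sample reaches the cutoff.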

sampleAvg

boolean of whether or not to include the sample average as a feature. Default is TRUE.

neighbMeth

boolean of whether or not to include nearest non-missing neighboring CpG methylation values. Default is TRUE.

neighbDist

boolean of whether or not to include nearest non-missing neighboring CpG distances. Default is TRUE.
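As a rough illustration of the two neighbor features, the base R sketch below finds, for each CpG, the nearest non-missing CpG on each side and records its methylation value and distance. The helper name `nearest_neighbors` and its column names are hypothetical, positions are assumed to be sorted on a single chromosome, and using exactly one upstream and one downstream neighbor is an assumption rather than a statement about the package internals.

```r
# Hypothetical helper: nearest non-missing neighbor on each side of every CpG.
# Assumes `pos` is sorted and all CpGs lie on the same chromosome.
nearest_neighbors <- function(pos, meth) {
  n <- length(pos)
  out <- data.frame(upMeth   = rep(NA_real_, n),
                    upDist   = rep(NA_integer_, n),
                    downMeth = rep(NA_real_, n),
                    downDist = rep(NA_integer_, n))
  observed <- which(!is.na(meth))
  for (i in seq_len(n)) {
    up   <- observed[observed < i]
    down <- observed[observed > i]
    if (length(up) > 0) {
      k <- max(up)                      # closest observed CpG upstream
      out$upMeth[i] <- meth[k]
      out$upDist[i] <- pos[i] - pos[k]
    }
    if (length(down) > 0) {
      k <- min(down)                    # closest observed CpG downstream
      out$downMeth[i] <- meth[k]
      out$downDist[i] <- pos[k] - pos[i]
    }
  }
  out
}

pos  <- c(100L, 150L, 400L, 420L, 900L)
meth <- c(0.9, NA, 0.2, NA, 0.7)
nb <- nearest_neighbors(pos, meth)
nb
```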

featureBEDs

optional vector of paths to BED files to be included as features in the model. All columns past the third column are automatically considered to be features. If a column has multiple factors (i.e. multiple different strings), then each factor is converted to its own binary feature (1 if present, else 0).
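The conversion of a multi-factor column into binary features can be sketched in base R as below. The toy BED-like table and the column name `state` are hypothetical, and the code only mirrors the behavior described above; it is not the package's implementation.

```r
# Toy BED-like table: three standard columns plus one feature column.
bed <- data.frame(
  chr   = c("chr1", "chr1", "chr2"),
  start = c(100L, 500L, 200L),
  end   = c(200L, 600L, 300L),
  state = c("enhancer", "promoter", "enhancer"),
  stringsAsFactors = FALSE
)

# Each distinct string in the feature column becomes its own 0/1 column.
levels_found <- sort(unique(bed$state))
binary <- sapply(levels_found, function(lv) as.integer(bed$state == lv))
binary
```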

threads

(optional) number of threads to use for training. default = 2

save

(optional) file path to save metrics to (e.g. results.txt)

imputeAndReplace

boolean of whether or not to impute CpG methylation values below the minCov. Default is TRUE. Set to FALSE to do a dry run and see the RMSE for each sample.

Value

a matrix that contains the imputed values (if imputeAndReplace is TRUE). Otherwise nothing is returned; the RMSE for each sample is printed instead (dry run).
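A call might look like the following sketch. It assumes the boostme and bsseq packages are installed and that `bs` is a bsseq object with at least 3 samples; the wrapper name `run_boostme_example` and the metrics path are hypothetical. Defining the wrapper runs nothing; boostme is only invoked once the wrapper is called with a real object.

```r
# Hypothetical wrapper; `bs` must be a bsseq object supplied by the caller.
run_boostme_example <- function(bs, metrics_file = "results.txt") {
  # Dry run: no matrix is returned, the RMSE for each sample is printed.
  boostme::boostme(bs, imputeAndReplace = FALSE,
                   trainChr = "chr3", validateChr = "chr22", testChr = "chr4",
                   minCov = 10, threads = 2, save = metrics_file)

  # Full run: returns a matrix containing the imputed values.
  boostme::boostme(bs, imputeAndReplace = TRUE,
                   trainChr = "chr3", validateChr = "chr22", testChr = "chr4",
                   minCov = 10, threads = 2, save = metrics_file)
}
```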


lulizou/boostme documentation built on March 16, 2023, 7:35 a.m.