BioMM: BioMM end-to-end prediction
In transbioZI/BioMM: BioMM: Biological-informed Multi-stage Machine learning framework for phenotype prediction using omics data

BioMM

R Documentation

Description

The BioMM framework uses two-stage machine learning models to integrate prior biological knowledge for end-to-end phenotype prediction.

Usage

BioMM(
  trainData,
  testData,
  pathlistDB,
  featureAnno,
  restrictUp,
  restrictDown,
  minPathSize,
  supervisedStage1 = TRUE,
  typePCA,
  resample1 = "BS",
  dataMode = "allTrain",
  repeatA1 = 100,
  repeatA2 = 1,
  repeatB1 = 20,
  repeatB2 = 1,
  nfolds = 10,
  FSmethod1,
  FSmethod2,
  cutP1,
  cutP2,
  fdr2,
  FScore = MulticoreParam(),
  classifier,
  predMode,
  paramlist,
  innerCore = MulticoreParam()
)

Arguments

`trainData`	The input training dataset. The first column is the label or the output. For binary classes, 0 and 1 are used to indicate the class member.
`testData`	The input test dataset. The first column is the label or the output. For binary classes, 0 and 1 are used to indicate the class member.
`pathlistDB`	A list of pathways with pathway IDs and their corresponding genes ('entrezID' is used). This is only used for pathway-based stratification (i.e., only when `stratify` is 'pathway').
`featureAnno`	The annotation data stored in a data.frame for probe mapping. It must have at least two columns named 'ID' and 'entrezID'. If it's NULL, then the input probe is from the transcriptomic data. (Default: NULL)
`restrictUp`	The upper-bound of the number of probes or genes in each biological stratified block.
`restrictDown`	The lower-bound of the number of probes or genes in each biological stratified block.
`minPathSize`	The minimum pathway size after mapping your own data to the GO database. This is only used for pathway-based stratification (i.e., only when `stratify` is 'pathway').
`supervisedStage1`	A logical value. If TRUE, then supervised learning models are applied; if FALSE, unsupervised learning.
`typePCA`	The type of PCA. Available options are c('regular', 'sparse').
`resample1`	The resampling methods at stage-1. Valid options are 'CV' and 'BS'. 'CV' for cross validation and 'BS' for bootstrapping resampling. The default is 'BS'.
`dataMode`	The input training data mode for model training. It is used only if 'testData' is present. It can be a subset of the whole training data or the entire training data. 'subTrain' selects a subset and 'allTrain' the entire training dataset.
`repeatA1`	The number of repeats N used during the resampling procedure. Repeated cross validation or multiple bootstrapping is performed if N >= 2. One can choose 10 repeats for 'CV' and 100 repeats for 'BS'.
`repeatA2`	The number of repeats N used during resampling prediction. The default is 1 for 'CV'.
`repeatB1`	The number of repeats N used for generating stage-2 test data prediction scores. The default is 20.
`repeatB2`	The number of repeats N used for test data prediction. The default is 1.
`nfolds`	The number of folds used for cross validation. The default is 10.
`FSmethod1`	Feature selection methods at stage-1. Available options are c(NULL, 'positive', 'wilcox.test', 'cor.test', 'chisq.test', 'posWilcox').
`FSmethod2`	Feature selection methods at stage-2. Features that are positively associated with the outcome will be used.
`cutP1`	The cutoff used for p value thresholding at stage-1. Commonly used cutoffs are c(0.5, 0.1, 0.05, 0.01, etc). If "FSmethod1" is NULL, then no cutoff is applied. If the feature selection method is "posTopCor", the cutoff is instead the number of most correlated features to keep, with 'fdr' = NULL.
`cutP2`	The cutoff used for p value thresholding at stage-2. Commonly used cutoffs are c(0.5, 0.1, 0.05, 0.01, etc). If "FSmethod2" is NULL, then no cutoff is applied. If the feature selection method is "posTopCor", the cutoff is instead the number of most correlated features to keep, with 'fdr' = NULL.
`fdr2`	Multiple testing correction method at stage-2. Available options are c(NULL, 'fdr', 'BH', 'holm', etc). See also `p.adjust`. The default is NULL. This option is useful particularly when large sets of pathways are investigated.
`FScore`	The number of cores used for feature selection.
`classifier`	Machine learning classifiers at both stages. Available options are c('randForest', 'SVM', 'glmnet').
`predMode`	The prediction mode at both stages. Available options are c('probability', 'classification', 'regression').
`paramlist`	A list of model parameters at both stages. The set of parameters is different for each classifier. For details, see the parameters implemented for each individual classifier, e.g., 'baseRandForest()', 'baseSVM()', and 'baseGLMnet()'.
`innerCore`	The number of cores used for computation. It needs to be reconciled with "FScore" depending on the number of cores available.
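
The input layout described by `trainData`, `featureAnno`, and `pathlistDB` can be illustrated with a small toy sketch in base R. All probe IDs, gene IDs, and pathway names below are made up for illustration, and representing `pathlistDB` as a named list of character vectors is an assumption based on the argument description, not a statement about what BioMM checks internally.

```r
# Toy inputs in the layout described above; every ID and value is hypothetical.
set.seed(1)
nSamples <- 20
probeIDs <- paste0("cg", sprintf("%04d", 1:6))

# trainData: the first column is the 0/1 label, the remaining columns are features.
features <- matrix(rnorm(nSamples * length(probeIDs)), nrow = nSamples,
                   dimnames = list(NULL, probeIDs))
trainData <- data.frame(label = rep(c(0, 1), each = nSamples / 2), features)

# featureAnno: a data.frame with at least the columns 'ID' and 'entrezID'.
featureAnno <- data.frame(ID = probeIDs,
                          entrezID = c("101", "101", "102", "103", "103", "104"),
                          stringsAsFactors = FALSE)

# pathlistDB (assumed shape): pathway ID -> character vector of Entrez gene IDs.
pathlistDB <- list(pathway1 = c("101", "102"),
                   pathway2 = c("103", "104"))

# Probes that fall into each pathway according to the annotation table.
probesPerPathway <- lapply(pathlistDB, function(genes) {
    featureAnno$ID[featureAnno$entrezID %in% genes]
})
```

Inspecting `probesPerPathway` before a run is a quick way to see how many probes each block would contain, which is the quantity that `restrictUp` and `restrictDown` bound.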

Details

Stage-2 training data can be learned using either bootstrapping or cross-validation resampling in the supervised learning setting. Stage-2 test data are learned via independent test set prediction.
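
The two-stage idea can be sketched in base R as follows. This is only an illustration of the general stacking scheme under simplifying assumptions: it uses logistic regression (`glm`) as the base learner, plain k-fold cross validation at stage-1, and two hypothetical feature blocks. It is not BioMM's implementation, and none of the variable names below come from the package.

```r
set.seed(42)
n <- 60
y <- rep(c(0, 1), each = n / 2)
# Two hypothetical feature blocks (e.g. pathways) with three features each.
x <- matrix(rnorm(n * 6), nrow = n, dimnames = list(NULL, paste0("f", 1:6)))
x[, 1] <- x[, 1] + y                      # make the first block informative
blocks <- list(blockA = c("f1", "f2", "f3"), blockB = c("f4", "f5", "f6"))
dat <- data.frame(y = y, x)

nfolds <- 5
foldID <- sample(rep(seq_len(nfolds), length.out = n))

# Stage 1: one out-of-fold predicted score per sample for a given block.
stage1Score <- function(featNames) {
    score <- numeric(n)
    form <- reformulate(featNames, response = "y")
    for (k in seq_len(nfolds)) {
        held <- foldID == k
        fit <- glm(form, data = dat[!held, ], family = binomial())
        score[held] <- predict(fit, newdata = dat[held, ], type = "response")
    }
    score
}
# Stage-2 training data: samples in rows, one score column per block.
stage2Train <- data.frame(y = y, sapply(blocks, stage1Score))

# Stage 2: a second model fitted on the block-level scores.
fit2 <- glm(y ~ ., data = stage2Train, family = binomial())

# Stage-2 test data: block models fitted on all training samples score new samples.
newX <- data.frame(matrix(rnorm(10 * 6), nrow = 10,
                          dimnames = list(NULL, paste0("f", 1:6))))
stage2Test <- sapply(blocks, function(featNames) {
    fit <- glm(reformulate(featNames, response = "y"), data = dat,
               family = binomial())
    predict(fit, newdata = newX, type = "response")
})
testScore <- predict(fit2, newdata = data.frame(stage2Test), type = "response")
```

Using out-of-fold scores for the training samples keeps the stage-2 model from seeing predictions made on data the stage-1 models were fitted to, which is the concern raised in the Perlich and Swirszcz reference.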

Value

The CV or BS predicted scores for the training data and, if testData is given, the predicted scores for the test set.

References

Chen, J., & Schwarz, E. (2017). BioMM: Biologically-informed Multi-stage Machine learning for identification of epigenetic fingerprints. arXiv preprint arXiv:1712.00336.

Perlich, C., & Swirszcz, G. (2011). On cross-validation and stacking: Building seemingly predictive models on random data. ACM SIGKDD Explorations Newsletter, 12(2), 11-15.

Examples

 
## Load the example methylation data shipped with the package
methylfile <- system.file('extdata', 'methylData.rds', package='BioMM')
methylData <- readRDS(methylfile)
testData <- NULL
## Annotation file (probe-to-gene mapping)
probeAnnoFile <- system.file('extdata', 'cpgAnno.rds', package='BioMM')
probeAnno <- readRDS(file=probeAnnoFile)
## Pathway database; keep the first 100 pathways
golist <- readRDS(system.file("extdata", "goDB.rds", package="BioMM"))
pathlistDB <- golist[1:100]
## Model settings
supervisedStage1 <- TRUE
classifier <- 'randForest'
predMode <- 'classification'
paramlist <- list(ntree=300, nthreads=30)
library(BiocParallel)
library(ranger)
## Parallel back-ends passed to FScore (param1) and innerCore (param2)
param1 <- MulticoreParam(workers = 2)
param2 <- MulticoreParam(workers = 20)
## Not Run 
## result <- BioMM(trainData=methylData, testData=NULL,
##                 pathlistDB, featureAnno=probeAnno, 
##                 restrictUp=200, restrictDown=10, minPathSize=10, 
##                 supervisedStage1, typePCA='regular', 
##                 resample1='BS', dataMode="allTrain",
##                 repeatA1=20, repeatA2=1, repeatB1=20, repeatB2=1, 
##                 nfolds=10, FSmethod1=NULL, FSmethod2=NULL, 
##                 cutP1=0.1, cutP2=0.1, fdr2=NULL, FScore=param1, 
##                 classifier, predMode, paramlist, innerCore=param2)
## if (is.null(testData)) {
##     predY <- result 
##     trainDataY <- methylData[,1]
##     metricCV <- getMetrics(dataY = trainDataY, predY)
##     message("Cross-validation prediction performance:")
##     print(metricCV)
## } else if (!is.null(testData)){
##     trainDataY <- methylData[,1]
##     testDataY <- testData[,1]
##     cvYscore <- result[[1]]
##     testYscore <- result[[2]] 
##     metricCV <- getMetrics(dataY = trainDataY, cvYscore)
##     metricTest <- getMetrics(dataY = testDataY, testYscore)
##     message("Cross-validation performance:")
##     print(metricCV)
##     message("Test set prediction performance:")
##     print(metricTest)
## }
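
The interplay of a p value cutoff such as `cutP2` with a multiple testing correction such as `fdr2` can be illustrated independently of BioMM. The sketch below assumes a two-sided Wilcoxon rank-sum test per pathway and simulated scores; the names `scores`, `cutP`, and `selected` are hypothetical, and the package's own selection methods (for example those restricted to positively associated features) may behave differently.

```r
set.seed(7)
n <- 40
y <- rep(c(0, 1), each = n / 2)
# Hypothetical stage-2 scores for five pathways; only 'path1' carries signal.
scores <- matrix(rnorm(n * 5), nrow = n,
                 dimnames = list(NULL, paste0("path", 1:5)))
scores[, "path1"] <- scores[, "path1"] + 3 * y

# One two-sided Wilcoxon rank-sum test per pathway.
pvals <- apply(scores, 2, function(s) wilcox.test(s[y == 1], s[y == 0])$p.value)

# Adjust for multiple testing (Benjamini-Hochberg), then apply the cutoff.
padj <- p.adjust(pvals, method = "BH")
cutP <- 0.1
selected <- names(padj)[padj < cutP]
```

Because adjusted p values are never smaller than the raw ones, the same cutoff keeps fewer pathways after correction, which matters most when many pathways are tested.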

transbioZI/BioMM documentation built on Jan. 12, 2023, 2:18 p.m.