BioMM: BioMM end-to-end prediction

View source: R/BioMM.R

BioMM R Documentation

Description

The BioMM framework uses two-stage machine learning models that integrate prior biological knowledge for end-to-end phenotype prediction.

Usage

BioMM(
  trainData,
  testData,
  pathlistDB,
  featureAnno,
  restrictUp,
  restrictDown,
  minPathSize,
  supervisedStage1 = TRUE,
  typePCA,
  resample1 = "BS",
  dataMode = "allTrain",
  repeatA1 = 100,
  repeatA2 = 1,
  repeatB1 = 20,
  repeatB2 = 1,
  nfolds = 10,
  FSmethod1,
  FSmethod2,
  cutP1,
  cutP2,
  fdr2,
  FScore = MulticoreParam(),
  classifier,
  predMode,
  paramlist,
  innerCore = MulticoreParam()
)

Arguments

trainData

The input training dataset. The first column is the label or the output. For binary classes, 0 and 1 are used to indicate the class member.

testData

The input test dataset. The first column is the label or the output. For binary classes, 0 and 1 are used to indicate the class member.
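
As an illustration of the expected layout, here is a minimal sketch of a dataset with the label in the first column and features in the remaining columns. The object name toyTrain and the feature names are hypothetical and are not part of the package.

```r
# Hypothetical toy data: 6 samples, 3 features (names are illustrative only).
set.seed(1)
toyTrain <- data.frame(
  label  = c(0, 1, 0, 1, 1, 0),  # first column: binary outcome coded as 0/1
  cg0001 = runif(6),
  cg0002 = runif(6),
  cg0003 = runif(6)
)

# Split the outcome from the features by column position.
trainY <- toyTrain[, 1]
trainX <- toyTrain[, -1, drop = FALSE]

# Check that the binary coding follows the 0/1 convention described above.
stopifnot(all(trainY %in% c(0, 1)))
```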

pathlistDB

A list of pathways with pathway IDs and their corresponding genes (identified by 'entrezID'). This is only used for pathway-based stratification (i.e., only if stratify is 'pathway').

featureAnno

The annotation data, stored in a data.frame, used for probe mapping. It must have at least two columns named 'ID' and 'entrezID'. If it is NULL, then the input probes are assumed to come from transcriptomic data. (Default: NULL)
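
The two inputs above can be pictured with a small sketch. The pathway IDs, gene IDs, probe names, and the mapping step below are hypothetical; they only illustrate the shapes described here (a named list of Entrez IDs and a data.frame with 'ID' and 'entrezID' columns) and are not the package's internal mapping code.

```r
# Hypothetical pathway list: names are pathway IDs, values are Entrez gene IDs.
toyPathlist <- list(
  "GO:0000001" = c("101", "102", "103"),
  "GO:0000002" = c("103", "104")
)

# Hypothetical probe annotation with the two required columns.
toyAnno <- data.frame(
  ID       = c("cg0001", "cg0002", "cg0003", "cg0004"),
  entrezID = c("101", "103", "104", "999"),
  stringsAsFactors = FALSE
)

# For each pathway, collect the probes whose annotated gene belongs to it.
probesByPath <- lapply(toyPathlist, function(genes) {
  toyAnno$ID[toyAnno$entrezID %in% genes]
})
```

A probe annotated to a gene outside every pathway (here cg0004) ends up in no block, which is one reason block sizes can fall below a lower bound such as restrictDown.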

restrictUp

The upper bound on the number of probes or genes in each biological stratified block.

restrictDown

The lower bound on the number of probes or genes in each biological stratified block.

minPathSize

The minimal pathway size allowed after mapping your own data to the GO database. This is only used for pathway-based stratification (i.e., only if stratify is 'pathway').

supervisedStage1

A logical value. If TRUE, then supervised learning models are applied; if FALSE, unsupervised learning.

typePCA

The type of PCA. Available options are c('regular', 'sparse').

resample1

The resampling method at stage-1. Valid options are 'CV' for cross validation and 'BS' for bootstrapping resampling. The default is 'BS'.

dataMode

The training data mode used for model training. It is used only if 'testData' is present. The model can be trained on a subset of the training data or on the entire training data: 'subTrain' selects a subset and 'allTrain' the entire training dataset.

repeatA1

The number of repeats N used during the resampling procedure. Repeated cross validation or multiple bootstrapping is performed if N >= 2. One can choose 10 repeats for 'CV' and 100 repeats for 'BS'.

repeatA2

The number of repeats N used during resampling prediction. The default is 1 for 'CV'.

repeatB1

The number of repeats N used for generating stage-2 test data prediction scores. The default is 20.

repeatB2

The number of repeats N used for test data prediction. The default is 1.

nfolds

The number of folds used for cross validation. The default is 10.
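
To make the two resampling schemes concrete, the following base-R sketch draws one bootstrap sample and one set of cross validation folds. It is an illustration under assumed toy sizes (n = 20, 5 folds), not the package's own resampling code; all variable names are hypothetical.

```r
set.seed(42)
n <- 20  # assumed number of training samples

# 'BS': draw n indices with replacement; samples never drawn are out-of-bag
# and can be used for prediction in that repeat.
bsIdx  <- sample(seq_len(n), size = n, replace = TRUE)
oobIdx <- setdiff(seq_len(n), bsIdx)

# 'CV': assign every sample to exactly one of nfolds folds; each fold is
# held out once while the model is fitted on the remaining folds.
nfolds <- 5
foldId <- sample(rep(seq_len(nfolds), length.out = n))
heldOutFold1 <- which(foldId == 1)
```

Repeating either scheme N times with fresh random draws corresponds to the repeated cross validation or multiple bootstrapping described for the repeat arguments.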

FSmethod1

Feature selection methods at stage-1. Available options are c(NULL, 'positive', 'wilcox.test', 'cor.test', 'chisq.test', 'posWilcox').

FSmethod2

Feature selection methods at stage-2. Features that are positively associated with the outcome will be used.

cutP1

The cutoff used for p value thresholding at stage-1. Commonly used cutoffs are c(0.5, 0.1, 0.05, 0.01, etc). If "FSmethod1" is NULL, then no cutoff is applied. If FSmethod = "posTopCor", cutP is defined as the number of most correlated features with 'fdr' = NULL.

cutP2

The cutoff used for p value thresholding at stage-2. Commonly used cutoffs are c(0.5, 0.1, 0.05, 0.01, etc). If "FSmethod2" is NULL, then no cutoff is applied. If FSmethod = "posTopCor", cutP is defined as the number of most correlated features with 'fdr' = NULL.
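
The idea of p value thresholding can be sketched in base R as follows. This is a simplified stand-in using wilcox.test on simulated data with an assumed cutoff of 0.05; it is not the package's feature selection routine, and the data and names are hypothetical.

```r
set.seed(7)
y <- rep(c(0, 1), each = 15)  # binary outcome, 15 samples per class
X <- matrix(rnorm(30 * 5), nrow = 30,
            dimnames = list(NULL, paste0("f", 1:5)))
X[, "f1"] <- X[, "f1"] + 3 * y  # make f1 strongly associated with y

# One two-sample Wilcoxon test per feature.
pvals <- apply(X, 2, function(x) wilcox.test(x[y == 1], x[y == 0])$p.value)

# Keep only the features whose p value falls below the cutoff.
cutP <- 0.05
selected <- names(pvals)[pvals < cutP]
```

A looser cutoff such as 0.5 retains more features, and a stricter one such as 0.01 retains fewer.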

fdr2

Multiple testing correction method at stage-2. Available options are c(NULL, 'fdr', 'BH', 'holm', etc). See also p.adjust. The default is NULL. This option is useful particularly when large sets of pathways are investigated.
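
As a small worked example of what these correction methods do, the sketch below applies p.adjust from base R to five made-up p values. The values and the 0.05 threshold are assumptions chosen for illustration.

```r
# Five hypothetical pathway-level p values.
pvals <- c(0.001, 0.01, 0.02, 0.2, 0.30)

# Benjamini-Hochberg ('BH', equivalent to 'fdr') and Holm adjustments.
adjBH   <- p.adjust(pvals, method = "BH")
adjHolm <- p.adjust(pvals, method = "holm")

# Indices of the p values that survive each correction at 0.05.
keepBH   <- which(adjBH < 0.05)
keepHolm <- which(adjHolm < 0.05)
```

Here BH keeps the first three p values while the more conservative Holm method keeps only the first two, which shows why the choice matters when many pathways are tested.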

FScore

The number of cores used for feature selection, supplied as a BiocParallel parameter object such as MulticoreParam().

classifier

Machine learning classifiers at both stages. Available options are c('randForest', 'SVM', 'glmnet').

predMode

The prediction mode at both stages. Available options are c('probability', 'classification', 'regression').

paramlist

A list of model parameters at both stages. The set of parameters differs for each classifier. Please see the parameters implemented for each individual classifier, e.g., 'baseRandForest()', 'baseSVM()', and 'baseGLMnet()'.

innerCore

The number of cores used for computation. It needs to be reconciled with "FScore" depending on the number of cores available.

Details

Stage-2 training data can be learned using either bootstrapping or cross validation resampling in the supervised learning setting. Stage-2 test data is learned via independent test set prediction.
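
The two-stage idea can be sketched in base R under strong simplifying assumptions: two simulated feature blocks stand in for pathways, logistic regression (glm) replaces the classifiers offered by the package, and cross validation produces the stage-2 training data. This is a conceptual illustration only, not the BioMM implementation; every name below is hypothetical.

```r
set.seed(123)
n <- 60
y <- rep(c(0, 1), length.out = n)

# Two hypothetical feature blocks (stand-ins for pathways), 3 features each.
blocks <- list(
  pathA = matrix(rnorm(n * 3), n) + y,  # informative block: shifted by class
  pathB = matrix(rnorm(n * 3), n)       # pure noise block
)

nfolds <- 5
foldId <- sample(rep(seq_len(nfolds), length.out = n))

# Stage 1: for one block, return out-of-fold predicted probabilities, so
# each sample is scored by a model that never saw it during fitting.
stage1Score <- function(X) {
  score <- numeric(n)
  dat <- data.frame(y = y, X)
  for (k in seq_len(nfolds)) {
    held <- foldId == k
    fit <- glm(y ~ ., data = dat[!held, ], family = binomial())
    score[held] <- predict(fit, newdata = dat[held, ], type = "response")
  }
  score
}

# Stage-2 training data: one column of stage-1 scores per block.
stage2Data <- data.frame(y = y, sapply(blocks, stage1Score))

# Stage 2: fit a model on the block-level scores.
fit2 <- glm(y ~ ., data = stage2Data, family = binomial())
```

Using out-of-fold (or out-of-bag) scores rather than in-sample fitted values is the design point: it keeps the stage-2 features from being overly optimistic about the training labels.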

Value

The CV or BS predicted score for the training data, and the test set predicted score if testData is given.

References

Chen, J., & Schwarz, E. (2017). BioMM: Biologically-informed Multi-stage Machine learning for identification of epigenetic fingerprints. arXiv preprint arXiv:1712.00336.

Perlich, C., & Swirszcz, G. (2011). On cross-validation and stacking: Building seemingly predictive models on random data. ACM SIGKDD Explorations Newsletter, 12(2), 11-15.

See Also

reconBySupervised; reconByUnsupervised; BioMMstage2pred

Examples

 
## Load data
methylfile <- system.file('extdata', 'methylData.rds', package='BioMM')
methylData <- readRDS(methylfile)
testData <- NULL
## Annotation file
probeAnnoFile <- system.file('extdata', 'cpgAnno.rds', package='BioMM')
probeAnno <- readRDS(file=probeAnnoFile)
## Pathway list (first 100 entries to keep the example small)
golist <- readRDS(system.file("extdata", "goDB.rds", package="BioMM"))
pathlistDB <- golist[1:100]
## Model settings
supervisedStage1 <- TRUE
classifier <- 'randForest'
predMode <- 'classification'
paramlist <- list(ntree=300, nthreads=30)
## Parallel backends for feature selection (param1) and computation (param2)
library(BiocParallel)
library(ranger)
param1 <- MulticoreParam(workers = 2)
param2 <- MulticoreParam(workers = 20)
## Not Run
## result <- BioMM(trainData=methylData, testData=NULL,
##                 pathlistDB=pathlistDB, featureAnno=probeAnno,
##                 restrictUp=200, restrictDown=10, minPathSize=10,
##                 supervisedStage1=supervisedStage1, typePCA='regular',
##                 resample1='BS', dataMode="allTrain",
##                 repeatA1=20, repeatA2=1, repeatB1=20, repeatB2=1,
##                 nfolds=10, FSmethod1=NULL, FSmethod2=NULL,
##                 cutP1=0.1, cutP2=0.1, fdr2=NULL, FScore=param1,
##                 classifier=classifier, predMode=predMode,
##                 paramlist=paramlist, innerCore=param2)
## if (is.null(testData)) {
##     ## Only resampled predictions for the training data are returned
##     predY <- result
##     trainDataY <- methylData[,1]
##     metricCV <- getMetrics(dataY = trainDataY, predY)
##     message("Cross-validation prediction performance:")
##     print(metricCV)
## } else {
##     ## Resampled training predictions and test set predictions are returned
##     trainDataY <- methylData[,1]
##     testDataY <- testData[,1]
##     cvYscore <- result[[1]]
##     testYscore <- result[[2]]
##     metricCV <- getMetrics(dataY = trainDataY, cvYscore)
##     metricTest <- getMetrics(dataY = testDataY, testYscore)
##     message("Cross-validation performance:")
##     print(metricCV)
##     message("Test set prediction performance:")
##     print(metricTest)
## }

transbioZI/BioMM documentation built on Jan. 12, 2023, 2:18 p.m.