buildPredictor: Run nested cross-validation on data
In BaderLab/netDx: Network-based patient classifier


Run nested cross-validation on data

buildPredictor(
  dataList,
  groupList,
  outDir = tempdir(),
  makeNetFunc,
  featScoreMax = 10L,
  trainProp = 0.8,
  numSplits = 10L,
  numCores,
  JavaMemory = 4L,
  featSelCutoff = 9L,
  keepAllData = FALSE,
  startAt = 1L,
  preFilter = FALSE,
  impute = FALSE,
  preFilterGroups = NULL,
  imputeGroups = NULL,
  logging = "default",
  debugMode = FALSE
)

`dataList`	(MultiAssayExperiment) patient data with sample metadata. Clinical data is in colData() and other input datatypes are in the assays() slot. names(groupList) should match names(assays(dataList)). The only exception is clinical data: if a groupList entry is called "clinical", the algorithm will search for the corresponding variable names in colData(dataList) (i.e. columns of the sample metadata table).
`groupList`	(list of lists) keys are datatypes, and values are lists indicating how units for those datatypes are to be grouped. Keys must match names(assays(dataList)). The only exception is for clinical values. Variables for "clinical" will be extracted from columns of the sample metadata table (i.e. from colData(dataList)). e.g. groupList[["rna"]] could be a list of pathway definitions. So keys(groupList[["rna"]]) would be pathway names, generating one PSN per pathway, and values(groupList[["rna"]]) would be the genes grouped for the corresponding pathway.
`outDir`	(char) directory where results will be stored. If this directory exists, its contents will be overwritten. Must be an absolute path.
`makeNetFunc`	(function) user-defined function for creating the set of input PSN provided to netDx. See createPSN_MultiData()::customFunc.
`featScoreMax`	(integer) number of CV folds in inner loop
`trainProp`	(numeric 0 to 1) proportion of samples to use for training
`numSplits`	(integer) number of train/blind test splits (i.e. iterations of outer loop)
`numCores`	(integer) number of CPU cores for parallel processing
`JavaMemory`	(integer) memory (in Gb) used for each fold of CV
`featSelCutoff`	(integer) score cutoff, from inner-fold CV, used to call a feature "feature-selected" in a given split
`keepAllData`	(logical) if TRUE keeps all intermediate files, even those not needed for assessing the predictor. Use very cautiously as for some designs, each split can result in using 1Gb of data.
`startAt`	(integer) which of the splits to start at (e.g. if the job aborted part-way through)
`preFilter`	(logical) if TRUE uses lasso to prefilter dataList within cross-validation loop. Only variables that pass lasso get included. The current option is not recommended for pathway-level features as most genes will be eliminated by lasso. Future variations may allow other prefiltering options that are more lenient.
`impute`	(logical) if TRUE applies imputation by median within CV
`preFilterGroups`	(char) vector with subset of names(dataList) to which prefiltering needs to be limited. Allows users to indicate which data layers should be prefiltered using regression and which are to be omitted from this process. Prefiltering uses regression, which omits records with missing values. Structured missingness can result in empty dataframes if missing values are removed from these, which in turn can crash the predictor. To impute missing data, see the 'impute' and 'imputeGroups' parameters.
`imputeGroups`	(char) If impute is set to TRUE, indicates which groups should be imputed.
`logging`	(char) level of detail with which messages are printed. Options are: 1) none: turn off all messages; 2) all: greatest level of detail (recommended for advanced users, or for debugging); 3) default: print key details (useful setting for most users)
`debugMode`	(logical) when TRUE runs jobs in serial instead of parallel and prints verbose messages. Also prints system Java calls and prints all standard out and error output associated with these calls.
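
The nesting that groupList expects can be sketched with plain R lists. The pathway names, gene symbols, the "rna" assay name, and the checkGroupKeys() helper below are hypothetical and only illustrate the shape; they are not part of netDx.

```r
# Toy pathway definitions (hypothetical gene sets, for illustration only)
pathways <- list(
  PATHWAY_A = c("GENE1", "GENE2", "GENE3"),
  PATHWAY_B = c("GENE2", "GENE4")
)

groupList <- list()
# Key is assumed to match an assay name in names(assays(dataList))
groupList[["rna"]] <- pathways
# "clinical" is the exception: values name columns of colData(dataList)
groupList[["clinical"]] <- list(age = "age", stage = "STAGE")

# Hypothetical helper: report groupList keys that match neither an
# assay name nor the special "clinical" entry
checkGroupKeys <- function(groupList, assayNames) {
  setdiff(names(groupList), c(assayNames, "clinical"))
}

unmatched <- checkGroupKeys(groupList, assayNames = "rna")
print(names(groupList[["rna"]]))  # one PSN would be built per pathway name
```

An empty result from checkGroupKeys() means every key is either an assay name or "clinical", which is the condition the dataList and groupList descriptions require.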

Wrapper function to run netDx with nested cross-validation, with an inner loop of X-fold cross-validation and an outer loop of different random splits of data into train and blind test. The user needs to supply a custom function to create PSN; see createPSN_MultiData(). This wrapper provides flexibility for designs with one or several heterogeneous data types, and one or more ways of defining patient similarity. For example, the designs it handles include:
1) Single datatype, single similarity metric: Expression data -> pathways
2) Single datatype, multiple metrics: Expression data -> pathways (Pearson corr) and single gene networks (normalized difference)
3) Multiple datatypes, multiple metrics: Expression -> Pathways; Clinical -> single or grouped nets
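
The interplay of trainProp, numSplits, featScoreMax and featSelCutoff can be illustrated with a toy base-R loop. This is a sketch of the nested design only, not the netDx implementation: the per-fold feature selection is replaced by a random placeholder, the sample and feature names are invented, and it assumes a feature is kept when its score meets or exceeds the cutoff.

```r
set.seed(42)  # reproducible toy run
samples  <- paste0("P", 1:20)          # invented patient IDs
features <- c("netA", "netB", "netC")  # invented network names
trainProp <- 0.8; numSplits <- 3L
featScoreMax <- 10L; featSelCutoff <- 9L

results <- vector("list", numSplits)
for (s in seq_len(numSplits)) {
  # Outer loop: random train / blind test split
  nTrain <- floor(trainProp * length(samples))
  train  <- sample(samples, nTrain)
  test   <- setdiff(samples, train)

  # Inner loop: each of featScoreMax folds can add one point to a feature,
  # so scores range from 0 to featScoreMax
  fold  <- sample(rep(seq_len(featScoreMax), length.out = length(train)))
  score <- setNames(integer(length(features)), features)
  for (k in seq_len(featScoreMax)) {
    innerTrain <- train[fold != k]  # samples a real run would score on
    # Placeholder for the real per-fold feature selection (random here)
    picked <- features[runif(length(features)) < 0.9]
    score[picked] <- score[picked] + 1L
  }

  # Assumed rule: keep features whose score reaches the cutoff
  results[[s]] <- list(train = train, test = test, score = score,
                       selected = names(score)[score >= featSelCutoff])
}
```

With the defaults (featScoreMax = 10L, featSelCutoff = 9L), a feature would need to score in at least 9 of the 10 inner folds to be selected for a split under this assumed rule.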

symmetric matrix of size ncol(dat) (number of patients) containing pairwise patient similarities

(list) with the following entries:
"inputNets": data.frame of all input network names. Columns are "NetType" (group) and "NetName" (network name).
"Split<i>": the data for train/test split i (i.e. one entry per train/test split). Each "Split<i>" entry contains in turn a list of results for that split. Key-value pairs are:
1) predictions: real and predicted labels for test patients
2) accuracy: percent accuracy of predictions
3) featureScores: list of length g, where g is the number of patient classes. Scores for all features following feature selection, for the corresponding class.
4) featureSelected: list of length g (number of patient classes). List of selected features for the corresponding patient class, for that train/test split.
Side effect: generates predictor-related data in <outDir>.
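
A short sketch of how the returned list might be traversed. The object below is a hand-built mock that follows the structure described above; all values, network names and class labels in it are invented, and only a subset of the per-split entries is included.

```r
# Mock object shaped like the documented return value (values are invented)
out <- list(
  inputNets = data.frame(NetType = c("rna", "clinical"),
                         NetName = c("PATHWAY_A", "age")),
  Split1 = list(accuracy = 80,
                featureSelected = list(LumA = c("PATHWAY_A", "age"),
                                       notLumA = "age")),
  Split2 = list(accuracy = 70,
                featureSelected = list(LumA = "PATHWAY_A",
                                       notLumA = c("age", "PATHWAY_A")))
)

# Pick out the per-split entries by name
splits <- out[grep("^Split", names(out))]

# Mean accuracy across train/test splits
acc <- vapply(splits, function(x) x$accuracy, numeric(1))
meanAcc <- mean(acc)

# Number of splits in which each feature was selected for class "LumA"
lumA <- table(unlist(lapply(splits, function(x) x$featureSelected[["LumA"]])))
print(meanAcc)
print(lumA)
```

Counting how often a feature is selected across splits is one way to identify features that are consistently predictive for a class.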

library(netDx)
library(curatedTCGAData)
library(MultiAssayExperiment)
curatedTCGAData(diseaseCode="BRCA", assays="*",dry.run=TRUE,version="1.1.38")

# fetch mRNA array data
brca <- curatedTCGAData("BRCA",c("mRNAArray"),dry.run=FALSE,version="1.1.38")

# get subtype info
pID <- colData(brca)$patientID
pam50 <- colData(brca)$PAM50.mRNA
staget <- colData(brca)$pathology_T_stage
st2 <- rep(NA,length(staget))
st2[which(staget %in% c("t1","t1a","t1b","t1c"))] <- 1
st2[which(staget %in% c("t2","t2a","t2b"))] <- 2
st2[which(staget %in% c("t3","t3a"))] <- 3
st2[which(staget %in% c("t4","t4b","t4d"))] <- 4

# flag samples to exclude before recoding pam50; after recoding,
# "Normal-like" has been overwritten and can no longer be matched
idx <- union(which(pam50 == "Normal-like"), which(is.na(st2)))

# binarize subtype: LumA vs. notLumA
pam50[which(!pam50 %in% "Luminal A")] <- "notLumA"
pam50[which(pam50 %in% "Luminal A")] <- "LumA"
colData(brca)$ID <- pID
colData(brca)$STAGE <- st2
colData(brca)$STATUS <- pam50

# keep only tumour samples
cat(sprintf("excluding %i samples\n", length(idx)))

tokeep <- setdiff(pID, pID[idx])
brca <- brca[,tokeep,]

pathList <- readPathways(fetchPathwayDefinitions(month=10,year=2020))
brca <- brca[,,1] # keep only clinical and mRNA data

# remove duplicate arrays
smp <- sampleMap(brca)
samps <- smp[which(smp$assay=="BRCA_mRNAArray-20160128"),]
notdup <- samps[which(!duplicated(samps$primary)),"colname"]
brca[[1]] <- brca[[1]][,notdup]

groupList <- list()
groupList[["BRCA_mRNAArray-20160128"]] <- pathList[seq_len(3)]
groupList[["clinical"]] <- list(
    age="patient.age_at_initial_pathologic_diagnosis",
    stage="STAGE")
makeNets <- function(dataList, groupList, netDir,...) {
    netList <- c()
    # make RNA nets: group by pathway
    if (!is.null(groupList[["BRCA_mRNAArray-20160128"]])) {
        netList <- makePSN_NamedMatrix(dataList[["BRCA_mRNAArray-20160128"]],
            rownames(dataList[["BRCA_mRNAArray-20160128"]]),
            groupList[["BRCA_mRNAArray-20160128"]],
            netDir,verbose=FALSE,
            writeProfiles=TRUE,...)
        netList <- unlist(netList)
        cat(sprintf("Made %i RNA pathway nets\n", length(netList)))
    }

    # make clinical nets, one net for each variable
    netList2 <- c()
    if (!is.null(groupList[["clinical"]])) {
        netList2 <- makePSN_NamedMatrix(dataList$clinical,
            rownames(dataList$clinical),
            groupList[["clinical"]],netDir,
            simMetric="custom",customFunc=normDiff, # custom function
            writeProfiles=FALSE,
            sparsify=TRUE,verbose=TRUE,...)
    }
    netList2 <- unlist(netList2)
    cat(sprintf("Made %i clinical nets\n", length(netList2)))
    netList <- c(netList,netList2)
    cat(sprintf("Total of %i nets\n", length(netList)))
    return(netList)
}

# takes 10 minutes to run
#out <- buildPredictor(dataList=brca,groupList=groupList,
#   makeNetFunc=makeNets, ### custom network creation function
#   outDir=paste(tempdir(),"pred_output",sep=getFileSep()), ## absolute path
#   numCores=16L,featScoreMax=2L, featSelCutoff=1L,numSplits=2L)
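
The clinical nets in makeNets() pass normDiff as the custom similarity function. For intuition, a normalized-difference similarity can be sketched in base R, assuming the form 1 - |x_i - x_j| / range(x). normDiffSketch is a hypothetical name and the patient values are invented; this is not the netDx implementation.

```r
# Hypothetical sketch of a normalized-difference similarity
# x: one-row matrix (a single variable, patients in columns)
normDiffSketch <- function(x) {
  nm  <- colnames(x)
  v   <- as.numeric(x)
  rng <- max(v, na.rm = TRUE) - min(v, na.rm = TRUE)
  # Pairwise absolute differences, scaled by the observed range
  sim <- 1 - abs(outer(v, v, "-")) / rng
  dimnames(sim) <- list(nm, nm)
  sim
}

# Invented ages for three patients
age <- matrix(c(30, 50, 70), nrow = 1,
              dimnames = list("age", c("P1", "P2", "P3")))
sim <- normDiffSketch(age)
print(sim)
```

The result is a symmetric patient-by-patient matrix in which identical values give similarity 1 and the two most distant patients give similarity 0.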