Introduction

In this example, we will use clinical data and three types of 'omic data - gene expression, DNA methylation and proteomic data - for binary classification of breast tumours. We also use several strategies and definitions of similarity to create features.

Setup

Load the netDx package.

suppressWarnings(suppressMessages(require(netDx)))

Data

For this example we pull data from the The Cancer Genome Atlas through the BioConductor curatedTCGAData package. The fetch command automatically brings in a MultiAssayExperiment object.

suppressMessages(library(curatedTCGAData))

We use the curatedTCGAData() command to look at available assays in the breast cancer dataset.

curatedTCGAData(diseaseCode="BRCA", assays="*",dry.run=TRUE,version="1.1.38")

In this call we fetch only the gene expression, proteomic and methylation data; setting dry.run=FALSE initiates the fetching of the data.

brca <- suppressMessages(
   curatedTCGAData("BRCA",
               c("mRNAArray","RPPA*","Methylation_methyl27*"),
    dry.run=FALSE,version="1.1.38"))

This next code block prepares the TCGA data. In practice you would do this once, and save the data before running netDx, but we run it here to see an end-to-end example.

source("prepare_data.R")
brca <- prepareData(brca,setBinary=TRUE)

The important thing is to create ID and STATUS columns in the sample metadata slot. netDx uses these to get the patient identifiers and labels, respectively.

pID <- colData(brca)$patientID
colData(brca)$ID <- pID

Rules to create features (patient similarity networks)

Our plan is to group gene expression data by pathways and clinical data by single variables. We will treat methylation and proteomic data each as a single feature, so each of those groups will contain the entire input table for those corresponding data types.

In the code below, we fetch pathway definitions for January 2018 from (http://download.baderlab.org/EM_Genesets) and group gene expression data by pathways. To keep the example short, we limit to only three pathways, but in practice you would use all pathways meeting a size criterion; e.g. those containing between 10 and 500 genes.

Grouping rules are accordingly created for the clinical, methylation and proteomic data.

groupList <- list()

# genes in mRNA data are grouped by pathways
pathFile <- sprintf("%s/extdata/pathway_ex3.gmt", path.package("netDx"))
pathList <- readPathways(pathFile)
groupList[["BRCA_mRNAArray-20160128"]] <- pathList
# clinical data is not grouped; each variable is its own feature
groupList[["clinical"]] <- list(
      age="patient.age_at_initial_pathologic_diagnosis",
       stage="STAGE"
)
# for methylation generate one feature containing all probes
# same for proteomics data
tmp <- list(rownames(experiments(brca)[[2]]));
names(tmp) <- names(brca)[2]
groupList[[names(brca)[2]]] <- tmp

tmp <- list(rownames(experiments(brca)[[3]]));
names(tmp) <- names(brca)[3]
groupList[[names(brca)[3]]] <- tmp

Define patient similarity for each network

We provide netDx with a custom function to generate similarity networks (i.e. features). The first block tells netDx to generate correlation-based networks using everything but the clinical data. This is achieved by the call:

makePSN_NamedMatrix(..., writeProfiles=TRUE,...)`

The second block makes a different call to makePSN_NamedMatrix() but this time, requesting the use of the normalized difference similarity metric. This is achieved by calling:

   makePSN_NamedMatrix(,..., 
                       simMetric="custom", customFunc=normDiff,
                       writeProfiles=FALSE)

normDiff is a function provided in the netDx package, but the user may define custom similarity functions in this block of code and pass those to makePSN_NamedMatrix(), using the customFunc parameter.

makeNets <- function(dataList, groupList, netDir,...) {
    netList <- c() # initialize before is.null() check
    # correlation-based similarity for mRNA, RPPA and methylation data
    # (Pearson correlation)
    for (nm in setdiff(names(groupList),"clinical")) {
       # NOTE: the check for is.null() is important!
        if (!is.null(groupList[[nm]])) {
        netList <- makePSN_NamedMatrix(dataList[[nm]],
                     rownames(dataList[[nm]]),
                   groupList[[nm]],netDir,verbose=FALSE,
                     writeProfiles=TRUE,...) 
        }
    }

    # make clinical nets (normalized difference)
    netList2 <- c()
    if (!is.null(groupList[["clinical"]])) {
    netList2 <- makePSN_NamedMatrix(dataList$clinical, 
        rownames(dataList$clinical),
        groupList[["clinical"]],netDir,
        simMetric="custom",customFunc=normDiff, # custom function
        writeProfiles=FALSE,
        sparsify=TRUE,verbose=TRUE,...)
    }
    netList <- c(unlist(netList),unlist(netList2))
    return(netList)
}

Build predictor

Finally we make the call to build the predictor.

set.seed(42) # make results reproducible

# location for intermediate work
# set keepAllData to TRUE to not delete at the end of the 
# predictor run.
# This can be useful for debugging.
nco <- round(parallel::detectCores()*0.75) # use 75% available cores
message(sprintf("Using %i of %i cores", nco, parallel::detectCores()))
outDir <- paste(tempdir(),"pred_output",sep=getFileSep()) # use absolute path
numSplits <- 2L
out <- suppressMessages(
   buildPredictor(dataList=brca,groupList=groupList,
      makeNetFunc=makeNets,
      outDir=outDir, ## netDx requires absolute path
      numSplits=numSplits, featScoreMax=2L, featSelCutoff=1L,
      numCores=nco)
)

Examine output

The results are stored in the list object returned by the buildPredictor() call. This list contains:

summary(out)
summary(out$Split1)

Compute accuracy for three-way classificationL

# Average accuracy
st <- unique(colData(brca)$STATUS) 
acc <- matrix(NA,ncol=length(st),nrow=numSplits) 
colnames(acc) <- st 
for (k in 1:numSplits) { 
    pred <- out[[sprintf("Split%i",k)]][["predictions"]];
    tmp <- pred[,c("ID","STATUS","TT_STATUS","PRED_CLASS",
                     sprintf("%s_SCORE",st))]
    for (m in 1:length(st)) {
       tmp2 <- subset(tmp, STATUS==st[m])
       acc[k,m] <- sum(tmp2$PRED==tmp2$STATUS)/nrow(tmp2)
    }
}
print(round(acc*100,2))

Also, examine the confusion matrix. We can see that the model perfectly classifies basal tumours, but performs poorly in distinguishing between the two types of luminal tumours.

res <- out$Split1$predictions
print(table(res[,c("STATUS","PRED_CLASS")]))

sessionInfo

sessionInfo()


BaderLab/netDx documentation built on Sept. 26, 2021, 9:13 a.m.