In this example we will build a predictor to classify breast tumours as being either of Luminal A subtype or otherwise. The process is identical for classifying three or more labels, and the example uses minimal data for quick runtime.
For this we will use data from the The Cancer Genome Atlas, and will integrate three types of -omic data:
knitr::opts_chunk$set(crop=NULL) knitr::include_graphics("images/vignette1_design.jpg")
In this example, we use curated multi-modal data from The Cancer Genome Atlas, gotten from the BioConductor curatedTCGAData
package. Data for all cancer types profiled in TCGA are available through this package; see this tutorial for details.
suppressMessages(library(curatedTCGAData))
In this example, we use curated data from The Cancer Genome Atlas, through the BioConductor curatedTCGAData
package. The goal is to classify a breast tumour into either a Luminal A subtype or otherwise. The predictor will integrate clinical variables selected by the user, along with gene expression data.
Here we load the required packages and download clinical and gene expression data.
suppressMessages(library(curatedTCGAData))
Let's take a look at the available data for breast cancer, without downloading any (set dry.run=TRUE
).
Note that the new release of BioConductor (3.13) actually allows users to fetch one of two versions of TCGA data.
curatedTCGAData(diseaseCode="BRCA", assays="*",dry.run=TRUE, version="1.1.38")
Now let's actually download the data, getting just the three layers we need:
brca <- suppressMessages(curatedTCGAData("BRCA", c("mRNAArray", "miRNASeqGene"), dry.run=FALSE, version="1.1.38"))
This call returns a MultiAssayExperiment
object. Recall that this is a container for storing multiple assays performed on the same set of samples. See this tutorial to learn more.
Let's briefly explore the brca
MultiAssayExperiment
object.
brca
assays()
returns a list
with all -omic data associated with this object.
summary(assays(brca))
names()
shows the datatypes in each slot of assays()
:
names(assays(brca))
So miRNA data is in slot 1, gene expression in slot 2, etc.,
We can subset the data to see what it looks like. Let's do that for the miRNA data, looking at just the first five measures:
mir <- assays(brca)[["BRCA_miRNASeqGene-20160128"]] head(mir[,1:5])
Patient metadata is contained in the colData()
slot. Rows contain data for each patient and columns contain measures such as clinical characteristics:
pheno <- colData(brca) colnames(pheno)[1:20] head(pheno[,1:5])
This next code block prepares the TCGA data. This includes:
ID
column in colData(brca)
, which contains unique patient IDsSTATUS
column in colData(brca)
which contains the patient labels (i.e what we want netDx to classify). In practice you would prepare the dataset once and save it to file, then separately load it before running netDx; i.e. decouple data processing and running the predictor. The data processing code has been moved into a supporting file, prepare_data.R
. You can explore it after the lab to see how some things are achieved (e.g. removing duplicate samples). For now, let's just run it.
source("prepare_data.R") brca <- prepareData(brca,setBinary=TRUE)
The important thing is to create ID
and STATUS
columns in the sample metadata table. netDx uses these to get the patient identifiers and labels, respectively.
pheno <- colData(brca) head(pheno[,c("ID","STATUS")]) table(pheno$STATUS,useNA="always") # good practice: useNA="always" shows missing values
Now let's set up the data for input to netDx.
netDx allows the user to define how data is converted into patient similarity networks (or PSNs), which are the features that go into the model. This is done specifically by telling the model how to:
The relevant input parameters are:
groupList
: sets of input data that would correspond to individual networks (e.g. genes grouped into pathways)makeNets()
: an R function telling netDx what similarity metric to use for each data layerLet's start by loading the netDx
package.
suppressWarnings(suppressMessages(require(netDx)))
Let's set up each of the input arguments one by one.
What is this: groupList
tells netDx how to group measures within a layer, to generate a PSN. Measures could be individual genes, proteins, CpG bases (in DNA methylation data), clinical variables, etc.,
In this simple example we just create a single PSN for each datatype, containing all measures from that datatype.
expr <- assays(brca) groupList <- list() for (k in 1:length(expr)) { # loop over all layers cur <- expr[[k]]; nm <- names(expr)[k] # all measures from this layer go into our single PSN groupList[[nm]] <- list(nm=rownames(cur)) # assign same layer name as in input data names(groupList[[nm]])[1] <- nm; }
Notice that groupList
is a two tiered list, or list-of-lists. The first tier is for each data layers, with names matching those in assays(brca)
. The second tier contains all the PSNs we want to make for that layer. In this lab exercise we create only one PSN per data layer, using all the measures from an -omic assay. e.g. One PSN based on similarity across entire transcriptome, one for methylome, etc.,). So in this lab exercise, the inner tier simply contains one entry, with all measures for the given layer.
This design will get more interesting in Lab 2, when we create pathway-level features.
Let's take a look at groupList
. Here is the first tier:
summary(groupList)
And the second tier:
names(groupList[["BRCA_mRNAArray-20160128"]]) length(groupList[["BRCA_mRNAArray-20160128"]][[1]]) head(groupList[["BRCA_mRNAArray-20160128"]][[1]])
The makeNets
function tells the predictor how to create networks from provided input data.
This function requires dataList
,groupList
, and netDir
as input variables. The residual ...
parameter is to pass additional variables to makePSN_NamedMatrix()
, notably numCores
(number of parallel jobs).
netDx requires that this function have:
dataList
,groupList
, and netDir
as input variables. The residual ...
parameter is to pass additional variables to makePSN_NamedMatrix()
, notably number of cores for parallel processing (numCores
). makeNets <- function(dataList, groupList, netDir,...) { netList <- c() # initialize before is.null() check layerNames <- c("BRCA_miRNASeqGene-20160128", "BRCA_mRNAArray-20160128") for (nm in layerNames){ ## for each layer if (!is.null(groupList[[nm]])){ ## must check for null for each layer netList_cur <- makePSN_NamedMatrix( dataList[[nm]], rownames(dataList[[nm]]), ## names of measures (e.g. genes, CpGs) groupList[[nm]], ## how to group measures in that layer netDir, ## leave this as-is, netDx will figure out where this is. verbose=FALSE, writeProfiles=TRUE, ## use Pearson correlation-based similarity ... ) netList <- c(netList,netList_cur) ## just leave this in } } return(unlist(netList)) ## just leave this in }
Finally we call the function that runs the netDx predictor. We provide:
dataList
)groupList
)makeNetFunc
)numSplits
), featScoreMax
, set to 10)featSelCutoff
); only features scoring this value or higher will be used to classify test patients,numCores
).The call below runs 10 train/test splits. Within each split, it:
trainProp=0.8
)featScoreMax=10L
)featSelCutoff=9L
) to classify test samples for that split.In practice a good starting point is featScoreMax=10
, featSelCutoff=9
and numSplits=10L
, but these parameters depend on the sample sizes in the dataset and heterogeneity of the samples.
This step can take a few hours based on the current parameters, so we're commenting this out for the tutorial and will simply load the results.
nco <- round(parallel::detectCores()*0.75) # use 75% available cores message(sprintf("Using %i of %i cores", nco, parallel::detectCores())) t0 <- Sys.time() set.seed(42) # make results reproducible outDir <- paste(tempdir(),randAlphanumString(), "pred_output",sep=getFileSep()) if (file.exists(outDir)) unlink(outDir,recursive=TRUE) model <- suppressMessages(buildPredictor( dataList=brca, ## your data groupList=groupList, ## grouping strategy makeNetFunc=makeNets, ## function to build PSNs outDir=outDir, ## output directory trainProp=0.8, ## pct of samples to use to train model in ## each split numSplits=2L, ## number of train/test splits featSelCutoff=1L, ## threshold for calling something ## feature-selected featScoreMax=2L, ## max score for feature selection numCores=nco, ## set higher for parallelizing debugMode=FALSE, keepAllData=FALSE, ## set to TRUE for debugging or low-level files used by the dictor logging="none" )) t1 <- Sys.time() print(t1-t0)
Now we get model output, including performance for various train/test splits and consistently high-scoring features.
In the function below, we define top-scoring features as those which score two out of two in at least half of the train/test splits:
results <- getResults(model,unique(colData(brca)$STATUS), featureSelCutoff=2L,featureSelPct=0.50)
results
contains performance
, selectedFeatures
for each patient label, and the table of feature scores
.
summary(results)
Look at the performance:
results$performance
Look at feature scores for all labels, across all train-test splits:
results$featureScores
Let's examine our confusion matrix:
confMat <- confusionMatrix(model)
Note: Rows of this matrix don't add up to 100% because the matrix is an average of the confusion matrices from all of the train/test splits.
And here are selected features, which are those scoring 2 out of 2 in at least half of the splits. This threshold is simply for illustration. In practice we would run at least 10 train/test splits (ideally 100+), and look for features that score 7+ out of 10 in >70% splits.
results$selectedFeatures
We finally get the integrated PSN and visualize it using a tSNE plot:
## this call doesn't work in Rstudio; for now we've commented this out and saved the PSN file. psn <- getPSN(brca,groupList,makeNets,results$selectedFeatures) require(Rtsne) tsne <- tSNEPlotter( psn$patientSimNetwork_unpruned, colData(brca) )
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.