Building and evaluating a small gene signature from MammaPrint data using tdsm

Introduction

Here we run through an example analysis and show how a user may edit an existing template. We will use the TSP regression template.

Let's first load the library:

library(tdsm)

Initial analysis

The library includes two datasets curated by Marchionni et al. (2013, BMC Genomics). These data were used to train and test the MammaPrint 70-gene signature for predicting the risk of distant recurrence in breast cancer (Van de Vijver et al., 2002, NEJM). We remapped the gene IDs to gene symbols using the org.Hs.eg.db library from Bioconductor. The remapping has already been applied to the included data, so the code below need not be run; it is shown for reference only.

# Install annotation library from bioconductor
source("http://bioconductor.org/biocLite.R")
biocLite("org.Hs.eg.db")

# Load annotation library
library(org.Hs.eg.db)

# This database maps Entrez Gene IDs to Gene Symbols
y <- org.Hs.egSYMBOL
mapped_genes <- mappedkeys(y)
yy <- as.list(y[mapped_genes])

# This database maps GenBank accession numbers to Entrez Gene IDs
z <- org.Hs.egACCNUM2EG
mapped_genes <- mappedkeys(z)
zz <- as.list(z[mapped_genes])

# These contain unique IDs or GenBank accession numbers in the training dataset
tmp_rn <- fData(glasEset)$Comment.AEReporterName

# Pre-allocate the vector of new row names (one entry per reporter)
new_rn <- vector("character", length(tmp_rn))

for(i in seq_along(tmp_rn)){
    cur <- zz[[tmp_rn[i]]]
    cur_sym <- ifelse(is.null(cur), tmp_rn[i], yy[[cur]])
    new_rn[i] <- ifelse(is.null(cur_sym), tmp_rn[i], cur_sym)
}

# Manual touch-ups for duplicated gene symbols (duplicate accession numbers or gene symbols are possible)
new_rn[147] <- fData(glasEset)$Reporter.Database.Entry.embl[147]
new_rn[160] <- fData(glasEset)$Comment.AEReporterName[160]
new_rn[890] <- fData(glasEset)$Comment.AEReporterName[890] # This is a duplicate MTMR2
new_rn[939] <- fData(glasEset)$Comment.AEReporterName[939] # This is a duplicate BAIAP2

# Reassign row names
rownames(glasEset) <- rownames(buyseEset) <- new_rn

To conduct TSP regression with the tdsm package, we need only familiarize ourselves with the tspreg_report function. TSPs (top scoring pairs) are rank-based features built from pairwise comparisons of genes (Geman et al., Stat. Appl. in Gen., 2004). At a minimum, the function takes a training dataset of gene expression values and an outcome of interest (currently, binary outcomes are fully supported). It then goes through the steps of building a decision tree predictor for this outcome and provides the tree plus an HTML report of the procedure.
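
As a rough illustration of what a rank-based pair feature looks like, the sketch below builds one on a toy matrix. The gene names, the values, and the helper pair_feature are made up for this example and are not part of tdsm.

```r
# Toy expression matrix: 4 genes (rows) x 6 samples (columns).
# Gene names and values are invented for illustration only.
set.seed(1)
expr <- matrix(rnorm(24), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("s", 1:6)))

# Hypothetical helper: the feature is 1 when gene_a is expressed above
# gene_b in a sample, and 0 otherwise.
pair_feature <- function(x, gene_a, gene_b) {
    as.integer(x[gene_a, ] > x[gene_b, ])
}

f12 <- pair_feature(expr, "gene1", "gene2")

# Only the within-sample ordering matters, so a per-sample shift and
# positive rescaling leaves the feature unchanged.
expr_shifted <- sweep(expr, 2, 1:6, "+") * 2
f12_shifted <- pair_feature(expr_shifted, "gene1", "gene2")
```

Here f12 and f12_shifted should agree, which is the property that makes pair features attractive when samples are measured on different scales.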

Below, we provide the glasEset data as the training data and the buyseEset data as validation data. The tree we train will be applied to the validation data, allowing us to get a better sense of how the predictor might perform on an external dataset. We also provide a file path and a report title (both optional).

tree <- tspreg_report(data=exprs(glasEset),
                      outcome=pData(glasEset)$FiveYearMetastasis,
                      val=exprs(buyseEset),
                      val_outcome=pData(buyseEset)$FiveYearRecurrence,
                      filepath="mammaprint_report.html",
                      title="MammaPrint Report")

The HTML report generated by this command ("mammaprint_report.html") appears in the "vignettes" folder in the tdsm package.
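
To make the train-then-validate idea concrete, here is a minimal sketch on simulated binary pair features. It assumes the rpart package as the decision tree implementation; whether tdsm uses rpart internally is not stated here, and the data, the make_data helper, and the variable names are all invented for illustration.

```r
library(rpart)  # assumption: stands in for whatever tree code tdsm uses

set.seed(42)

# Hypothetical helper: simulate two binary pair features and an outcome
# that agrees with pair1 about 90% of the time.
make_data <- function(n) {
    pair1 <- rbinom(n, 1, 0.5)
    pair2 <- rbinom(n, 1, 0.5)
    flip <- rbinom(n, 1, 0.1)
    outcome <- factor(ifelse(flip == 1, 1 - pair1, pair1), levels = c(0, 1))
    data.frame(pair1, pair2, outcome)
}

train <- make_data(100)  # plays the role of the training set
val <- make_data(50)     # plays the role of the external validation set

# Fit the tree on the training data only
fit <- rpart(outcome ~ pair1 + pair2, data = train, method = "class")

# Apply the fitted tree to the validation data and tabulate the results
pred <- predict(fit, newdata = val, type = "class")
conf_mat <- table(predicted = pred, observed = val$outcome)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
```

The validation accuracy is computed on samples the tree never saw, so it is a less optimistic estimate of performance than the training accuracy.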

Editing Templates

If we examine the HTML report, we notice in the initialization code chunk that two parameters of the analysis are prespecified: npairs and ec_pairs. These parameters control, respectively, the maximum number of TSPs that end up in the final decision tree model and the number of candidate pairs that are constructed within gene expression quantiles in the empirical controls feature selection step (described in the report).
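
One way to picture "candidate pairs constructed within gene expression quantiles" is sketched below. This is only a guess at the general idea and not tdsm's implementation; the number of bins, pairs_per_bin (a stand-in for ec_pairs), and the toy data are all assumptions.

```r
set.seed(7)

# Toy data: 40 genes x 10 samples, with the mean expression rising by gene
expr <- matrix(rnorm(40 * 10, mean = rep(1:40, times = 10) / 4), nrow = 40,
               dimnames = list(paste0("gene", 1:40), NULL))

n_bins <- 4         # hypothetical number of expression quantile bins
pairs_per_bin <- 5  # hypothetical stand-in for ec_pairs

# Assign each gene to a quantile bin of its mean expression
gene_means <- rowMeans(expr)
breaks <- quantile(gene_means, probs = seq(0, 1, length.out = n_bins + 1))
bin <- cut(gene_means, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Within each bin, enumerate all gene pairs and sample a fixed number
candidates <- do.call(rbind, lapply(seq_len(n_bins), function(b) {
    genes <- rownames(expr)[bin == b]
    all_pairs <- t(combn(genes, 2))
    keep <- sample(nrow(all_pairs), min(pairs_per_bin, nrow(all_pairs)))
    data.frame(bin = b, gene_a = all_pairs[keep, 1],
               gene_b = all_pairs[keep, 2], stringsAsFactors = FALSE)
}))
```

Pairing genes with similar average expression keeps the comparison gene_a > gene_b from being trivially constant across samples.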

Suppose that the user would like to modify these parameters. The procedure is as follows:

mytemplate_path <- duplicate_template("tsp")
# now we go and edit the template
mydiff_path <- diff_template("tsp", mytemplate_path)
submit_diff(mydiff_path)

tree2 <- tspreg_report(path = mytemplate_path,
                       train=exprs(glasEset),
                       outcome=pData(glasEset)$FiveYearMetastasis,
                       val=exprs(buyseEset),
                       val_outcome=pData(buyseEset)$FiveYearRecurrence,
                       title="Edited MammaPrint Report")

First, the user copies the TSP template to a file of their choosing (note: please specify a ".Rmd" file extension when you choose where to copy the template). The path to the copied template is saved in "mytemplate_path". Next, the user can edit the file as they like. In this example, we changed the maximum number of pairs in the model (npairs) to 2 and changed ec_pairs to 75.

The user can compare the edited template to the original using "diff_template". The output is saved as HTML, so please specify a ".html" extension for the file in which you choose to save the output. The path to this file is saved in "mydiff_path". If the user would like, they can upload the diff output as an anonymous GitHub Gist with "submit_diff". They can then share this URL easily to describe the changes they made. In the future, we would like to somehow capture this action.
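
For intuition about what a template diff records, the base R sketch below compares two small files line by line. It is not how "diff_template" works internally. The file contents are hypothetical, and in particular the "original" parameter values are invented placeholders, not the template's real defaults.

```r
# Hypothetical template contents; the original values shown here are
# placeholders and do not reflect the real tdsm template.
original <- c("npairs <- 5", "ec_pairs <- 40", "# rest of template")
edited <- c("npairs <- 2", "ec_pairs <- 75", "# rest of template")

# Write both versions to temporary .Rmd files and read them back
orig_file <- tempfile(fileext = ".Rmd")
edit_file <- tempfile(fileext = ".Rmd")
writeLines(original, orig_file)
writeLines(edited, edit_file)
a <- readLines(orig_file)
b <- readLines(edit_file)

# Position-wise comparison: only valid when no lines were added or removed
stopifnot(length(a) == length(b))
changed <- which(a != b)
line_diff <- data.frame(line = changed, original = a[changed],
                        edited = b[changed], stringsAsFactors = FALSE)
print(line_diff)
```

A real diff tool also handles inserted and deleted lines, which this position-wise comparison does not.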

Finally, we can re-run the "tspreg_report" command. This time, we provide the path to our edited template in the "path" argument. The same procedure as before is carried out, except that the edited template is used to conduct the analysis. The result of this step ("mam_edited.html") is also saved in the "vignettes" folder of the tdsm package, along with the diff HTML output ("mam_diff.html").



prpatil/tdsm documentation built on May 26, 2019, 10:32 a.m.