Building and evaluating a small gene signature from MammaPrint data using sig2trial

The goal of the sig2trial package is to build small, interpretable gene signatures in a standardized fashion. To this end, we have locked down many of the decisions that one makes in the course of creating and examining a prediction algorithm.

Let's first load the library:

library(sig2trial)

With the library we have included two datasets curated by Marchionni et al. (2013, BMC Genomics). These data were used to train and test the MammaPrint 70-gene signature for predicting the risk of distal recurrence in breast cancer (Van de Vijver et al., 2002, NEJM). We remapped the gene IDs to gene symbols. The remapping is already applied in the packaged data, so the code below need not be run; it is included for reference only. We use the org.Hs.eg.db library from Bioconductor to perform the annotations.

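# Install annotation library from Bioconductor
# (biocLite() has been retired; BiocManager is the current installer)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("org.Hs.eg.db")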

# Load annotation library
library(org.Hs.eg.db)

# This database maps Entrez Gene IDs to Gene Symbols
y <- org.Hs.egSYMBOL
mapped_genes <- mappedkeys(y)
yy <- as.list(y[mapped_genes])

# This database maps GenBank accession numbers to Entrez Gene IDs
z <- org.Hs.egACCNUM2EG
mapped_genes <- mappedkeys(z)
zz <- as.list(z[mapped_genes])

# These contain unique IDs or GenBank accession numbers in the training dataset
tmp_rn <- fData(glasEset)$Comment.AEReporterName

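# Pre-allocate the new row names (one per reporter)
new_rn <- vector("character", length(tmp_rn))

# Map accession -> Entrez ID -> symbol, falling back to the reporter name
for(i in seq_along(tmp_rn)){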
    cur <- zz[[tmp_rn[i]]]
    cur_sym <- ifelse(is.null(cur), tmp_rn[i], yy[[cur]])
    new_rn[i] <- ifelse(is.null(cur_sym), tmp_rn[i], cur_sym)
}

# Manual touch-ups for duplicated gene symbols (duplicate accession numbers or gene symbols are possible)
new_rn[147] <- fData(glasEset)$Reporter.Database.Entry.embl[147]
new_rn[160] <- fData(glasEset)$Comment.AEReporterName[160]
new_rn[890] <- fData(glasEset)$Comment.AEReporterName[890] # This is a duplicate MTMR2
new_rn[939] <- fData(glasEset)$Comment.AEReporterName[939] # This is a duplicate BAIAP2

# Reassign row names
rownames(glasEset) <- rownames(buyseEset) <- new_rn
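
The fallback logic in the loop above can be tried out without the annotation database. The following is a minimal sketch that assumes two small named lists standing in for zz and yy; the accession numbers, Entrez IDs, gene symbols, and the helper map_reporter are all hypothetical and only illustrate the two-step lookup.

```r
# Toy stand-ins for the annotation lists (hypothetical values, not real annotations)
acc2entrez <- list(ACC_1 = "101", ACC_2 = "102")  # plays the role of zz
entrez2sym <- list("101" = "GENE_A")              # plays the role of yy; "102" has no symbol

# Reporter names: two accession numbers and one ID with no mapping at all
reporters <- c("ACC_1", "ACC_2", "Contig_9")

# Hypothetical helper: accession -> Entrez ID -> symbol, keeping the
# original reporter name whenever either lookup comes back empty
map_reporter <- function(id) {
  entrez <- acc2entrez[[id]]
  if (is.null(entrez)) return(id)
  sym <- entrez2sym[[entrez]]
  if (is.null(sym)) id else sym
}

new_names <- vapply(reporters, map_reporter, character(1), USE.NAMES = FALSE)
print(new_names)
```

Only the first reporter resolves to a symbol here; the other two keep their original names, which is the same behavior the loop relies on for unmapped reporters.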

To use the sig2trial package, we need only familiarize ourselves with the tspreg_report function. At a minimum, this function takes as input a training data set of gene expression values, an outcome of interest (currently binary outcomes are fully supported), and the maximum number of Top-Scoring Pair (TSP) features we would like to allow in our final model. TSPs are rank-based features that operate on pairwise comparisons of genes (Geman et al., Stat. Appl. in Gen., 2004). The function then goes through the steps of building a decision tree predictor for this outcome and provides the tree plus an HTML report of the procedure. In the report we additionally state an approximate out-of-sample accuracy as well as a sense of the value of this predictor as an adjustment covariate in a clinical trial (the "trial" part of the package). The value is represented as the percent gain in precision we might expect if we estimated a treatment effect between two arms of a clinical trial and adjusted our estimator with information from our predictions. The percent gain stated here is merely a quick, crude estimate of this value.
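
To make the idea of a pairwise, rank-based feature concrete, here is a minimal sketch on simulated data. It is not the feature-selection code used inside sig2trial; the gene names, the simulated matrix, and the helper pair_features are assumptions made for illustration only.

```r
set.seed(1)

# Simulated expression matrix: 4 hypothetical genes (rows) by 10 samples (columns)
expr <- matrix(rnorm(40), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("s", 1:10)))

# A single pair feature: 1 if gene1 is expressed above gene2 in a sample, else 0
pair_feature <- as.integer(expr["gene1", ] > expr["gene2", ])

# Hypothetical helper: the same comparison for every pair of genes,
# returned as a samples-by-pairs matrix of 0/1 indicators
pair_features <- function(x) {
  pairs <- combn(rownames(x), 2)
  out <- apply(pairs, 2, function(p) as.integer(x[p[1], ] > x[p[2], ]))
  colnames(out) <- paste(pairs[1, ], pairs[2, ], sep = ">")
  out
}

features <- pair_features(expr)
dim(features)

# Because only within-sample ordering matters, a monotone transformation
# of the expression values leaves every feature unchanged
identical(features, pair_features(exp(expr)))
```

Each feature depends only on which of two genes is higher within a sample, so it is insensitive to monotone rescaling of that sample's measurements.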

Below, we provide the glasEset data as the training data and also provide the buyseEset as validation data. The tree we train will be applied to the validation data, allowing us to get a better sense of how the predictor might perform on an external dataset. We specify 5 pairs (npair = 5) as the maximum number of TSP features to use in our tree, and we optionally provide a file path and a title for the report.

tree <- tspreg_report(data = exprs(glasEset),
                      outcome = pData(glasEset)$FiveYearMetastasis,
                      val = exprs(buyseEset),
                      val_outcome = pData(buyseEset)$FiveYearRecurrence,
                      npair = 5,
                      filepath = "mammaprint_report.html",
                      title = "MammaPrint Report")

The report generated by this command appears in the "vignettes" folder in the sig2trial package.
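
One of the quantities the report summarizes is the percent gain in precision from adjusting a treatment effect estimate for the prediction. The simulation below is a rough sketch of that idea and not the calculation performed by sig2trial: it assumes a continuous outcome, a hypothetical binary prediction, and one common definition of the gain, 100 * (1 - Var(adjusted) / Var(unadjusted)). All effect sizes and sample sizes are made up.

```r
set.seed(42)
n_rep <- 500  # number of simulated trials
n     <- 200  # patients per trial

est <- replicate(n_rep, {
  trt  <- rbinom(n, 1, 0.5)                # randomized treatment arm
  pred <- rbinom(n, 1, 0.5)                # hypothetical binary prediction (high vs. low risk)
  y    <- 0.5 * trt + 2 * pred + rnorm(n)  # continuous outcome (a simplification)
  c(unadjusted = unname(coef(lm(y ~ trt))["trt"]),
    adjusted   = unname(coef(lm(y ~ trt + pred))["trt"]))
})

# Compare the variability of the two treatment effect estimators across trials
var_unadj <- var(est["unadjusted", ])
var_adj   <- var(est["adjusted", ])
pct_gain  <- 100 * (1 - var_adj / var_unadj)
pct_gain
```

The more strongly the prediction tracks the outcome, the more outcome variance the adjustment removes and the larger the gain; a prediction unrelated to the outcome would give a gain near zero.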

Vignette Info

Note the various macros within the vignette section of the metadata block above. These are required in order to instruct R how to build the vignette. Note that you should change the title field and the \VignetteIndexEntry to match the title of your vignette.

Styles

The html_vignette template includes a basic CSS theme. To override this theme you can specify your own CSS in the document metadata as follows:

output: 
  rmarkdown::html_vignette:
    css: mystyles.css

Figures

The figure sizes have been customised so that you can easily put two images side-by-side.

plot(1:10)
plot(10:1)

You can enable figure captions by setting fig_caption: yes in the YAML metadata:

output:
  rmarkdown::html_vignette:
    fig_caption: yes

Then you can use the chunk option fig.cap = "Your figure caption." in knitr.

More Examples

You can write math expressions, e.g. $Y = X\beta + \epsilon$, footnotes^[A footnote here.], and tables, e.g. using knitr::kable().

knitr::kable(head(mtcars, 10))

Also a quote using >:

"He who gives up [code] safety for [code] speed deserves neither." (via)



leekgroup/sig2trial documentation built on May 20, 2019, 11:31 p.m.