# `r title`

# Introduction

The following code builds a decision tree model that uses top-scoring pair (TSP) features as predictors [@geman2004classifying]. TSPs are rank-based features that consider the relationship between two genes rather than their raw expression levels. For two genes $g_1$ and $g_2$, the feature of interest is the indicator of whether $g_2$ is more highly expressed than $g_1$: $I(g_1 < g_2)$.

We will build a model based on pairs of this form to predict patient outcomes.
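
For illustration, the TSP indicator for a single candidate pair can be computed directly from an expression matrix with genes in rows and samples in columns; the matrix and gene names below are hypothetical:

```r
## Toy expression matrix: 3 genes (rows) x 4 samples (columns); values are hypothetical
expr <- matrix(c(5.1, 2.3, 7.8,
                 4.0, 6.2, 7.5,
                 3.3, 3.9, 1.2,
                 6.6, 1.0, 2.4),
               nrow = 3,
               dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:4)))

## TSP feature I(geneA < geneB): 1 when geneB is more highly expressed than geneA
tsp_feature <- as.numeric(expr["geneA", ] < expr["geneB", ])
tsp_feature
#> [1] 0 1 1 0
```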

To begin, we set an initial random seed to ensure reproducibility of our results.

```r
set.seed(seed)  # `seed` is supplied when the report is generated
predtype <- ifelse(is.factor(data$outcome), "class", "vector")  # prediction type for the tree
npair <- 5      # number of TSP features to select for the final model
ec_pairs <- 40  # pairs kept per quantile by the empirical-controls filter
```

# Training and Testing Data

If validation data have been supplied, we will build our model on the training data and make predictions on the validation data. If not, we will split the provided dataset into training and testing sets, build the model on the training set, and predict on the testing set. This separation ensures that our performance estimates are not inflated by overfitting to the training data.

```r
if(is.null(data$val)){
    ## Split the provided data into 3/4 training, 1/4 testing
    idx <- sample(1:ncol(data$train), floor(ncol(data$train)/4))
    train <- data$train[,-idx]
    test <- data$train[,idx]
    train_outcome <- data$outcome[-idx]
    test_outcome <- data$outcome[idx]
    if(!is.null(data$covar)){
        train_covar <- data$covar[-idx,]
        test_covar <- data$covar[idx,]
    } else {
        train_covar <- test_covar <- NULL
    }
} else {
    ## Validation data supplied: train on all training data, predict on validation
    train <- data$train
    test <- data$val
    train_outcome <- data$outcome
    test_outcome <- data$val_outcome
    train_covar <- data$covar
    test_covar <- data$val_covar
}
```

# Feature Selection

We employ two steps to find the most predictive gene pair features for our model. These feature selection steps are necessary because the number of pairwise combinations of genes is far too large to consider exhaustively.

1) Empirical controls feature selection - This is a filtration step that does not yet consider the relationship between the features and the outcome. We want gene pairs $g_1$, $g_2$ such that the TSP $I(g_1 < g_2)$ is informative. Our quick approach to finding candidate predictors is to focus on pairs in which one gene varies little across samples while the other varies a lot. To find such pairs, we rank all of the genes by their average expression and break them into groups by quantiles. Within each quantile group, we find the most and least variable genes and create all possible pairs from those two sets. Pooling over all quantile groups gives an initial set of pairs to consider. Through trials across many datasets, we have found that taking the top and bottom 40 such pairs within each quantile provides a good initial feature base (a rough sketch of this idea appears after the code below).

```r
pairs <- empirical_controls(train, ec_pairs)

## Drop pairs whose indicator is constant across all samples - all 0s or all 1s - if any exist
rmp <- which(rowMeans(pairs) == 1 | rowMeans(pairs) == 0)
if(length(rmp) > 0){
    pairs <- pairs[-rmp,]
}
```
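
As a rough illustration only (not the package's exact implementation), the quantile-binning idea behind `empirical_controls()` could be sketched as:

```r
## Conceptual sketch of the empirical-controls filter; the actual
## empirical_controls() implementation may differ.
## expr: genes x samples matrix; n_top: genes kept per extreme; n_bins: quantile groups
ec_sketch <- function(expr, n_top = 40, n_bins = 10) {
  avg <- rowMeans(expr)            # rank genes by average expression
  vars <- apply(expr, 1, var)      # within-gene variability
  bins <- cut(rank(avg), breaks = n_bins, labels = FALSE)
  pairs <- list()
  for (b in seq_len(n_bins)) {
    genes <- which(bins == b)
    ord <- genes[order(vars[genes])]
    stable <- head(ord, n_top)     # least variable genes: candidate controls
    variable <- tail(ord, n_top)   # most variable genes: candidate signal
    pairs[[b]] <- expand.grid(g1 = stable, g2 = variable)
  }
  do.call(rbind, pairs)            # candidate (g1, g2) index pairs
}
```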

2) Regression-based feature selection - In step 2, we pare down to a final set of features of a predetermined size, this time using the outcome to guide us. We judge how predictive a particular pair is of the outcome via a series of nested regression models: when evaluating each candidate pair, we include all previously chosen pairs in the model, so we select the pair that offers the most additional information about the outcome given the pairs already chosen.
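
A minimal sketch of this forward-selection idea, assuming a binary outcome and logistic regression (the package's internal scoring may differ):

```r
## Forward selection of TSP features via nested logistic regressions.
## A sketch of the idea only; the criterion inside the package may differ.
## pair_mat: pairs x samples 0/1 matrix; y: binary outcome; k: pairs to keep
forward_select <- function(pair_mat, y, k = 5) {
  chosen <- integer(0)
  remaining <- seq_len(nrow(pair_mat))
  for (i in seq_len(k)) {
    ## Deviance of the model with all chosen pairs plus each candidate
    dev <- sapply(remaining, function(p) {
      X <- t(pair_mat[c(chosen, p), , drop = FALSE])
      deviance(glm(y ~ X, family = binomial))
    })
    best <- remaining[which.min(dev)]  # candidate adding the most information
    chosen <- c(chosen, best)
    remaining <- setdiff(remaining, best)
  }
  chosen
}
```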

# Cross-validation

We use 5-fold cross-validation to estimate the out-of-sample accuracy of our predictor. This means that we further break up the training data into training and testing subsets and carry out the same procedure described above. We split the data 5 different times and determine the accuracy of the model built each time. We then take the average of these accuracies as an indication of how well we think our final model, which will be built on the entire training data set, will do on an external data set.
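
Schematically, the loop looks like the sketch below, where `fit_fold()` and `predict_fold()` are hypothetical placeholders for the pair-selection and tree-fitting steps that `tsp_model_builder()` performs internally:

```r
## Sketch of the 5-fold cross-validation behind the accuracy estimate.
## fit_fold() and predict_fold() are hypothetical placeholders.
folds <- sample(rep(1:5, length.out = ncol(train)))
cv_acc <- sapply(1:5, function(f) {
  fit <- fit_fold(train[, folds != f], train_outcome[folds != f])
  preds <- predict_fold(fit, train[, folds == f])
  mean(preds == train_outcome[folds == f])   # accuracy on the held-out fold
})
mean(cv_acc)  # average over folds: the out-of-sample accuracy estimate
```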

The following code executes the entire TSP model-building procedure.

```r
## Build the decision tree on the entire training data and obtain an out-of-sample accuracy estimate
model_out <- tsp_model_builder(train, train_outcome, train_covar, pairs, test, test_covar, npair, predtype)
```

# Results

## Decision Tree Model

Displayed below is the final decision tree prediction model built on the training data. The chosen pairs are labeled below the tree. Note that fewer than the desired number of pairs may appear in the final model; the fitting procedure keeps only the most parsimonious set.

Think of the decision tree as a cascading set of choices that results in a predicted probability of the outcome occurring. Start at the top: for a particular patient, the first question is whether gene 2 in the pair at the top of the tree is more highly expressed than gene 1. If this is true, we drop down to the left and examine the next decision; if false, we drop down to the right. We keep considering subsequent decisions, going left or right, until we reach a terminal node with no further decisions. That node gives the predicted probability of the outcome occurring for this particular patient.
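
For example, once the tree is fit, a new patient's predicted probability can be obtained by running their TSP indicators through the model; here `new_pairs` is a hypothetical data frame of 0/1 indicator features named as in training, and `model_out$tree` is assumed to be a fitted `rpart` object:

```r
## Hypothetical example: predicted probability of the outcome for one patient.
## new_pairs is an assumed data.frame of 0/1 TSP indicators with the training feature names.
predict(model_out$tree, newdata = new_pairs[1, , drop = FALSE], type = "prob")
```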

```r
## drawTreeNodes() requires the rattle package
drawTreeNodes(model_out$display_tree, digits=2)
#plot(model_out$display_tree)
```

Gene pairs at each node in the model:

```r
df <- data.frame("Tree Node Label" = model_out$pair_names,
                 "Feature Name" = gsub("<", " \\< ", model_out$final_names),
                 check.names = FALSE)
kable(df)  # kable() is from the knitr package
##cat(paste0(model_out$pair_names, ": ", model_out$final_names, "\n", collapse="\n"))
```

## Out-of-Sample Accuracy

The accuracy displayed here represents our estimate of the accuracy of our predictor in an external dataset. We obtained this estimate from cross-validation, as described above.

```r
cat(paste0("Out-of-sample accuracy: ", round(mean(model_out$acc), digits=3)))
```

## Resubstitution and Test/Validation ROC Curves

An ROC curve is a representation of the sensitivity and specificity (or true positive and true negative rates) of our predictor. We present curves for the training and testing datasets. To obtain these curves, we make predictions with our final decision tree model for all patients in the training and testing sets, so each patient is assigned a probability of the outcome occurring, from 0 to 1. We then choose a threshold between 0 and 1 and classify anybody with a probability below the threshold as "no outcome occurring" and anybody at or above it as "outcome occurring". Comparing these predictions to the true outcomes, we calculate how many occurrences we got correct (sensitivity) and how many non-occurrences we got correct (specificity). To trace the full curve, we vary the threshold from 0 to 1 and redo this calculation at each threshold (a by-hand version of this calculation is sketched below).
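
To make the construction concrete, here is how a single point on the curve could be computed by hand at one threshold (the plots below use the pROC package instead):

```r
## One ROC point computed by hand; probs: predicted probabilities, truth: 0/1 outcomes
roc_point <- function(probs, truth, threshold = 0.5) {
  pred <- as.numeric(probs >= threshold)
  sens <- sum(pred == 1 & truth == 1) / sum(truth == 1)  # true positive rate
  spec <- sum(pred == 0 & truth == 0) / sum(truth == 0)  # true negative rate
  c(sensitivity = sens, specificity = spec)
}
## Varying `threshold` over [0, 1] traces out the full curve
```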

Our curves also come with pointwise 95% confidence intervals that are calculated via bootstrap. These give us a sense of how certain we are about the curve at every threshold value.

```r
library(pROC)  # provides plot.roc() and ci.se()

options(bitmapType="cairo")

## Semi-transparent light blue for the bootstrap confidence bands
lb <- rgb(154, 192, 205, alpha=0.6*255, maxColorValue=255)

roc_train <- plot.roc(train_outcome, model_out$p_train, main=paste0("Training Data ROC (n=", length(train_outcome), ")"), legacy.axes=TRUE)
ci_train <- ci.se(roc_train, progress="none")
plot(ci_train, type="shape", col=lb)

title <- ifelse(is.null(data$val), "Test Data ROC", "Validation Data ROC")
title <- paste0(title, " (n=", length(test_outcome), ")")

roc_test <- plot.roc(test_outcome, model_out$p_test, main=title, legacy.axes=TRUE)
ci_test <- ci.se(roc_test, progress="none")
plot(ci_test, type="shape", col=lb)
```

## Precision Gain in a Clinical Trial Setting

Our final goal is to get a sense of how much value this gene signature would provide in a clinical trial setting. We focus on this setting because we have methods, developed by Colantuoni et al. [@colantuoni2015leveraging], that allow us to approximate how much more precise a treatment effect estimate might be in a trial where we could apply this predictor to all subjects. It is one means of examining how much additional variability in our outcome of interest we might be able to explain with this gene signature.
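
As a rough intuition only (not necessarily the calculation inside `sig_precision_crude()`), the squared correlation between the predicted probabilities and a numeric outcome approximates the share of outcome variance the signature could explain:

```r
## Rough intuition only - not necessarily what sig_precision_crude() computes.
## Squared correlation between predictions and a numeric 0/1 outcome
## approximates the proportion of outcome variance explained.
cor(as.numeric(test_outcome), model_out$p_test)^2
```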

```r
sig_precision <- sig_precision_crude(test_outcome, model_out$p_test, test_covar)
sig_precision
```

The results above suggest that if we adjusted for our predictor and estimated a treatment effect in a clinical trial for our outcome of interest, we could improve the precision of our estimator by `r round(sig_precision*100, digits=2)`%.

```r
## Return the tree model as output
return_elements <- model_out$tree
```

# End Matter

sig2trial report v0.9. This report was built using the sig2trial R package.

Authors: Prasad Patil, Jeff Leek

R session info:

```r
sessionInfo()
```
