In ArtemSokolov/ABCmonster: Prediction of ABC-transporter-dependent efficacy using drug chemical structures

We begin by loading the relevant libraries and data.

library( tidyverse )
library( ABCmonster )
data( MACCSbinary )

Univariate analysis

As a precursor to training machine learning models, we first measure association with the sensitive / resistant labels by examining one feature at a time. This is done by applying Fisher's exact test. Because the test is applied multiple times, we also compute FDR using the Benjamini & Hochberg procedure.

## Look at training data only (as determined by Label not being NA)
P <- MACCSbinary %>% filter( !is.na(Label) ) %>%
    ## Apply Fisher's Exact test against the Label column for each MACCS feature
    summarize_at( vars(`MACCS(--8)`:`MACCS(165)`), ~fisher.test(.x, .y)$p.value, .$Label ) %>%
    ## Compute the associated FDR
    gather( Feature, pval ) %>% mutate( FDR = p.adjust(pval, "fdr") ) %>% arrange(FDR)

Looking at the most significant features, we note that six MACCS keys have an FDR below 0.05.

filter( P, FDR < 0.05 )

Let's plot contingency tables for these keys.

## Isolate the training data and reshape the data frame to be in long format
CT <- MACCSbinary %>% filter( !is.na(Label) ) %>%
    select( -Drug, -pubchem_id ) %>% gather( Feature, Count, -Label ) %>%
    ## Merge against previously-computed p-values and select features with FDR < 0.05
    inner_join( P, by="Feature" ) %>% filter( FDR < 0.05 ) %>%
    ## Compute contingency table for each Label / Feature pair
    group_by( Label, Feature ) %>% summarize( Yes = sum(Count), No = n() - sum(Count) ) %>%
    ungroup %>% gather( Category, Count, -Label, -Feature )

## Define the size legend to be used by the plot
sg <- guide_legend( title = "# Drugs", override.aes=list(fill="gray50") )

## Plot the contingency tables
ggplot( CT, aes( y = Category, x = Label, size = Count, fill = Label, alpha = Category ) ) +
    geom_point( shape=21, color="black" ) + theme_bw() +
    facet_wrap( ~Feature, ncol=2 ) +
    scale_size_continuous( range = c(2,22), breaks = c(15, 25, 50, 100, 200 ),
                          guide = sg ) +
    scale_fill_manual( values = c( Resistant="tomato", Sensitive="steelblue" ), guide=FALSE ) +
    scale_alpha_manual( values = c( No=0.5, Yes=1 ), guide=FALSE ) +
    xlab( "Effective against ABC-16" ) + ylab( "Key present" )

Cross-validation

Before making predictions on the test data, we would like to select the appropriate algorithm. To do so, we run cross-validation to compare a number of popular methods: k-nearest neighbors (k-NN), gradient-boosted random forests (GBM), logistic regression with elastic net regularization (Log.Reg.), a support vector machine (SVM), and a neural network (NNet). The ABCmonster package provides a convenient function to evaluate all five methods on the same train-test split.

## Fix the random seed to ensure reproducibility of this vignette
set.seed( 1 )
CV <- ABCcv( MACCSbinary )

Let's look at the first few rows of the output

head(CV)

ABCcv returns the probability that a sample is Sensitive, as computed by individual methods when the sample is withheld in a test fold. The probabilities are averaged across parameter grids that are specific to individual methods. We can now use this information to construct ROC curves and compute AUC values. Note that in the paper, cross-validation is repeated 100 times to build robust estimates.

To construct an ROC curve, we use the columns Label and Pred to compute running true positive and false positive rates.

ROC <- CV %>% group_by( Method ) %>% arrange( desc(Pred) ) %>%
    mutate( tpr = cumsum(Label == "Sensitive") / sum(Label == "Sensitive"),
           fpr = cumsum(Label == "Resistant") / sum(Label == "Resistant") )
gg <- ggplot( ROC, aes(x=fpr, y=tpr, color=Method) ) + theme_bw() +
    geom_smooth(se=FALSE) + geom_abline( slope=1, linetype="dashed" ) +
    xlab( "False Positive Rate" ) + ylab( "True Positive Rate" ) +
    scale_color_manual( values=c("tomato", "#C3B56B", "salmon4", "darkolivegreen", "steelblue") )
plotly::ggplotly(gg)

library( plotly )
ROC <- CV %>% group_by( Method ) %>% arrange( desc(Pred) ) %>%
    mutate( tpr = cumsum(Label == "Sensitive") / sum(Label == "Sensitive"),
           fpr = cumsum(Label == "Resistant") / sum(Label == "Resistant") )
gg <- ggplot( ROC, aes(x=fpr, y=tpr, color=Method) ) + theme_bw() +
    geom_smooth(se=FALSE) + geom_abline( slope=1, linetype="dashed" ) +
    xlab( "False Positive Rate" ) + ylab( "True Positive Rate" ) +
    scale_color_manual( values=c("tomato", "#C3B56B", "salmon4", "darkolivegreen", "steelblue"),
                       guide=FALSE)
ggplotly(gg, width=600) %>%
    add_annotations( text="Method", xref="paper", yref="paper",
                    x=1.02, xanchor="left",
                    y=0.8, yanchor="bottom",    # Same y as legend below
                    legendtitle=TRUE, showarrow=FALSE ) %>%
    layout( legend=list(y=0.8, yanchor="top" ) ) %>%
    htmltools::div( align="center" )

To compute area under the ROC curves, we make use of Equation (3) in Hand and Till, 2001, which does not require an explicit construction of the curves. We encapsulate this equation into an R function and apply it directly to the sample ranking:

auc <- function( pred, lbls )
{
    np <- sum( lbls == "Sensitive" )
    nn <- sum( lbls == "Resistant" )
    s0 <- sum( rank(pred)[lbls == "Sensitive"] )
    (s0 - np*(np+1) / 2) / (np*nn)
}
ROC %>% summarize( AUC = auc( Pred, Label ) )

As mentioned above, we repeated cross-validation 100 times in the paper, and the AUC estimates were averaged across those 100 runs.

Retrieving feature importance scores

Based on the cross-validation results, we choose gradient boosted machines (GBM) as the primary method for subsequent analyses. Before applying the method to make predictions on new data, we may be interested to know what features are being utilized by it. We re-run cross-validation for GBM and isolate the model that is associated with parameter values that yield the best performance. We then interrogate this model for feature imporance scores and plot the top 20.

## Fix the seed to allow vignette reproducibility
set.seed(100)

## Isolate the training data (defined by labels not being NA)
## and apply cross-validation to predict Label from MACCS columns
X1 <- MACCSbinary %>% filter( !is.na(Label) ) %>% select( -Drug, -pubchem_id )
cv <- caret::train( Label ~ ., data=X1, method="gbm", verbose=FALSE,
            trControl = caret::trainControl(method="cv") )
FImp <- summary( cv$finalModel, plot = FALSE ) %>% arrange( desc(rel.inf) ) %>%
    head(20) %>% mutate( Feature = factor(var,var), Score = rel.inf )
ggplot( FImp, aes( x=Feature, y=Score ) ) + theme_bw() +
    geom_bar( fill = "steelblue", alpha=0.8, stat="identity" ) +
    theme( axis.text.x = element_text( angle=90, hjust=1, vjust=0.5 ) )

ArtemSokolov/ABCmonster documentation built on May 25, 2019, 7:16 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

ArtemSokolov/ABCmonster
Prediction of ABC-transporter-dependent efficacy using drug chemical structures

In ArtemSokolov/ABCmonster: Prediction of ABC-transporter-dependent efficacy using drug chemical structures

Univariate analysis

Cross-validation

Retrieving feature importance scores

R Package Documentation

Browse R Packages

We want your feedback!

ArtemSokolov/ABCmonster Prediction of ABC-transporter-dependent efficacy using drug chemical structures

In ArtemSokolov/ABCmonster: Prediction of ABC-transporter-dependent efficacy using drug chemical structures

Univariate analysis

Cross-validation

Retrieving feature importance scores

R Package Documentation

Browse R Packages

We want your feedback!

ArtemSokolov/ABCmonster
Prediction of ABC-transporter-dependent efficacy using drug chemical structures