Accessing the data

The package contains the training and test datasets described in the paper. You can access them directly via the data() function after loading the package:

library( ABCmonster )
data( MACCSbinary )
## The data is now available as MACCSbinary

Looking at the first few rows and columns gives the general idea of how the data is structured.

dim( MACCSbinary )
head( MACCSbinary[,1:6] )

The first column, Label, contains annotations for each drug specifying whether the drug is more ("Sensitive") or less ("Resistant") efficacious in the ABC-16 strain, relative to the parental strain. Test data is located in the last 24 rows of MACCSbinary and contains NAs in the Label column:

tail( MACCSbinary[,1:6] )

Drug identity is captured in the second (Drug) and third (pubchem_id) columns. All remaining columns contain molecular features, which in this case consist of binary calls on whether a particular MACCS key is present in a drug's structure.
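The layout described above can be illustrated with a small, self-contained sketch in base R. The toy data frame below is a hypothetical stand-in for MACCSbinary (the drug names, identifiers and key names are invented); it shows one way to separate labeled training rows from unlabeled test rows and to pull out the feature columns.

```r
# Toy stand-in for MACCSbinary (hypothetical values and key names)
toy <- data.frame(
    Label      = c("Sensitive", "Resistant", "Sensitive", NA, NA),
    Drug       = c("drugA", "drugB", "drugC", "drugD", "drugE"),
    pubchem_id = c("101", "102", "103", "104", "105"),
    MACCS_1    = c(0, 1, 1, 0, 1),
    MACCS_2    = c(1, 1, 0, 0, 1),
    stringsAsFactors = FALSE
)

# Training rows have a label; test rows have NA in the Label column
train <- toy[ !is.na(toy$Label), ]
test  <- toy[  is.na(toy$Label), ]

# Everything except the three annotation columns is a molecular feature
featureCols <- setdiff( colnames(toy), c("Label", "Drug", "pubchem_id") )
Xtr <- as.matrix( train[, featureCols] )

table( train$Label )
dim( Xtr )
```

The same indexing should carry over to the real data frame, assuming its annotation columns are named exactly Label, Drug and pubchem_id as described above.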

SMILES representations of drugs

The package also contains SMILES strings for all 400 drugs considered in the paper. These strings are stored as a separate data frame and can also be retrieved using the data() function.

data( ABCSMILES )
head( ABCSMILES )

Note that we provide both Isomeric SMILES strings, which preserve stereochemistry information, and Canonical SMILES strings, where this information is stripped. Both sets of SMILES strings were obtained directly from PubChem, using their PUG REST API.
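To make the difference between the two representations concrete, here is a minimal sketch on a toy data frame. The column names (Drug, Isomeric, Canonical) and the two example molecules are illustrative assumptions, not the actual contents of ABCSMILES; check the real column names with colnames( ABCSMILES ) before adapting it.

```r
# Toy stand-in for ABCSMILES; column names here are hypothetical
smi <- data.frame(
    Drug      = c("ethanol", "L-alanine"),
    Isomeric  = c("CCO", "C[C@@H](C(=O)O)N"),
    Canonical = c("CCO", "CC(C(=O)O)N"),
    stringsAsFactors = FALSE
)

# Drugs whose two SMILES strings differ carry stereochemistry annotations
smi$hasStereo <- smi$Isomeric != smi$Canonical

# Chiral centers are marked with "@" or "@@" in SMILES; count those marks
smi$nChiralMarks <- lengths( regmatches(smi$Isomeric, gregexpr("@+", smi$Isomeric)) )

smi[, c("Drug", "hasStereo", "nChiralMarks")]
```

A simple string comparison like this only flags that stereochemistry is annotated; it does not interpret it. A cheminformatics toolkit would be needed for anything beyond that.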

Visualizing the data

Let's begin by plotting a heatmap that presents an overview of the entire dataset. We will make extensive use of tidy data manipulation. As the first step, we load the appropriate libraries and define a palette for subsequent plots.

library( tidyverse )
library( plotly )
ABCpal <- c("Resistant" = "tomato", "Sensitive" = "steelblue")

Isolate the training data by selecting the set of rows with non-NA labels. HH will contain the matrix to be plotted as the heatmap, while Annots will store the labels that are displayed as an additional annotation bar at the top of the heatmap. The plotting is done using the pheatmap package.

Xtrain <- MACCSbinary %>% filter( !is.na(Label) )
HH <- Xtrain %>% select(-Label, -Drug) %>% column_to_rownames("pubchem_id") %>% as.matrix
Annots <- select( Xtrain, pubchem_id, Label ) %>% column_to_rownames( "pubchem_id" )
pheatmap::pheatmap( t(HH), show_rownames=FALSE, show_colnames=FALSE,
         annotation_col=Annots, annotation_names_col=FALSE, legend=FALSE,
         color=c("gray90","gray30"), annotation_colors=list(Label=ABCpal) )

The version of this figure in the paper includes additional post-processing consisting of leaf order optimization and grouping of rows.

Dimensionality reduction

Lastly, we can perform dimensionality reduction to explore the general trends in the data.

## Compute PCA
P <- Xtrain %>% select( -Label, -Drug, -pubchem_id ) %>% prcomp() %>%
    broom::augment( Xtrain ) %>% select( Label, PC1=.fittedPC1, PC2=.fittedPC2 )

## Plot the projection onto the first two principal components
gg <- ggplot( P, aes(x=PC1, y=PC2, color=Label) ) + geom_point() +
    theme_bw() + scale_color_manual( values=ABCpal )
ggplotly(gg)

## Interactive variant: keep Drug and pubchem_id so they can be shown as tooltips
P <- Xtrain %>% select( -Label, -Drug, -pubchem_id ) %>% prcomp() %>%
    broom::augment( Xtrain ) %>% rename( PC1 = .fittedPC1, PC2 = .fittedPC2 )
## guide="none" replaces the deprecated guide=FALSE
gg <- ggplot( P, aes(x=PC1, y=PC2, color=Label, lbl1=Drug, lbl2=pubchem_id) ) +
    geom_point() + theme_bw() + scale_color_manual( values=ABCpal, guide="none" )
ggplotly(gg, tooltip=c("lbl1", "lbl2"), width=600) %>%
    add_annotations( text="Label", xref="paper", yref="paper",
                    x=1.02, xanchor="left",
                    y=0.8, yanchor="bottom",    # Same y as legend below
                    legendtitle=TRUE, showarrow=FALSE ) %>%
    layout( legend=list(y=0.8, yanchor="top" ) ) %>%
    htmltools::div( align="center" )

We observe that the unsupervised analysis is unable to distinguish between the class labels. In the next vignette, we will investigate this data in the context of supervised methods.
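When reading a two-component projection, it can help to check how much of the total variance those two components actually capture. The sketch below does this on a randomly generated binary matrix, which is a hypothetical stand-in for the MACCS features; the numbers it produces say nothing about the real data.

```r
set.seed(1)
# Toy binary feature matrix: 20 hypothetical drugs by 8 hypothetical keys
X <- matrix( rbinom(20 * 8, size = 1, prob = 0.5), nrow = 20,
             dimnames = list(NULL, paste0("key", 1:8)) )

pca <- prcomp( X )   # centers the columns, no scaling (the prcomp default)

# Proportion of total variance captured by each principal component
varExplained <- pca$sdev^2 / sum( pca$sdev^2 )

# Cumulative share captured by the first two components
round( cumsum(varExplained)[1:2], 3 )
```

Applied to the prcomp() result for the training data, a small cumulative share for PC1 and PC2 would suggest that the scatter plot shows only part of the structure in the feature space.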

Additional Feature Spaces

The package includes three additional feature representations of the same 400 drugs: 1) MACCScount, which captures the number of occurrences of each MACCS fingerprint (as opposed to just their presence/absence, as in MACCSbinary); 2) Morgan, which captures ECFP/Morgan representation; and 3) PChem, which consists of physico-chemical properties. All three matrices are in the same format as MACCSbinary, including having the Label column designate class assignment (Sensitive or Resistant) for training data and NA for test samples.

We can load and examine the data using the same functions as those outlined in the previous sections. Here are a couple of additional examples:

## Load all three datasets
data( MACCScount, Morgan, PChem )

## Identify the highest number of occurrences of MACCS keys
## List the top keys along with which drugs they occur in
MACCScount %>% gather( Key, nOccur, -Label, -Drug, -pubchem_id ) %>% group_by( Key ) %>%
    filter( nOccur == max(nOccur) ) %>% arrange( desc(nOccur) ) %>% ungroup %>% head

## Count the total number of occurrences of Morgan keys
## Display the most commonly-occurring keys
Morgan %>% gather( Key, nOccur, -Label, -Drug, -pubchem_id ) %>% group_by( Key ) %>%
    summarize( nTotal = sum(nOccur) ) %>% arrange( desc(nTotal) ) %>% head

## We may be interested in normalizing each physicochemical property
##   to have zero mean and standard deviation of one
PChemNorm <- PChem %>% mutate_at( vars(-Label, -Drug, -pubchem_id), ~(.x - mean(.x))/sd(.x) )
summary( PChemNorm$SlogP )
sd( PChemNorm$SlogP )
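
As a cross-check on the normalization above, the following sketch applies the same z-score formula to a toy data frame and compares the result with base R's scale(). The values are invented, and TPSA is a hypothetical column name used only for illustration.

```r
# Toy stand-in for two physicochemical property columns (hypothetical values)
props <- data.frame( SlogP = c(1.2, 3.4, -0.5, 2.1),
                     TPSA  = c(20.2, 75.3, 110.0, 45.6) )

# Manual z-score, mirroring the formula used above
manual <- as.data.frame( lapply(props, function(x) (x - mean(x)) / sd(x)) )

# Base R equivalent: scale() centers each column and divides by its standard deviation
scaled <- as.data.frame( scale(props) )

all.equal( manual, scaled, check.attributes = FALSE )
colMeans( manual )    # approximately zero for every column
sapply( manual, sd )  # one for every column
```

Note that either approach divides by zero for a column with no variation, so constant columns are best dropped beforehand.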


ArtemSokolov/ABCmonster documentation built on May 25, 2019, 7:16 a.m.