The package contains the training and test datasets described in the
paper. You can access them directly via the data() function after
loading the package:
library( ABCmonster )
data( MACCSbinary )
## The data is now available as MACCSbinary
Looking at the first few rows and columns gives a general idea of how the data is structured.
dim( MACCSbinary )
head( MACCSbinary[,1:6] )
The first column, Label, contains annotations for each drug
specifying whether the drug is more ("Sensitive") or less ("Resistant")
efficacious in the ABC-16 strain, relative to the parental strain.
Test data is located in the last 24 rows of MACCSbinary and contains NAs in the
Label column:
tail( MACCSbinary[,1:6] )
Drug identity is captured in the second (Drug) and third
(pubchem_id) columns. All remaining columns contain molecular
features, which in this case consist of binary calls on whether a
particular MACCS key is present in a drug's structure.
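A quick way to check the split between labeled and unlabeled rows is to tabulate Label with missing values counted as their own category. The sketch below uses a small made-up data frame with the same three leading columns; the drug names, IDs, and feature columns (key1, key2) are placeholders, not values from the package.

```r
## Toy stand-in for MACCSbinary: hypothetical drugs and feature columns
toy <- data.frame(
  Label      = c("Sensitive", "Resistant", "Sensitive", NA, NA),
  Drug       = paste0("drug", 1:5),
  pubchem_id = 101:105,
  key1       = c(1, 0, 1, 1, 0),
  key2       = c(0, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)

## Tabulate labels, counting NA (test rows) as its own category
labelCounts <- table( toy$Label, useNA = "ifany" )
labelCounts

## Split into training (labeled) and test (unlabeled) rows
isTest   <- is.na( toy$Label )
toyTrain <- toy[ !isTest, ]
toyTest  <- toy[ isTest, ]
```

With the package loaded, the analogous call would be table( MACCSbinary$Label, useNA = "ifany" ), which should report 24 missing labels if the test rows are laid out as described above.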
The package also contains SMILES strings for all 400 drugs considered in the paper. These strings are stored as a separate data frame and can also be retrieved using the data() function.
data( ABCSMILES )
head( ABCSMILES )
Note that we provide both the Isomeric SMILES strings, which preserve stereochemistry information, and the Canonical SMILES strings, where this information is stripped. Both sets of SMILES strings were obtained directly from PubChem, using their PUG REST API.
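For readers who want to look up SMILES strings for other compounds, the following is a minimal sketch of how a PUG REST request URL could be assembled. It is not the code used to build ABCSMILES: pugSmilesUrl is a hypothetical helper, and the property names IsomericSMILES and CanonicalSMILES are the ones PUG REST has historically accepted, so they should be checked against the current PubChem documentation. The block only builds the URL and does not contact PubChem.

```r
## Build a PUG REST URL requesting SMILES properties for a set of PubChem CIDs.
## pugSmilesUrl is a hypothetical helper, not part of ABCmonster.
pugSmilesUrl <- function( cids,
                          props = c("IsomericSMILES", "CanonicalSMILES") )
{
  base <- "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
  paste0( base, "/compound/cid/", paste( cids, collapse = "," ),
          "/property/", paste( props, collapse = "," ), "/CSV" )
}

## Example: aspirin (CID 2244) and caffeine (CID 2519)
u <- pugSmilesUrl( c(2244, 2519) )
u
```

The resulting URL could then be passed to read.csv(), which requires network access.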
Let's begin by plotting a heatmap that presents an overview of the entire dataset. We will make extensive use of tidy data manipulation. As the first step, we load the appropriate libraries and define a palette for subsequent plots.
library( tidyverse )
library( plotly )
ABCpal <- c("Resistant" = "tomato", "Sensitive" = "steelblue")
Isolate the training data by selecting the set of rows with non-NA
labels. HH will contain the matrix to be plotted as the heatmap,
while Annots will store labels that will be displayed as an additional
annotation bar at the top of the heatmap. The plotting is done using
the pheatmap package.
Xtrain <- MACCSbinary %>% filter( !is.na(Label) )
HH <- Xtrain %>% select(-Label, -Drug) %>%
    column_to_rownames("pubchem_id") %>% as.matrix
Annots <- select( Xtrain, pubchem_id, Label ) %>%
    column_to_rownames( "pubchem_id" )
pheatmap::pheatmap( t(HH), show_rownames=FALSE, show_colnames=FALSE,
                   annotation_col=Annots, annotation_names_col=FALSE,
                   legend=FALSE, color=c("gray90","gray30"),
                   annotation_colors=list(Label=ABCpal) )
The version of this figure in the paper includes additional post-processing consisting of leaf order optimization and grouping of rows.
Lastly, we can perform dimensionality reduction to explore the general trends in the data.
## Compute PCA
P <- Xtrain %>% select( -Label, -Drug, -pubchem_id ) %>% prcomp() %>%
    broom::augment( Xtrain ) %>% select( Label, PC1=.fittedPC1, PC2=.fittedPC2 )

## Plot the projection onto the first two principal components
gg <- ggplot( P, aes(x=PC1, y=PC2, color=Label) ) + geom_point() + theme_bw() +
    scale_color_manual( values=ABCpal )
ggplotly(gg)
## Variant of the plot above that keeps Drug and pubchem_id for hover tooltips
P <- Xtrain %>% select( -Label, -Drug, -pubchem_id ) %>% prcomp() %>%
    broom::augment( Xtrain ) %>% rename( PC1 = .fittedPC1, PC2 = .fittedPC2 )
gg <- ggplot( P, aes(x=PC1, y=PC2, color=Label, lbl1=Drug, lbl2=pubchem_id) ) +
    geom_point() + theme_bw() +
    scale_color_manual( values=ABCpal, guide="none" )
ggplotly(gg, tooltip=c("lbl1", "lbl2"), width=600) %>%
    add_annotations( text="Label", xref="paper", yref="paper",
                    x=1.02, xanchor="left",
                    y=0.8, yanchor="bottom",    # Same y as legend below
                    legendtitle=TRUE, showarrow=FALSE ) %>%
    layout( legend=list(y=0.8, yanchor="top" ) ) %>%
    htmltools::div( align="center" )
We observe that the unsupervised analysis is unable to distinguish between the class labels. In the next vignette, we will investigate this data in the context of supervised methods.
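When reading a two-component projection, it can help to know how much of the total variance those components capture. The sketch below shows one way to compute this from a prcomp() result. It runs on a randomly generated binary matrix (an assumption made only to keep the example self-contained), so the numbers say nothing about the actual dataset.

```r
## Simulated binary feature matrix: 30 hypothetical drugs, 8 hypothetical keys
set.seed(1)
X <- matrix( rbinom( 30 * 8, size = 1, prob = 0.5 ), nrow = 30,
             dimnames = list( NULL, paste0("key", 1:8) ) )

## PCA on the centered (unscaled) features, matching the prcomp() defaults
pca <- prcomp( X )

## Proportion of variance explained by each principal component
varExpl <- pca$sdev^2 / sum( pca$sdev^2 )

## Cumulative proportion, in order of decreasing variance
round( cumsum(varExpl), 3 )
```

On the real data, the same two lines applied to the prcomp() result for Xtrain would indicate how much variance PC1 and PC2 account for.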
The package includes three additional feature representations of the same
400 drugs: 1) MACCScount, which captures the number of occurrences
of each MACCS fingerprint (as opposed to just their presence/absence,
as in MACCSbinary); 2) Morgan, which captures the ECFP/Morgan
representation; and 3) PChem, which consists of physico-chemical
properties. All three matrices are in the same format as
MACCSbinary, including having the Label column designate class
assignment (Sensitive or Resistant) for training data and NA for
test samples.
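Since MACCScount records occurrences and MACCSbinary records presence or absence, one would expect that thresholding the counts at zero yields a binary matrix of the same shape. The sketch below illustrates the conversion on a made-up data frame; the drugs and keys are placeholders, and whether this reproduces MACCSbinary exactly is an assumption that is not verified here.

```r
## Toy stand-in for MACCScount: hypothetical drugs and key counts
counts <- data.frame(
  Label      = c("Sensitive", "Resistant", NA),
  Drug       = c("drugA", "drugB", "drugC"),
  pubchem_id = c(11, 22, 33),
  key1       = c(0, 2, 5),
  key2       = c(1, 0, 0),
  stringsAsFactors = FALSE
)

## Columns that carry annotations rather than features
meta     <- c("Label", "Drug", "pubchem_id")
featCols <- setdiff( colnames(counts), meta )

## Convert counts to presence/absence calls, leaving annotations untouched
binary <- counts
binary[featCols] <- lapply( counts[featCols], function(v) as.integer( v > 0 ) )
binary
```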
We can load and examine the data using the same functions as those outlined in the previous sections. Here are a couple of additional examples:
## Load all three datasets
data( MACCScount, Morgan, PChem )

## Identify the highest number of occurrences of MACCS keys
## List the top keys along with which drugs they occur in
MACCScount %>% gather( Key, nOccur, -Label, -Drug, -pubchem_id ) %>%
    group_by( Key ) %>% filter( nOccur == max(nOccur) ) %>%
    arrange( desc(nOccur) ) %>% ungroup %>% head

## Count the total number of occurrences of Morgan keys
## Display the most commonly-occurring keys
Morgan %>% gather( Key, nOccur, -Label, -Drug, -pubchem_id ) %>%
    group_by( Key ) %>% summarize( nTotal = sum(nOccur) ) %>%
    arrange( desc(nTotal) ) %>% head

## We may be interested in normalizing each physicochemical property
## to have zero mean and standard deviation of one
PChemNorm <- PChem %>%
    mutate_at( vars(-Label, -Drug, -pubchem_id), ~(.x - mean(.x))/sd(.x) )
summary( PChemNorm$SlogP )
sd( PChemNorm$SlogP )