knitr::opts_chunk$set(collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE, fig.align = 'center', results = 'asis', fig.show = 'hold', fig.width = 7, fig.height = 5)

Some quick examples using the NKI breast cancer data found in the breastCancerNKI package. We'll cover

  1. Simple feature selection by variance
  2. Structuring data for plotting by both MDS and as a heatmap
  3. Feature id conversion for further analysis

But first, what is the nki data? And what does it look like? Originally used as a validation dataset for the 70-gene MammaPrint signature this dataset has grown slightly in size but still is effectively the same collection of invasive breast cancer samples of no specific subtype. More information can be found in the bioconductor package's web page but, because there are issues with installing this on windows I've simply included the data here as part of transcripTools.

As a data object the nki data looks like this

library(transcripTools, quietly = TRUE)
library(Biobase, quietly = TRUE)
data(nki)

We have three slots of relevant information, the assayData, where the expression matrix is found, phenoData where the phenotypic data is found, and featureData where the feature data is found. We can see we have 337 samples and 24481 features to investigate. For the sake of this vignette we'll separate the nki expressionSet into individual pieces, though this is kind of against the fundamental principles of why an expressionSet is useful (which you can read up on here).

xpr <- exprs(nki) 
pheno <- pData(nki)
feats <- fData(nki)

1 Feature selection by variance

On the assumption that the genes that vary the most across our samples are the most biologically informative, we can choose the n most variable reduce the size/complexity of our subsequent investigation. To achieve this we use the mostVar() function, which will take our expression matrix (xpr) and the number of features we want to extract, and return us a subsetted matrix containing the n features we specified.

xpr_500 <- mostVar(xpr, 500)
dim(xpr_500)

 

2 Structuring data for plotting by both MDS and as a heatmap

In both instances we will run the function of choice and then merge the results with our phenotypic data. This will arrange the data for very simple but powerful plotting using ggplot2.

2.1 MDS

For more information on multi-dimensional scaling see here

mds_arg <- mdsArrange(xpr_500, isAlreadyScaled = TRUE) #the nki data is scaled 
                                        #around 0, not all data is like this
mds_mrg <- merge(mds_arg, pheno, by.x = "ids", by.y = "samplename")

head(mds_mrg)
mds_arg <- mdsArrange(xpr_500, isAlreadyScaled = TRUE) #the nki data is scaled 
                                        #around 0, not all data is like this
mds_mrg <- merge(mds_arg, pheno, by.x = "ids", by.y = "samplename")

knitr::kable(head(mds_mrg))

With our data ready to plot we simply call ggplot and specify our colours etc. using mds_mrg's column headings. Unsurprisingly, when we colour by er (oestrogen receptor status) we see that the 500 most variable genes capture and separate our samples by this factor.

library(ggplot2, quietly = TRUE)

ggplot(mds_mrg, aes(x = x, y = y, colour = as.factor(er))) + 
    geom_point() +
    labs(x = "PC1", y = "PC2", colour = "ER-status",
        title = "NKI MDS")

2.2 Heatmaps

This process is pretty much the same as for mdsArrange, though heatmapArrange has a couple of extra arguments that allow for hierarchical clustering and scaling. Again, here's a little info on heatmaps and hierarchical clustering

hm_arg <- heatmapArrange(xpr_500, 
                        cluster_row = TRUE,
                        cluster_col = TRUE,
                        scale = TRUE,
                        by_row = TRUE) 

heatmapArrange handily reminds us of the colour function to hand to ggplot to get a green-red (bad! not good for colour blindness!) heatmap.

hm_mrg <- merge(hm_arg, pheno, by.x = "col_var", by.y = "samplename")

head(hm_mrg)
hm_mrg <- merge(hm_arg, pheno, by.x = "col_var", by.y = "samplename")

knitr::kable(head(hm_mrg))
p_hmap <- ggplot(hm_mrg, aes(x = col_var, y = row_var, fill = value)) + 
    geom_tile() +
    scale_fill_gradient2(high = "#d73027", mid = "black", low = "#1a9850") + 
    labs(title = "NKI heatmap") +
    theme_void()

Heatmaps produced this can be combined with a colour bar using the cowplot package's plot_grid().

p_bar <- ggplot(hm_mrg, aes(x = col_var, y = "ER", fill = as.factor(er))) +
    geom_tile() + 
    theme_void()

# cowplot annoying overrides the default theme so we combine
# calling it with resetting the default theme
library(cowplot); theme_set(theme_grey()) 

plot_grid(p_hmap, p_bar, ncol = 1, rel_heights = c(15, 1), align = 'v')

 

3 Feature id conversion for further analysis

Finally, our features are currently in the refseq id format and perhaps we want HGNC official gene symbols instead. We can use two functions to achieve this, idReplace() that will convert and replace the ids in the expression matrix, or idConvert() that will simply convert a vector of ids. Here we'll use idConvert(), the the arguments and usage are highly similar between the two functions.

mv_10 <- row.names(xpr_500)[1:10]

idConvert(mv_10, format_in = "refseq_mrna", format_out = "hgnc_symbol")
mv_10 <- row.names(xpr_500)[1:10]

knitr::kable(idConvert(mv_10, format_in = "refseq_mrna", format_out = "hgnc_symbol"))

 



abc-igmm/transcripTools documentation built on May 20, 2019, 3:05 p.m.