In abc-igmm/transcripTools: Collection of functions for analysing gene expression data

knitr::opts_chunk$set(collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE, fig.align = 'center', results = 'asis', fig.show = 'hold', fig.width = 7, fig.height = 5)

Some quick examples using the NKI breast cancer data found in the `breastCancerNKI` package. We'll cover

Simple feature selection by variance
Structuring data for plotting by both MDS and as a heatmap
Feature id conversion for further analysis

But first, what is the nki data? And what does it look like? Originally used as a validation dataset for the 70-gene MammaPrint signature this dataset has grown slightly in size but still is effectively the same collection of invasive breast cancer samples of no specific subtype. More information can be found in the bioconductor package's web page but, because there are issues with installing this on windows I've simply included the data here as part of `transcripTools`.

As a data object the nki data looks like this

library(transcripTools, quietly = TRUE)
library(Biobase, quietly = TRUE)
data(nki)

We have three slots of relevant information, the `assayData`, where the expression matrix is found, `phenoData` where the phenotypic data is found, and `featureData` where the feature data is found. We can see we have 337 samples and 24481 features to investigate. For the sake of this vignette we'll separate the nki expressionSet into individual pieces, though this is kind of against the fundamental principles of why an expressionSet is useful (which you can read up on here).

xpr <- exprs(nki) 
pheno <- pData(nki)
feats <- fData(nki)

1 Feature selection by variance

On the assumption that the genes that vary the most across our samples are the most biologically informative, we can choose the `n` most variable reduce the size/complexity of our subsequent investigation. To achieve this we use the `mostVar()` function, which will take our expression matrix (`xpr`) and the number of features we want to extract, and return us a subsetted matrix containing the `n` features we specified.

xpr_500 <- mostVar(xpr, 500)
dim(xpr_500)

2 Structuring data for plotting by both MDS and as a heatmap

In both instances we will run the function of choice and then merge the results with our phenotypic data. This will arrange the data for very simple but powerful plotting using ggplot2.

2.1 MDS

For more information on multi-dimensional scaling see here

mds_arg <- mdsArrange(xpr_500, isAlreadyScaled = TRUE) #the nki data is scaled 
                                        #around 0, not all data is like this
mds_mrg <- merge(mds_arg, pheno, by.x = "ids", by.y = "samplename")

head(mds_mrg)

mds_arg <- mdsArrange(xpr_500, isAlreadyScaled = TRUE) #the nki data is scaled 
                                        #around 0, not all data is like this
mds_mrg <- merge(mds_arg, pheno, by.x = "ids", by.y = "samplename")

knitr::kable(head(mds_mrg))

With our data ready to plot we simply call `ggplot` and specify our colours etc. using `mds_mrg`'s column headings. Unsurprisingly, when we colour by `er` (oestrogen receptor status) we see that the 500 most variable genes capture and separate our samples by this factor.

library(ggplot2, quietly = TRUE)

ggplot(mds_mrg, aes(x = x, y = y, colour = as.factor(er))) + 
    geom_point() +
    labs(x = "PC1", y = "PC2", colour = "ER-status",
        title = "NKI MDS")

2.2 Heatmaps

This process is pretty much the same as for `mdsArrange`, though `heatmapArrange` has a couple of extra arguments that allow for hierarchical clustering and scaling. Again, here's a little info on heatmaps and hierarchical clustering

hm_arg <- heatmapArrange(xpr_500, 
                        cluster_row = TRUE,
                        cluster_col = TRUE,
                        scale = TRUE,
                        by_row = TRUE)

`heatmapArrange` handily reminds us of the colour function to hand to ggplot to get a green-red (bad! not good for colour blindness!) heatmap.

hm_mrg <- merge(hm_arg, pheno, by.x = "col_var", by.y = "samplename")

head(hm_mrg)

hm_mrg <- merge(hm_arg, pheno, by.x = "col_var", by.y = "samplename")

knitr::kable(head(hm_mrg))

p_hmap <- ggplot(hm_mrg, aes(x = col_var, y = row_var, fill = value)) + 
    geom_tile() +
    scale_fill_gradient2(high = "#d73027", mid = "black", low = "#1a9850") + 
    labs(title = "NKI heatmap") +
    theme_void()

Heatmaps produced this can be combined with a colour bar using the cowplot package's `plot_grid()`.

p_bar <- ggplot(hm_mrg, aes(x = col_var, y = "ER", fill = as.factor(er))) +
    geom_tile() + 
    theme_void()

# cowplot annoying overrides the default theme so we combine
# calling it with resetting the default theme
library(cowplot); theme_set(theme_grey()) 

plot_grid(p_hmap, p_bar, ncol = 1, rel_heights = c(15, 1), align = 'v')

3 Feature id conversion for further analysis

Finally, our features are currently in the refseq id format and perhaps we want HGNC official gene symbols instead. We can use two functions to achieve this, `idReplace()` that will convert and replace the ids in the expression matrix, or `idConvert()` that will simply convert a vector of ids. Here we'll use `idConvert()`, the the arguments and usage are highly similar between the two functions.

mv_10 <- row.names(xpr_500)[1:10]

idConvert(mv_10, format_in = "refseq_mrna", format_out = "hgnc_symbol")

mv_10 <- row.names(xpr_500)[1:10]

knitr::kable(idConvert(mv_10, format_in = "refseq_mrna", format_out = "hgnc_symbol"))

abc-igmm/transcripTools documentation built on May 20, 2019, 3:05 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

abc-igmm/transcripTools
Collection of functions for analysing gene expression data

In abc-igmm/transcripTools: Collection of functions for analysing gene expression data

Some quick examples using the NKI breast cancer data found in the `breastCancerNKI` package. We'll cover

As a data object the nki data looks like this

1 Feature selection by variance

2 Structuring data for plotting by both MDS and as a heatmap

In both instances we will run the function of choice and then merge the results with our phenotypic data. This will arrange the data for very simple but powerful plotting using ggplot2.

2.1 MDS

For more information on multi-dimensional scaling see here

With our data ready to plot we simply call `ggplot` and specify our colours etc. using `mds_mrg`'s column headings. Unsurprisingly, when we colour by `er` (oestrogen receptor status) we see that the 500 most variable genes capture and separate our samples by this factor.

2.2 Heatmaps

This process is pretty much the same as for `mdsArrange`, though `heatmapArrange` has a couple of extra arguments that allow for hierarchical clustering and scaling. Again, here's a little info on heatmaps and hierarchical clustering

`heatmapArrange` handily reminds us of the colour function to hand to ggplot to get a green-red (bad! not good for colour blindness!) heatmap.

Heatmaps produced this can be combined with a colour bar using the cowplot package's `plot_grid()`.

3 Feature id conversion for further analysis

R Package Documentation

Browse R Packages

We want your feedback!

abc-igmm/transcripTools Collection of functions for analysing gene expression data

In abc-igmm/transcripTools: Collection of functions for analysing gene expression data

Some quick examples using the NKI breast cancer data found in the breastCancerNKI package. We'll cover

As a data object the nki data looks like this

1 Feature selection by variance

2 Structuring data for plotting by both MDS and as a heatmap

In both instances we will run the function of choice and then merge the results with our phenotypic data. This will arrange the data for very simple but powerful plotting using ggplot2.

2.1 MDS

For more information on multi-dimensional scaling see here

With our data ready to plot we simply call ggplot and specify our colours etc. using mds_mrg's column headings. Unsurprisingly, when we colour by er (oestrogen receptor status) we see that the 500 most variable genes capture and separate our samples by this factor.

2.2 Heatmaps

This process is pretty much the same as for mdsArrange, though heatmapArrange has a couple of extra arguments that allow for hierarchical clustering and scaling. Again, here's a little info on heatmaps and hierarchical clustering

heatmapArrange handily reminds us of the colour function to hand to ggplot to get a green-red (bad! not good for colour blindness!) heatmap.

Heatmaps produced this can be combined with a colour bar using the cowplot package's plot_grid().

3 Feature id conversion for further analysis

R Package Documentation

Browse R Packages

We want your feedback!

abc-igmm/transcripTools
Collection of functions for analysing gene expression data

Some quick examples using the NKI breast cancer data found in the `breastCancerNKI` package. We'll cover

With our data ready to plot we simply call `ggplot` and specify our colours etc. using `mds_mrg`'s column headings. Unsurprisingly, when we colour by `er` (oestrogen receptor status) we see that the 500 most variable genes capture and separate our samples by this factor.

This process is pretty much the same as for `mdsArrange`, though `heatmapArrange` has a couple of extra arguments that allow for hierarchical clustering and scaling. Again, here's a little info on heatmaps and hierarchical clustering

`heatmapArrange` handily reminds us of the colour function to hand to ggplot to get a green-red (bad! not good for colour blindness!) heatmap.

Heatmaps produced this can be combined with a colour bar using the cowplot package's `plot_grid()`.