library(knitr) knitr::opts_chunk$set( # cache = TRUE, dpi = 60, comment = '#>', tidy = FALSE)
Author: Alan O'Callaghan (alan.b.ocallaghan@gmail.com)
This vignette is intended to provide an overview of the use of
heatmaply
in the analysis of biological data. In particular,
this vignette deals with its use to visualize gene expression patterns and sample relationships in RNAseq data relating to breast cancer samples, as retrieved from from Genomic Data Commons.
Due to the size of the objects, this file is seperated to three fils. You can view this series in the following links:
This is file 1 in the series.
# Let's load the packages library(heatmaply) library(heatmaplyExamples)
The data used in this vignette is available within the package.
In order to recreate the output you will need to manually run the code available here:
## This script downloads the expression data for all breast cancer samples ## from GDC/TCGA, and filters them to have only the genes in the ## PAM50 gene set library('TCGAbiolinks') library('SummarizedExperiment') library('biomaRt') library('voom') query <- GDCquery(project = 'TCGA-BRCA', data.category = 'Transcriptome Profiling', data.type = 'Gene Expression Quantification', workflow.type = 'HTSeq - Counts' ) GDCdownload(query) data <- GDCprepare(query) ## Retain only tumour samples ind_tumor <- colData(data)[['definition']]== 'Primary solid Tumor' data <- data[, ind_tumor] ## Annotate using biomaRt mart <- useDataset('hsapiens_gene_ensembl', useMart('ensembl')) genes <- rownames(data) symbols <- getBM(filters= 'ensembl_gene_id', attributes= c('ensembl_gene_id','hgnc_symbol'), values = genes, mart= mart) ## Remove those not annotated ind_nchar <- as.logical(nchar(symbols[['hgnc_symbol']])) symbols <- symbols[ind_nchar, ] data <- data[symbols[['ensembl_gene_id']], ] rownames(data) <- symbols[['hgnc_symbol']] ## Subset to those annotated with a symbol here pam50_genes <- intersect(pam50_genes, symbols[['hgnc_symbol']]) clinical_cols <- c('subtype_Integrated.Clusters..with.PAM50.', 'subtype_ER.Status', 'subtype_PR.Status', 'subtype_HER2.Final.Status' ) ## Only interested in those which have all subtypes. subtypes <- colData(data)[, clinical_cols] ind_has_subtypes <- sapply(seq_len(nrow(subtypes)), function(i) { all(!is.na(subtypes[i, ])) }) data <- data[, ind_has_subtypes] tcga_brca_clinical <- colData(data) tcga_brca_clinical <- tcga_brca_clinical[, clinical_cols] colnames(tcga_brca_clinical) <- gsub('subtype_', '', colnames(tcga_brca_clinical)) stypes <- c('ER.Status', 'PR.Status', 'HER2.Final.Status') tcga_brca_clinical[, stypes] <- lapply(tcga_brca_clinical[, stypes], function(col) { col <- as.character(col) ifelse(col %in% c('Positive', 'Negative'), col, NA) } ) ## Set up final objects tcga_brca_clinical <- as.data.frame(tcga_brca_clinical) expression <- assay(data, 'HTSeq - Counts') expression <- expression[!duplicated(rownames(expression)), ] voomed_expression <- as.matrix(voom(expression)) raw_expression <- expression
The R package limma is a commonly used statistical analysis tool using
empirical Bayes methods. This package was originally developed for micro-array data.
voom is a function within this package which transforms RNAseq count data in
log2 counts per million reads, which allows it to be treated similarly.
This function was used in the present example to provide normalized expression
values. While pre-processing the data, quantile and TMM normalization
were also applied. To compare these normalized values to raw read counts,
the raw read counts were log~2~-transformed. 0.5 was
added to prevent infinite values resulting from log2(0)
.
A common workflow for RNAseq differential expression analysis is to visualize gene expression measures and sample-sample correlations using a heatmap. This is useful to observe whether samples appear to cluster together, for the purposes of identifying poor quality samples or outliers. Furthermore, it may be useful to visualize differentially-expressed genes alongside row annotations which indicate patient or sample sub-group.
The following examples show sample-sample correlation based on all genes measured. Clustering in both instances appears to be related to membrane receptor subtypes, as shown in the row annotation.
When using the heatmaply
function, notice the use of the showticklabels
argument - by turning the labels off, the resulting heatmap is much lighter and allows fast zoom-in. The actual values can still be identified when hovering the mouse over the cells. Also, notice the use of the default color scheme chosen so to emphasize the correlation (which has low variability).
cor_mat_raw_logged <- cor(log2(raw_expression + 0.5)) heatmaply(cor_mat_raw_logged, row_side_colors = tcga_brca_clinical, main = 'Sample-sample correlation, log2 counts', showticklabels = c(FALSE, FALSE), plot_method = 'plotly')
The following gives very similar results:
cor_mat_voomed <- cor(voomed_expression)
heatmaply(cor_mat_voomed, row_side_colors = tcga_brca_clinical, main = 'Sample-sample correlation, log2 CPM', showticklabels = c(FALSE, FALSE), plot_method = 'plotly')
This is since the correlation of the values in the matrix is 1 (while the objects are not identical):
identical(as.vector(cor_mat_voomed) ,as.vector(cor_mat_raw_logged)) cor(as.vector(cor_mat_voomed) ,as.vector(cor_mat_raw_logged))
It may be useful when examining expression
heatmaps to identify particularly high or low measures for a single
gene in a group of patients, or a gene which shows unusually high or low variance.
The mouse-over text available in the heatmaply
package allows visual assessment of measures of interest and quick identification
of samples or genes with unusual gene expression patterns.
Similarly, visualizing correlation heatmaps with heatmaply
allows the user to
rapidly identify samples with unusually high or low pairwise correlation.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.