In talgalili/heatmaplyExamples: 'Heatmaply' Examples

library(knitr)
knitr::opts_chunk$set(
   # cache = TRUE,
   dpi = 60,
  comment = '#>',
  tidy = FALSE)

Author: Alan O'Callaghan (alan.b.ocallaghan@gmail.com)

Introduction

This vignette is intended to provide an overview of the use of heatmaply in the analysis of biological data. In particular, this vignette deals with its use to visualize gene expression patterns and sample relationships in RNAseq data relating to breast cancer samples, as retrieved from from Genomic Data Commons.

Due to the size of the objects, this file is seperated to three fils. You can view this series in the following links:

This is file 1 in the series.

# Let's load the packages
library(heatmaply)
library(heatmaplyExamples)

Data retrieval and pre-processing

The data used in this vignette is available within the package.

In order to recreate the output you will need to manually run the code available here:

## This script downloads the expression data for all breast cancer samples
## from GDC/TCGA, and filters them to have only the genes in the 
## PAM50 gene set

library('TCGAbiolinks')
library('SummarizedExperiment')
library('biomaRt')
library('voom')


query <- GDCquery(project = 'TCGA-BRCA',
    data.category = 'Transcriptome Profiling',
    data.type = 'Gene Expression Quantification',
    workflow.type = 'HTSeq - Counts'
)

GDCdownload(query)
data <- GDCprepare(query)

## Retain only tumour samples
ind_tumor <- colData(data)[['definition']]== 'Primary solid Tumor'
data <- data[, ind_tumor]

## Annotate using biomaRt
mart <- useDataset('hsapiens_gene_ensembl', useMart('ensembl'))
genes <- rownames(data)
symbols <- getBM(filters= 'ensembl_gene_id', 
    attributes= c('ensembl_gene_id','hgnc_symbol'), 
    values = genes, 
    mart= mart)

## Remove those not annotated
ind_nchar <- as.logical(nchar(symbols[['hgnc_symbol']]))
symbols <- symbols[ind_nchar, ]
data <- data[symbols[['ensembl_gene_id']], ]
rownames(data) <- symbols[['hgnc_symbol']]

## Subset to those annotated with a symbol here
pam50_genes <- intersect(pam50_genes, symbols[['hgnc_symbol']])



clinical_cols <- c('subtype_Integrated.Clusters..with.PAM50.',
    'subtype_ER.Status',
    'subtype_PR.Status',
    'subtype_HER2.Final.Status'
)

## Only interested in those which have all subtypes.
subtypes <- colData(data)[, clinical_cols]
ind_has_subtypes <- sapply(seq_len(nrow(subtypes)), 
    function(i) {
        all(!is.na(subtypes[i, ]))
    })
data <- data[, ind_has_subtypes]


tcga_brca_clinical <- colData(data)
tcga_brca_clinical <- tcga_brca_clinical[, clinical_cols]
colnames(tcga_brca_clinical) <- gsub('subtype_', '', colnames(tcga_brca_clinical))

stypes <- c('ER.Status', 'PR.Status', 'HER2.Final.Status')

tcga_brca_clinical[, stypes] <- lapply(tcga_brca_clinical[, stypes], 
    function(col) {
        col <- as.character(col)
        ifelse(col %in% c('Positive', 'Negative'), col, NA)
    }
)

## Set up final objects
tcga_brca_clinical <- as.data.frame(tcga_brca_clinical)
expression <- assay(data, 'HTSeq - Counts')
expression <- expression[!duplicated(rownames(expression)), ]

voomed_expression <- as.matrix(voom(expression))
raw_expression <- expression

Visualization of raw and voom-transformed data (all genes)

The R package limma is a commonly used statistical analysis tool using empirical Bayes methods. This package was originally developed for micro-array data. voom is a function within this package which transforms RNAseq count data in log2 counts per million reads, which allows it to be treated similarly. This function was used in the present example to provide normalized expression values. While pre-processing the data, quantile and TMM normalization were also applied. To compare these normalized values to raw read counts, the raw read counts were log~2~-transformed. 0.5 was added to prevent infinite values resulting from log2(0).

A common workflow for RNAseq differential expression analysis is to visualize gene expression measures and sample-sample correlations using a heatmap. This is useful to observe whether samples appear to cluster together, for the purposes of identifying poor quality samples or outliers. Furthermore, it may be useful to visualize differentially-expressed genes alongside row annotations which indicate patient or sample sub-group.

The following examples show sample-sample correlation based on all genes measured. Clustering in both instances appears to be related to membrane receptor subtypes, as shown in the row annotation.

When using the heatmaply function, notice the use of the showticklabels argument - by turning the labels off, the resulting heatmap is much lighter and allows fast zoom-in. The actual values can still be identified when hovering the mouse over the cells. Also, notice the use of the default color scheme chosen so to emphasize the correlation (which has low variability).

cor_mat_raw_logged <- cor(log2(raw_expression + 0.5))

heatmaply(cor_mat_raw_logged, 
    row_side_colors = tcga_brca_clinical,
    main = 'Sample-sample correlation, log2 counts',
    showticklabels = c(FALSE, FALSE),
    plot_method = 'plotly')

The following gives very similar results:

cor_mat_voomed <- cor(voomed_expression)

heatmaply(cor_mat_voomed, 
    row_side_colors = tcga_brca_clinical,
    main = 'Sample-sample correlation, log2 CPM',
    showticklabels = c(FALSE, FALSE),
    plot_method = 'plotly')

This is since the correlation of the values in the matrix is 1 (while the objects are not identical):

identical(as.vector(cor_mat_voomed) ,as.vector(cor_mat_raw_logged))
cor(as.vector(cor_mat_voomed) ,as.vector(cor_mat_raw_logged))

Discussion

It may be useful when examining expression heatmaps to identify particularly high or low measures for a single gene in a group of patients, or a gene which shows unusually high or low variance. The mouse-over text available in the heatmaply package allows visual assessment of measures of interest and quick identification of samples or genes with unusual gene expression patterns. Similarly, visualizing correlation heatmaps with heatmaply allows the user to rapidly identify samples with unusually high or low pairwise correlation.