This document explains the functionalities available in the esetVis package.
This package contains wrapper functions for three types of visualization: spectral map[@spm], tsne[@tsne] and linear discriminant analysis[@ldaFisher] for data contained in a expressionSet Bioconductor (or an SummarizedExperiment) object. The visualizations are available in two types: static, via the ggplot2 package or interactive, via the ggvis or rbokeh packages.
library(knitr) opts_chunk$set(echo = TRUE, results = 'asis', warning = FALSE, error = FALSE, message = FALSE, cache = FALSE, fig.width = 8, fig.height = 7, #out.width = '0.7\\textwidth', fig.align = 'center')#, out.height = 0.5\\textwidth', fig.path = 'graphs/') options(width = 170)#, stringsAsFactors = FALSE options(warn = 1)#instead of warn = 0 by default -> to have place where warnings occur in the call to Sweave function
library(esetVis) # library(xtable) library(pander)
The ALL dataset contains microarray results from 128 patients with
acute lymphoblastic leukemia (ALL). The data is contained in a
Bioconductor ExpressionSet object. Extra gene annotation is
added to the object, via the annotation package hgu95av2.db.
library(ALL) data(ALL) # to get gene annotation from probe IDs (from the paper HGU95aV2 gene chip was used) library("hgu95av2.db") library("AnnotationDbi") probeIDs <- featureNames(ALL) geneInfo <- AnnotationDbi::select(hgu95av2.db, probeIDs, c("ENTREZID", "SYMBOL", "GENENAME"), "PROBEID") # 482 on the 12625 probe IDs don't have ENTREZ ID/SYMBOL/GENENAME # remove genes with duplicated annotation: 1214 geneInfoWthtDuplicates <- geneInfo[!duplicated(geneInfo$PROBEID), ] # remove genes without annotation: 482 genesWthtAnnotation <- rowSums(is.na(geneInfoWthtDuplicates)) > 0 geneInfoWthtDuplicatesAndWithAnnotation <- geneInfoWthtDuplicates[!genesWthtAnnotation, ] probeIDsWithAnnotation <- featureNames(ALL)[featureNames(ALL) %in% geneInfoWthtDuplicatesAndWithAnnotation$PROBEID] ALL <- ALL[probeIDsWithAnnotation, ] fData <- geneInfoWthtDuplicatesAndWithAnnotation[ match(probeIDsWithAnnotation, geneInfoWthtDuplicatesAndWithAnnotation$PROBEID), ] rownames(fData) <- probeIDsWithAnnotation fData(ALL) <- fData # grouping variable: B = B-cell, T = T-cell groupingVariable <- pData(ALL)$BT # create custom palette colorPalette <- c("dodgerblue", colorRampPalette(c("white","dodgerblue2", "darkblue"))(5)[-1], "red", colorRampPalette(c("white", "red3", "darkred"))(5)[-1]) names(colorPalette) <- levels(groupingVariable) color <- groupingVariable; levels(color) <- colorPalette # reformat type of the remission remissionType <- ifelse(is.na(ALL$remission), "unknown", as.character(ALL$remission)) ALL$remissionType <- factor(remissionType, levels = c("unknown", "CR", "REF"), labels = c("unknown", "achieved", "refractory"))
fvarMetadata(ALL)$labelDescription <- paste(rownames(fvarMetadata(ALL)), "obtained from the Bioconductor hgu95av2.db package")
rm(list = c("fData", ls(pattern = "^(geneInfo|probeIDs)")));tmp <- gc(verbose = FALSE) #for testing: sort(sapply(ls(), function(x) object.size(get(x))))
Following tables detail some sample and gene annotation contained in the ALL
ExpressionSet used in the vignette.
varLabelsUsed <- c("cod", "sex", "age", "BT", "remissionType") pander(head(pData(ALL)[, varLabelsUsed]), caption = "subset of the **phenoData** of the ALL dataset for the first genes") pander(head(fData(ALL)), caption = "**featureData** of the ALL dataset for the first genes")
The functions of the package also supports object
of class:
SummarizedExperiment.
Note: In this case, the functions fData, pData, exprs should be replaced
by their corresponding functions rowData, colData and assay.
esetSpectralMapThe function esetSpectralMap creates a spectral
map[@spm] for the dataset. Some default parameters are set, e.g.
to print the top 10 samples and top 10 genes, to display the first two
dimensions of the analysis...
The resulting biplot contains two components:
Here is an example for the ALL dataset, with the default parameters.
print(esetSpectralMap(eset = ALL))
Several annotation variables are available in the eSet.
Four different aesthetics [aes] can be used to display these variables:
coloralphasizeshapeFor each of this aesthetic [aes], several parameters are available:
[aes]Var: name of the column of the phenoData of the
eSet used for the aesthetic, i.e. colorVar[aes]: palette/specified shape/size used for the aesthetic, i.e.
colorCustom size and the transparency (variables sizeVar and alphaVar) can be
specified:
sizeRange and
alphaRangesize
and alpha argumentsIn the example, the type and stage of the disease (variable
BT) is used for coloring, the remission type for the transparency, the
sex for the shape and the age for the size of the points. A custom color
palette is specified via the color argument.
print(esetSpectralMap(eset = ALL, title = "Acute lymphoblastic leukemia dataset \n Spectral map \n Sample annotation (1)", colorVar = "BT", color = colorPalette, shapeVar = "sex", sizeVar = "age", sizeRange = c(0.1, 6), alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9), topSamples = 0, topGenes = 0, cloudGenes = FALSE))
Just for the demonstration, another visualization of the same dataset,
using this time a continuous variable: age for coloring and transparency, a
factor for the size and BT for the shape.
Because the output is a ggplot object, any additional customization
not implemented via specific parameters, the plot can be modified
with additional ggplot2 functions, e.g. with the color of the gradient.
gg <- esetSpectralMap(eset = ALL, title = "Acute lymphoblastic leukemia dataset \n Spectral map \n Sample annotation (2)", colorVar = "age", shapeVar = "BT", shape = 15+1:nlevels(ALL$BT), sizeVar = "remissionType", size = c(2, 4, 6), alphaVar = "age", alphaRange = c(0.2, 0.8), topSamples = 0, topGenes = 0, cloudGenes = TRUE) # change color of gradient library(ggplot2) gg <- gg + scale_color_gradientn(colours = colorRampPalette(c("darkgreen", "gold"))(2) ) print(gg)
Several parameters related to the gene visualization are available:
psidscloudGenescloudGenesNBinscloudGenesColorcloudGenesIncludeLegendcloudGenesTitleLegendThe spectral map is done only on the subset of the genes, with the number of bins, color, and legend title modified.
print(esetSpectralMap(eset = ALL, psids = 1:1000, title = "Acute lymphoblastic leukemia dataset \n Spectral map \n Custom cloud genes", topSamples = 0, topGenes = 0, cloudGenes = TRUE, cloudGenesColor = "red", cloudGenesNBins = 50, cloudGenesIncludeLegend = TRUE, cloudGenesTitleLegend = "number of features"))
Three different kind of elements can be labelled in the plot: genes, samples and gene sets.
For each [element], several parameters are available:
top[Elements]: number of elements to labeltop[Elements]Var (not available for gene sets): column of the
corresponding element in the eSet used for labelling, in
phenoData for sample and featureData for gene. If not
specified, the feature/sample names of the eSet are usedtop[Elements]Just: label justificationtop[Elements]Cex: label sizetop[Elements]Color: label colorThe distance (sum of squared coordinates) of the gene/sample/gene set to the origin of the plot is used to rank the elements, and to extract the top 'outlying' ones.
By default (and if installed), the package
ggrepel
is used for text labelling (as in this vignette), to avoid overlapping labels.
The text labels can also be displayed with the
ggplot2
by setting the parameter packageTextLabel to ggplot2 (default
ggrepel).
In the example, the top genes are labelled with gene symbols (SYMBOL column of the phenoData of the eSet), and the top samples with patient identifiers (cod column of the phenoData of the eSet).
print(esetSpectralMap(eset = ALL, title = paste("Acute lymphoblastic leukemia dataset \n", "Spectral map \n Label outlying samples and genes"), colorVar = "BT", color = colorPalette, shapeVar = "sex", sizeVar = "age", sizeRange = c(2, 6), alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9), topGenes = 10, topGenesVar = "SYMBOL", topGenesCex = 2, topGenesColor = "darkgrey", topSamples = 15, topSamplesVar = "cod", topSamplesColor = "chocolate4", topSamplesCex = 3 ))
Genes can be grouped into biologically meaningful gene sets, which can be labelled in the biplot.
Compared to previous section, two additional parameters are available:
geneSets (required): list of
gene sets. Each element in the list should contain genes identifiers, and the
list should be namedgeneSetsVar: column of the featureData of the eSet used to map the gene
identifiers contained in the geneSets objectgeneSetsMaxNChar: number of characters used in gene sets labelsThe geneSets can be created with the getGeneSetsForPlot
function (wrapper around the getGeneSets function of the
MLP
package), which can extract gene sets available in the Gene Ontology (Biological
Process, Molecular Function and Cellular Component) and KEGG databases. Custom
gene set lists can also be provided.
In the following example, only the pathways from the GOBP database are used.
geneSets <- getGeneSetsForPlot( entrezIdentifiers = fData(ALL)$ENTREZID, species = "Human", geneSetSource = 'GOBP', useDescription = TRUE)
Then this list of gene sets is provided to the esetSpectralMap
function.
print(esetSpectralMap(eset = ALL, title = "Acute lymphoblastic leukemia dataset \n Spectral map \n Gene set annotation", colorVar = "BT", color = colorPalette, shapeVar = "sex", sizeVar = "age", sizeRange = c(2, 6), alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9), topGenes = 0, topGeneSets = 4, geneSets = geneSets, geneSetsVar = "ENTREZID", geneSetsMaxNChar = 30, topGeneSetsCex = 2, topGeneSetsColor = "purple"))
Note: because of the inherent hierarchical structure of the Gene Ontology database, sets of genes can be very similar, which can result in close (even overlapping) labels in the visualization.
rm(geneSets);tmp <- gc(verbose = FALSE)
In all previous plots, only the first dimensions of the principal component
analysis were visualized, this can be specified via the dim parameter.
The third and fourth dimensions are visualized in the next Figure. This
parameter is only available for the spectral map visualization.
print(esetSpectralMap(eset = ALL, title = "Acute lymphoblastic leukemia dataset \n Spectral map \n Dimensions of the PCA", colorVar = "BT", color = colorPalette, shapeVar = "sex", sizeVar = "age", sizeRange = c(0.5, 4), alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9), dim = c(3, 4)))
The function uses the mpm and plot.mpm function from the
mpm package. Some default parameters are set for these two functions,
they can be changed via the mpm.args and plot.mpm.args
arguments. For further details, refer to the documentation of these two
functions.
Note: One important argument is logtrans in mpm
function, which indicates if the data should be first log-transformed
before the computation. It is set by default to FALSE, assuming that the
data are already in the log scale (it is the case for the ALL
dataset).
All plots available in the esetVis package can be static or
interactive.
The argument typePlot can be set respectively to
static (by default), in this case ggplot2 is used, or
interactive, in this case either the ggvis or
rbokeh package is used.
Two functionalities of the interactive plot can be of interest for such high-dimensional data:
phenoData) displayed in the hoover can be specified via the
interactiveTooltipExtraVars parameter.The same spectral map than in previous section is used, this time in an interactive plot.
# wrapper for rbokeh/ggvis example esetSPMInteractive <- function(...) esetSpectralMap(eset = ALL, title = paste("Acute lymphoblastic leukemia dataset - spectral map"), colorVar = "BT", color = unname(colorPalette), shapeVar = "sex", alphaVar = "remissionType", typePlot = "interactive", ... )
esetSPMInteractive( packageInteractivity = "rbokeh", figInteractiveSize = c(700, 600), size = 6, # use all phenoData variables for hoover interactiveTooltipExtraVars = varLabels(ALL) )
Note: as ggvis plot requires to have a R session running, only a static
version of the plot is included.
# embed a static version of the plot library(ggvis) esetSPMInteractive( packageInteractivity = "ggvis", sizeVar = "age", sizeRange = c(2, 6), figInteractiveSize = c(700, 600) )
esetTsneAnother unsupervised visualization is available in the package:
t-Distributed Stochastic Neighbor Embedding (tsne). The
function esetTnse uses the Rtsne function of the same
package.
Most of the previous parameters discussed for the spectral map are available for this visualization, at the exception of:
Rtsne functiondimArguments to the Rtsne function can be specified via the
Rtsne.args argument.
Here is an example of tsne, for the same dataset and same annotation/labelling.
print(esetTsne(eset = ALL, title = "Acute lymphoblastic leukemia dataset \n Tsne", colorVar = "BT", color = colorPalette, shapeVar = "sex", sizeVar = "age", sizeRange = c(2, 6), alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9), topSamplesVar = "cod" ))
The tsne can be quite time-consuming, especially for big data. As the
Rtsne function used in the background can also uses an object of class
dist, the data can be pre-transformed before running the tsne
analysis. The argument fctTransformDataForInputTsne enables to specify
a custom function to pre-transform the data.
print(esetTsne(eset = ALL, title = "Acute lymphoblastic leukemia dataset \n Tsne", colorVar = "BT", color = colorPalette, shapeVar = "sex", sizeVar = "age", sizeRange = c(2, 6), alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9), topSamplesVar = "cod", fctTransformDataForInputTsne = function(mat) as.dist((1 - cor(mat))/2) ))
esetLdaAnother visualization, this time semi-supervised (as a variable is used for the
computation), is included: Linear Discriminant Analysis. This uses the
lda function from the MASS package.
This function maximizes the variance between levels of a factor, here describing
some sample annotation, specified via the ldaVar argument.
As this analysis can be quite time consuming, for the demonstration, the analysis is run only a random feature subset of the data.
The returnAnalysis parameter can be used, to extract the
analysis which will be used as input for the esetPlotWrapper function,
used in the background. It is available also for the esetSpectralMap
and esetTnse functions.
# extract random features, because analysis is quite time consuming retainedFeatures <- sample(featureNames(ALL), size = floor(nrow(ALL)/5)) # run the analysis outputEsetLda <- esetLda(eset = ALL[retainedFeatures, ], ldaVar = "BT", title = paste("Acute lymphoblastic leukemia dataset \n", "Linear discriminant analysis on BT variable \n", "(Subset of the feature data)"), colorVar = "BT", color = colorPalette, shapeVar = "sex", sizeVar = "age", sizeRange = c(2, 6), alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9), topSamplesVar = "cod", topGenesVar = "SYMBOL", returnAnalysis = TRUE ) # extract and print the ggplot object print(outputEsetLda$plot)
The top elements (here genes and samples) labelled in the plot can be accessed
via the topElements slot of the object. These are labelled with the
identifiers used in the plot and named with sample/gene identifiers of the
eset.
# extract top elements labelled in the plot pander(t(data.frame(topGenes = outputEsetLda$topElements[["topGenes"]]))) pander(t(data.frame(topSamples = outputEsetLda$topElements[["topSamples"]])))
When returnAnalysis is set to TRUE, the output of the function can be
used as input to the esetPlotWrapper function, and extra parameters can
then be modified.
Here the variable used for the shape of the points for the samples is changed to type of remission (remissionType column).
# to change some plot parameters esetPlotWrapper( dataPlotSamples = outputEsetLda$analysis$dataPlotSamples, dataPlotGenes = outputEsetLda$analysis$dataPlotGenes, esetUsed = outputEsetLda$analysis$esetUsed, title = paste("Acute lymphoblastic leukemia dataset \n", "Linear discriminant analysis on BT variable (2) \n", "(Subset of the feature data)"), colorVar = "BT", color = colorPalette, shapeVar = "remissionType", sizeVar = "age", sizeRange = c(2, 6), alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9), topSamplesVar = "cod", topGenesVar = "SYMBOL" )
rm(outputEsetLda);tmp <- gc(verbose = FALSE)
From the previous visualization (obtained on a subset of the genes), the biggest difference between all levels of the type/stage of the disease seems to reside between all B-cells (tagged B) and T-cells (tagged T). It might be interesting to focus on a subset of the data, e.g. only one cell type.
# keep only 'B-cell' samples ALLBCell <- ALL[, grep("^B", ALL$BT)] ALLBCell$BT <- factor(ALLBCell$BT) colorPaletteBCell <- colorPalette[1:nlevels(ALLBCell$BT )] print(esetLda(eset = ALLBCell[retainedFeatures, ], ldaVar = "BT", title = paste("Acute lymphoblastic leukemia dataset \n", "Linear discriminant analysis on BT variable \n B-cell only", "(Subset of the feature data)" ), colorVar = "BT", color = colorPaletteBCell, shapeVar = "sex", sizeVar = "age", sizeRange = c(2, 6), alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9), topSamplesVar = "cod", topGenesVar = "SYMBOL", ))
rm(ALLBCell);tmp <- gc(verbose = FALSE)
The subcell type which seems to differ the most within all B-cell type is the first one: B1 (most of these samples at the right side of the plot).
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.