In ECSchmitt/PPClustA: Shiny APP to visualise gene expression measured by Golub et al

PPClustA

The PPClustA package contains a Shiny app to explore and visualize gene expression data of Acute Lymphoblastic Leukemia (ALL) and Acute Myeloid Leukemia (AML) patients from Golub et al. by applying Principal Component Analysis and Hierarchical Clustering.

Required Packages

Installation

You can install the current version of PPClustA from GitHub using the install_github() function from devtools:

if(!requireNamespace("devtools", quietly = TRUE))
  install.packages("devtools")

devtools::install_github("ECSchmitt/PPClustA")

Some Theory behind the Package

Looking at expression data of many genes across many individuals, we are confronted with a vast amount of information. Such data is multidimensional and therefore needs to be simplified, without losing important information, before we can comprehend it. In the case of the Golub data, the authors were interested in finding and distinguishing cancer sub-types. But how can we use expression data to define cancer sub-types?

Measuring Distances

To tackle this problem we need a measure of similarity between expression vectors, so that similar vectors can be grouped together. The distance between two vectors is such a measure: the smaller the distance, the closer the two feature vectors lie in vector space. A few examples of such measures are the Euclidean, the Manhattan and the maximum distance.

To calculate a distance matrix you need an input that R's dist() function can handle. Such an input is either a two-dimensional matrix or a vector, and both must be numeric. A random 6*6 numeric matrix can be produced as follows:

#producing a 6*6 matrix with random values
matrix <- matrix(rnorm(36), nrow = 6)

matrix

To obtain the distances between the matrix rows (in case of a matrix input) or the vector members (in case of a vector input), apply R's dist() function. It can be called with different distance measures such as "euclidean", "manhattan", "maximum" and others (use ?dist to obtain a full list of available measures). An example of how a distance matrix is calculated from a 6*6 matrix filled with random values is shown below.

#producing a 6*6 matrix with random values
matrix <- matrix(rnorm(36), nrow = 6)

#calculating a distance matrix
distance_matrix <- dist(matrix, method = "euclidean")

distance_matrix
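
To make the three distance measures named above concrete, the following sketch computes them by hand for two short vectors and compares the results with dist(). The vectors a and b and all object names are made-up illustration values, not Golub data.

```r
# Two made-up expression vectors (illustration values only)
a <- c(1, 4, 2)
b <- c(3, 1, 6)

# Distances computed directly from their definitions
euclidean <- sqrt(sum((a - b)^2))  # square root of the summed squared differences
manhattan <- sum(abs(a - b))       # sum of the absolute differences
maximum   <- max(abs(a - b))       # largest absolute difference

# dist() treats every row of its input as one vector
m <- rbind(a, b)
c(by_hand = euclidean, dist = as.numeric(dist(m, method = "euclidean")))
c(by_hand = manhattan, dist = as.numeric(dist(m, method = "manhattan")))
c(by_hand = maximum,   dist = as.numeric(dist(m, method = "maximum")))
```

In each printed pair the hand-computed value and the value returned by dist() should agree.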

Hierarchical Clustering

Hierarchical clustering can be either agglomerative (bottom-up) or divisive (top-down). Bottom-up approaches start with every observation in its own cluster and iteratively merge the two closest clusters until one large cluster is formed. Top-down approaches start with all observations in one cluster and successively split it into smaller clusters until every observation is in its own cluster. The result is visualized as a dendrogram in which the height of the vertical lines depicts the distance between clusters and horizontal lines join the clusters that are merged. To obtain such a dendrogram it is crucial to determine the distance between the newly formed clusters in every iteration step. Examples of such measures are listed below.

Single linkage uses the minimum distance between points belonging to two different clusters
Complete linkage uses the maximum distance between points belonging to two different clusters
Average linkage (UPGMA) calculates all pairwise distances between points in two different clusters and takes the average
Average linkage (WPGMA) is similar to UPGMA, but when two clusters are merged their distances to every other cluster are averaged with equal weight, regardless of cluster size
Centroid linkage finds the centroid of each cluster and uses the distance between the centroids of two different clusters
Ward's method merges the two clusters whose union gives the smallest increase in total within-cluster variance
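
As a rough illustration of how the choice of linkage changes the result, the sketch below clusters the same random matrix with several of the criteria listed above and compares the height of the final merge. It uses the method names accepted by hclust() ("average" for UPGMA, "mcquitty" for WPGMA, "ward.D2" for Ward); the object names and the seed are made up for this example.

```r
set.seed(1)
x <- matrix(rnorm(36), nrow = 6)
d <- dist(x, method = "euclidean")

# linkage criteria to compare, named as hclust() expects them
methods <- c("single", "complete", "average", "mcquitty", "ward.D2")
fits <- lapply(methods, function(m) hclust(d, method = m))
names(fits) <- methods

# height at which the last two clusters are merged, per method
final_height <- sapply(fits, function(fit) max(fit$height))
final_height
```

Single linkage never reports a larger final merge height than complete linkage on the same distances, because it uses the smallest instead of the largest pairwise distance between clusters.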

Given a distance matrix, hierarchical clustering can be applied via R's hclust() function. Its "method" parameter takes a string naming the linkage criterion to use ("single", "complete", ...). Printing the resulting object with print() only shows a short summary of the call; passing it to plot() draws a dendrogram depicting the distances between all clusters.

#producing a 6*6 matrix with random values
matrix <- matrix(rnorm(36), nrow = 6)

#calculating a distance matrix
distance_matrix <- dist(matrix, method = "euclidean")

#applying hierarchical clustering
hc <- hclust(distance_matrix, method = "single")

#plotting the dendrogram
plot(hc, xlab = "")
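
A dendrogram can also be turned into explicit group labels. The sketch below uses cutree() from base R's stats package to cut the tree into a chosen number of clusters. The value k = 2, the linkage method and the seed are arbitrary assumptions for illustration.

```r
set.seed(42)
x <- matrix(rnorm(36), nrow = 6)
hc <- hclust(dist(x, method = "euclidean"), method = "complete")

# assign each of the 6 rows to one of k = 2 clusters
groups <- cutree(hc, k = 2)
groups

# number of rows per cluster
table(groups)
```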

Principal Component Analysis

A gene expression matrix spans a high-dimensional space with one dimension per feature. Presenting such data in a comprehensible way is not trivial, as we are used to plots in two dimensions. Principal Component Analysis (PCA) is a way to reduce the number of dimensions so that the data can be plotted in two-dimensional space. To achieve this, the observations are projected onto new axes chosen so that the distances between the points and each axis stay small while as much variance as possible is covered. These axes are linear combinations of the original features and are called principal components (PCs). The observations are then plotted in the coordinate system of two PCs. Usually, the two PCs covering the largest variance are chosen, as they are likely to represent the most important features.

#producing a 100*10 matrix with random values
matrix <- matrix(rnorm(1000), nrow = 100)

#calculating a distance matrix between the 100 rows
distance_matrix <- dist(matrix, method = "euclidean")

#calculating PCA on the full 100*100 distance matrix
pca <- prcomp(as.matrix(distance_matrix))

#plotting pca
plot(pca$x)

You might now be disappointed by the sample image above as it doesn't show fancy patterns. Since we use a randomly distributed matrix in our example, the distances between the row vectors are likely to be quite similar and therefore the PCA may not contain interesting patterns. To investigate whether that is true, one can plot the variance of each principal component as a bar plot to find out whether the components cover similar amounts of variance. Such a plot is available for every PCA object produced by prcomp() and can be drawn as follows:

#producing a 100*10 matrix with random values
matrix <- matrix(rnorm(1000), nrow = 100)

#calculating a distance matrix between the 100 rows
distance_matrix <- dist(matrix, method = "euclidean")

#calculating PCA on the full 100*100 distance matrix
pca <- prcomp(as.matrix(distance_matrix))

#plotting the variance of each principal component
plot(pca)
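
For contrast, the sketch below runs prcomp() directly on a simulated data matrix that does contain structure: two groups of rows whose means differ. The group sizes, the shift of 3 and all object names are assumptions chosen for illustration. The proportion of variance per component is derived from the sdev element of the prcomp() result, and the last check shows that the PCA scores are linear combinations of the centered columns.

```r
set.seed(1)
# 50 rows per group, 10 columns; the second group is shifted by 3 in every column
group_a <- matrix(rnorm(500, mean = 0), nrow = 50)
group_b <- matrix(rnorm(500, mean = 3), nrow = 50)
x <- rbind(group_a, group_b)
labels <- rep(c("a", "b"), each = 50)

pca <- prcomp(x)

# proportion of the total variance covered by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# the scores are the centered data multiplied by the rotation (loading) matrix
scores <- scale(x, center = TRUE, scale = FALSE) %*% pca$rotation
all.equal(unname(scores), unname(pca$x))

# first two principal components, colored by group
plot(pca$x[, 1:2], col = ifelse(labels == "a", "blue", "red"))
```

With this kind of input the first principal component should cover most of the variance and separate the two groups along the horizontal axis.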

Starting the Shiny Application and using Reactive Components

This package was built to enable users to explore and visualize hierarchical clustering methods and principal component analysis on well-known benchmark data. For this purpose it includes a Shiny app, which can be started after loading the package.

Running the App

To start the Shiny app, load the PPClustA library and call the PPClustA::runapp() function:

library(PPClustA)
PPClustA::runapp()

Reactive components

Calling the runapp() function starts a Shiny dashboard consisting of two tabs, which can be selected by clicking on them. The active tab appears grey while the inactive one stays blue until clicked. For better orientation see the figure below.

Within the "Hierarchical Clustering" tab you can find four interactive elements. On the top left there is a slider which determines the number of genes to be included in calculating the distance matrix between patients. It can be set to any integer from 1 to 500. Below that slider are the "Distance Measure" and "Clustering Method" boxes, in which one can select the methods to be applied during distance matrix calculation and hierarchical clustering. Feel free to play around with them and observe how the plot on the right side changes. Finally, a download button at the bottom left allows you to export the produced plot as a .pdf file. For a detailed overview, or in case you struggle to find one of the mentioned options, please have a look at the figure below.

The "Principal Component" tab contains a slider in the top left as well. As with the one mentioned before, it selects the number of genes included in computing the PCA. This time two plots are displayed within the plot area. The left one plots two principal components chosen by the user and reacts to the number-of-genes slider as well as to the two selection boxes below it. Choose any of the 15 principal components, plot them against each other and see what happens! The right graph is a more colorful view of the same PCA, but it always shows the first two principal components. To compensate for this lack of interactivity, the data points within the graph are colored according to the cancer type of the underlying patient and the covered variance is appended to the axis labels.
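
The controls of the "Hierarchical Clustering" tab map onto a few lines of base R. The following is only a sketch of an equivalent non-interactive workflow and is not the package's actual code: the simulated matrix stands in for the Golub data, selecting the first n_genes columns is an assumption about how genes are chosen, and all object names are hypothetical.

```r
set.seed(1)
# stand-in for expression data: 20 patients (rows) x 500 genes (columns)
expr <- matrix(rnorm(20 * 500), nrow = 20)

n_genes <- 100                   # plays the role of the slider (1 to 500)
distance_measure <- "euclidean"  # plays the role of the "Distance Measure" box
clustering_method <- "complete"  # plays the role of the "Clustering Method" box

# distance matrix between patients, based on the selected genes only
d <- dist(expr[, seq_len(n_genes), drop = FALSE], method = distance_measure)
hc <- hclust(d, method = clustering_method)

# export the dendrogram as a .pdf file, similar to the download button
out_file <- file.path(tempdir(), "dendrogram.pdf")
pdf(out_file)
plot(hc, xlab = "")
dev.off()
```

Changing n_genes, distance_measure or clustering_method and rerunning the lines corresponds to moving the slider or picking another entry in the selection boxes.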