Visualizing Multivariate Data with Radviz"

knitr::opts_chunk$set(warning=FALSE,
                      fig.retina=1,
                      fig.keep='high',
                      fig.align='center')

Abstract

The Radviz package implements the concept of dimensional anchors to visualize multivariate datasets in a 2D projection, as originally described in Hoffman et al 1999. Additional work for ordering of dimensional anchor is taken from Di Caro et al 2012.

Introduction

So called big data has focused our attention on datasets that comprise a large number of items (or things). As a consequence the fact that we are measuring (or recording) more and more parameters (or stuff) is often overlooked, even though this large number of things is enabling us to explore the relationships between the different stuff with unprecedented efficacy.

Yet, this increase in the stuff we measure comes with a so-called Curse of Dimensionality: as our brains are unable to efficiently deal with more than 3 dimensions, we need new tools to visually explore larger (more than 3 dimensions) sets.

Several such tools exist that rely on projection into lower number of dimensions, such as PCA. Although those tools represent an efficient solution, they do not allow for direct interpretation of the position of the points in space.

Radviz provides an elegant solution to this problem, by projecting a N-dimensional data set into a simple 2D space where the influence of each dimension can be interpreted as a balance between the influence of all dimensions.

Building a radviz projection

In Radviz, each dimension in the dataset is represented by a dimensional anchor, and each dimensional anchor is distributed evenly on a unit circle. Each line in the data set corresponds to a point in the projection, that is linked to every dimensional anchor by a spring. Each spring's stiffness corresponds to the value for that particular thing in that particular dimension. The position of the point is defined as the point in the 2D space where the spring's tension is minimum.

Installation

The package can be installed using the following command:

devtools::install_github('yannabraham/Radviz')

Once installed the package can be loaded using the standard library command.

library(Radviz)

Visualizing High Dimensional Data: the bodenmiller dataset

We will use data from the Bodenmiller et al 2012 publication. The bodenmiller package contains a subset of the data that can be used for testing purpose. The package can be installed by running the following command:

install.packages('bodenmiller')

Data has already been cleaned and transformed, and can be used as is.

library(bodenmiller)
data(refPhenoMat)
data(refFuncMat)
data(refAnnots)
refMat <- cbind(refPhenoMat,refFuncMat)

Normalizing the data

Radviz requires that all dimensions have the same range. The simplest way to achieve that is to normalize every numerical column to its range, so that all values fall in the [0,1] interval. The do.L function performs this operation for us:

norm <- apply(refMat,2,do.L,fun=function(x) quantile(x,c(0.025,0.975)))

Defining the anchors

Let's define a Spring object that contains the position of each dimension on the unit circle. The dimensions will be matched by column name to the data, so it is important to make sure that column names are syntactically valid, using the make.names function.

The make.S function will take a vector of column names and return a matrix of [x,y] positions. The order of the column names will dictate the position on the circle, starting at 12:00 and going clockwise. It is not required to use all avaiable columns: it might be worth splitting the dimensions depending on what they represent (continuous versus categorical variables, phenotypic versus functional markers, etc.).

ct.S <- make.S(dimnames(refPhenoMat)[[2]])

Optimizing the position of anchors

The position of dimensional anchors on the circle is critical: the best projection (as judged by the separation of points in the display) is achieved when dimensional anchors that correspond to highly correlated dimensions in the data are placed closer on the unit circle.

Optimization of the position of dimensional anchors is discussed in Di Caro et al 2012, where the authors suggest 2 metrics that can be used in an optimization process. Both methods start with a similarity matrix where every value represents the similarity between any 2 dimensions. The first method assumes that dimensional anchors corresponding to similar dimensions should be close on the circle (radviz-independent); the second method uses a projection of the similarity matrix in radviz and computes the distance of the dimensional anchors to the projected dimensions they represent (radviz-dependent) Di Caro et al 2012.

The radviz-dependent and independent methods have been implemented in the Radviz package, as well as the recommended cosine similarity measure.

## compute the similarity matrix
ct.sim <- cosine(norm)
## the current radviz-independent measure of projection efficiency
in.da(ct.S,ct.sim)
## the current radviz-independent measure of projection efficiency
rv.da(ct.S,ct.sim)

The radviz-independent score should be maximal when the dimensional anchor positions are optimal. The radviz-dependent score should be minimal when the dimensional anchor positions are optimal.

The optimization procedure works as follows:

optim.ct <- do.optim(ct.S,ct.sim,iter=100,n=1000)
ct.S <- make.S(tail(optim.ct$best,1)[[1]])

The original anchor position corresponds to the order of the columns in the matrix:

ksink <- lapply(dimnames(refPhenoMat)[[2]],function(x) cat(' *',x,'\n'))

The optimized anchors are:

ksink <- lapply(row.names(ct.S),function(x) cat(' *',x,'\n'))

Projection

The do.radviz function will then use the normalized values and the Springs to project each thing in a 2D space:

ct.rv <- do.radviz(norm,ct.S)

Visualizing the results

There is a S3 plot function defined for radviz; using the default will give the following result

plot(ct.rv)

One can modify the shape of the points using the point.shape argument:

plot(ct.rv,point.shape=1)

Or color each dot using the point.color argument, to highlight the different cell types.

plot(ct.rv,point.shape=1,point.color=refAnnots$Cells)

This simple plot is already enough to show some of the underlying structure of the dataset.

Using standard visualizations developed for FACS data analysis, one can further explore the data. The first alternative uses smoothed densities returned by a kernel density estimate to generate a 2D density plot:

smoothRadviz(ct.rv)

Another alternative is to generate a contour plot from the data using the do.density function:

ct.rv <- do.density(ct.rv,n=100)

Resulting in the following plot:

contour(ct.rv)

Plots can be overlayed using the add parameters:

smoothRadviz(ct.rv)
contour(ct.rv,add=T)

This method can be used to highlight specific cell populations, in combination with subset.radviz:

cur.pop <- 'igm+'
sub.rv <- subset(ct.rv,refAnnots$Cells==cur.pop)
smoothRadviz(ct.rv)
contour(sub.rv,add=T)

The contour corresponds to r cur.pop cells, in the context of the full sample visualized as a smooth scatter.

Visualizing Signal Intensity for Functional Channels

Further insight can be gained from the use of hexagonal binning:

ct.rv <- do.hex(ct.rv,n=60,
        channels=dimnames(refFuncMat)[[2]],
        ncols=7,
        use.quantile=T
)

Leading to the following plot:

hexplot(ct.rv,mincnt=5)

Each bin can then be colored according to other dimensions in the dataset, such as S6 phosphorylation:

hexplot(ct.rv,mincnt=2,color='pS6')

To be compared with phospho-Akt signal:

hexplot(ct.rv,mincnt=2,color='pAkt')

And phospho-Erk signal:

hexplot(ct.rv,mincnt=2,color='pErk')

Visualizing Cell Populations

Cells have been manually gated, leading to 14 sub-populations:

ksink <- lapply(levels(refAnnots$Cells),function(x) cat(' *',x,'\n'))

To visualize the different populations, we compute a median phenotypic signal intensity by cell population, and project it in radviz:

library(colorspace)
pop.norm <- apply(refMat,2,do.L,fun=function(x) quantile(x,c(0.025,0.975)) )
pop.norm <- apply(pop.norm,2,function(x) tapply(x,refAnnots$Cells,median) )
pop.rv <- do.radviz(pop.norm,ct.S)
pop.size <- table(refAnnots$Cells)
pop.cols <- setNames(rainbow_hcl(length(levels(refAnnots$Cells))),
                     levels(refAnnots$Cells))

Populations are visualized as a bubble chart: the bubble position corresponds to the median signal intensity of each channel for this specific population, and the bubble size is proportional to the number of cells in this population:

par(mfrow=c(1,2))
bubbleRadviz(pop.rv,
             bubble.color=pop.cols[dimnames(pop.norm)[[1]]],
             bubble.size=log(pop.size[dimnames(pop.norm)[[1]]]),
             scale=0.2,
             decreasing=TRUE
)
plot.new()
legend("center",
       legend=names(pop.cols),
       col=pop.cols,
       cex=0.8,
       pch=16,
       bty='n')

Alternatively, bubbles can be colored after signal intensity of a functional channel, such as S6:

S6.cols <- setNames(colorRampPalette(blues9)(8)[cut(pop.norm[,'pS6'],
                                                    breaks=8,
                                                    labels=F,
                                                    include.lowest=TRUE)],
                    dimnames(pop.norm)[[1]])
bubbleRadviz(pop.rv,
             bubble.color=S6.cols[dimnames(pop.norm)[[1]]],
             bubble.fg='grey',
             bubble.size=log(pop.size[dimnames(pop.norm)[[1]]]),
             scale=0.2,
             decreasing=TRUE
)

The population with the highest pS6 signal intensity corresponds to r names(which.max(pop.norm[,'pS6'])) cells.

Future directions

Radviz is an efficient visualization for multivariate datasets that allow for direct interpretation of results. This package provides a starting point to apply radviz to several datasets with minimal effort. Future developments include:



Try the Radviz package in your browser

Any scripts or data that you put into this service are public.

Radviz documentation built on May 29, 2017, 2:39 p.m.