Identifying Treatment effects using hilbertSimilarity"
In hilbertSimilarity: Hilbert Similarity Index for High Dimensional Data

library(knitr)
opts_chunk$set(warning=FALSE,
               fig.width=8,
               fig.height=8,
               fig.retina=1,
               fig.keep='high',
               fig.align='center')

Introduction

Comparing samples defined over a single dimension is a straightforward task that relies on standard, well established methods. Meanwhile distance between samples in high dimensional space remains a largely unexplored field. Available solutions rely on multivariate normal distributions, a condition that is both difficult to check and overlooking key behaviors in biological samples, where populations of interest often correspond to a small proportion (<1%) of all the points that have been measured.

We have developed hilbertSimilarity to address the problem of sample similarity in mass cytometry where samples are measured over up to 100 dimensions, and where each sample contains a slightly different number of points (or cells). Our method first transforms each sample into a probability vector, by dividing each dimension into a fixed number of bins and by associating each cell to a specific multidimensional cube. The proportion of cells in each hypercube is characteristic of a given sample. To characterize an compare samples we use measures from Information Theory, since their interpretation in terms of similarity and complexity is straightforward.

After determining sample similarity, our method can also be used to determine which cells are different between samples, by comparing the number of cells in each hypercube to a reference treatment. Since every sample contains a different number of cells, we use a bootstrap approach where the same number of cells is sampled from each treatment to allow for a direct comparison.

To demonstrate the power of hilbertSimilarity we applied the method to a subset of the bodenmiller et al. dataset, comparing the effect of different stimulations and identifying groups of cells that are significantly affected by different treatments.

Compared to other methods, hilbertSimilarity does not rely on expert-driven gating, or require any hypothesis about the number of populations that are present in the data. This makes it a powerful tool to quickly assess a dataset before using traditional methods, or when populations a not known a priori.

Installation

hilbertSimilarity can be installed using the following command

devtools::install_github(yannabraham/hilbertSimilarity)

Once installed the package can be loaded using the standard library command.

library(hilbertSimilarity)

Loading some test data

We first need data from the bodenmiller package:

library(bodenmiller)
data(refPhenoMat)
data(untreatedPhenoMat)
data(refFuncMat)
data(untreatedFuncMat)
data(refAnnots)
refAnnots$Treatment <- 'reference'
data(untreatedAnnots)
fullAnnots <- rbind(refAnnots[,names(untreatedAnnots)],
                    untreatedAnnots)
fullAnnots$Treatment <- factor(fullAnnots$Treatment)
fullAnnots$Treatment <- relevel(fullAnnots$Treatment,'reference')
refMat <- cbind(refPhenoMat,refFuncMat)
untreatedMat <- cbind(untreatedPhenoMat,
                      untreatedFuncMat)
fullMat <- rbind(refMat,untreatedMat)
fullMat <- apply(fullMat,2,function(x) {
    x[x<0] <- 0
    return(x)
  }
)

In this dataset r nrow(fullMat) cells corresponding to r nlevels(fullAnnots$Treatment) have been measured over r ncol(fullMat) channels. Cells have been gated into r nlevels(fullAnnots$Cells) populations. The following treatments are compared to one another:

nCellTreats <- with(fullAnnots,table(Cells,Treatment))
pCellTreats <- apply(nCellTreats,2,function(x) x/sum(x))
kable(nCellTreats)

The smallest cell population, r names(which.min(100*table(fullAnnots$Cells)/nrow(fullAnnots))), corresponds to r round(min(100*table(fullAnnots$Cells)/nrow(fullAnnots)),3)% of the total cells.

Computing the similarity between treatments

We will first compute the similarity between treatments, by generating a 3^rd^-order Hilbert curve using combined cuts:

computing the cuts

# compute the 
nbins <- 3
cuts <- make.cut(fullMat,
                 n=nbins+1,
                 count.lim=40)

cutting the matrix

cutFullMat <- do.cut(fullMat,cuts,type='combined')

ordering the dimensions

library(entropy)
miFullMat <- matrix(NA,nrow=ncol(fullMat),ncol = ncol(fullMat) )
for (i in seq(ncol(fullMat)-1)) {
  for (j in seq(i+1,ncol(fullMat))) {
    cur.tbl <- table(cutFullMat[,i],cutFullMat[,j])
    nent <- 1-mi.empirical(cur.tbl)/entropy.empirical(cur.tbl)
    miFullMat[i,j] <- miFullMat[j,i] <- nent
  }
}
dimnames(miFullMat) <- list(colnames(fullMat),colnames(fullMat))
hcFullMat <- hclust(as.dist(miFullMat))

generating the Hilbert curve

hc <- do.hilbert(cutFullMat[,rev(hcFullMat$order)],2)

Visualizing the different samples

library(reshape2)
library(dplyr)
library(ggplot2)
treatment.table <- with(fullAnnots,
                        table(hc,Treatment))
treatment.pc <- apply(treatment.table,2,function(x) x/sum(x))
av <- andrewsProjection(t(treatment.pc),breaks=40)
treatment.proj <- t(treatment.pc) %*% t(av$freq)
melt(treatment.proj) %>% 
    rename(AndrewsIndex=Var2) %>% 
    mutate(AndrewsIndex=av$i[AndrewsIndex]) %>% 
    ggplot(aes(x=AndrewsIndex,y=value))+
    geom_line(aes(group=Treatment,color=Treatment))+
    theme_light(base_size=16)+
    theme(axis.title.x=element_blank(),
          axis.text.x=element_blank(),
          axis.ticks.x=element_blank(),
          axis.title.y=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank())

Identifying treatment effects

For the comparison to be meaningful, we need to limit the analysis to bins that contain a minimum number of cells. This number is arbitrary, and represents the smallest group of cells that should be called a population. We also need to define the number of bootstrap, the significant fold change, and the limit of significance:

# total cells per Hilbert index per treatment
treatment.table <- with(fullAnnots,
                        table(hc,Treatment))
# minimum number of cells
lim <- 40
# number of bootstraps
N <- 400
# significant fold change
flim <- 2
# limit of significance
p <- 0.95
# bins with enough cells
boot.ind <- rownames(treatment.table)[which(rowSums(treatment.table)>=lim)]
# number of cells in each bootstrap
ncells <- floor(min(colSums(treatment.table[boot.ind,]))/1000)*1000

For this exercise we use r lim as the minimum population size. r length(boot.ind) indices out of r nrow(treatment.table) contain enough cells.

Because each treatment corresponds to a different number of cells, we will use a bootstrap approach to identify indices that are significantly different between conditions. we will bootstrap r ncells cells from the selected bins, then check which bin contains at least r flim times more cells in the reference compared to other treatments. This operation will be repeated r N times, and any bin that is consistently different in r p*N bins will be considered significant.

library(abind)
boot.mat <- lapply(seq(1,N),function(x) {
    if (x%/%100==x/100) cat(x,'\n')
    # bootstrap hilbert indices
    cur.boot <- sample(which(hc %in% boot.ind),
                       size=ncells,
                       replace=TRUE)
    return(table(factor(hc[cur.boot],levels=boot.ind),
                 droplevels(fullAnnots[cur.boot,'Treatment'])))
  }
)

boot.mat <- abind(boot.mat,along=3)

boot.log <- log2(boot.mat)
boot.fold <- sweep(boot.log[,-1,],c(1,3),boot.log[,1,])
boot.sign <- sign(boot.fold)
boot.sign[abs(boot.fold)<log2(flim)] <- 0
boot.res <- apply(boot.sign,c(1,2),function(x) table(sign(x))[c('-1','0','1')])
boot.res <- aperm(boot.res,c(2,1,3))
boot.res[is.na(boot.res)] <- 0
dimnames(boot.res)[[2]] <- c('-1','0','1')

After the analysis, the following number of bins are found significant:

sig.res <- apply(boot.res[,c('-1','1'),]>(N*p),c(1,3),any)
sum.sig.res <- cbind(colSums(sig.res),
                     colSums(treatment.table[boot.ind[rowSums(sig.res)>0],colnames(sig.res)]))

p.cells.sig.res <- lapply(colnames(sig.res),
                          function(cur.cl) {
                              x <- treatment.table[boot.ind,cur.cl]
                              return(sum(x[sig.res[,cur.cl]])/sum(x))
                            }
                          )
names(p.cells.sig.res) <- colnames(sig.res)
sum.sig.res <- cbind(sum.sig.res,
                     round(100*unlist(p.cells.sig.res),0))
colnames(sum.sig.res) <- c('Number of indices',
                           'Number of Significant cells',
                           'Percent Significant Cells')
kable(sum.sig.res)

Focusing on BCR/FcR-XL, which are the r sum(sig.res[,'BCR/FcR-XL']) indices that were found significant?

kable(treatment.table[boot.ind[sig.res[,'BCR/FcR-XL']],])

Compared to the unstimulated sample, cells are decreasing in 2 bins and increasing in a 3^rd^ one. Based on manual gating, what do these bins correspond to?

cur.treatment <- 'BCR/FcR-XL'
sig.cells <- with(fullAnnots,table(droplevels(Cells[hc %in% boot.ind[sig.res[,cur.treatment]]])))
total.sig.cells <- with(fullAnnots,table(droplevels(Cells[Cells %in% names(sig.cells) &
                                                       hc %in% boot.ind])))
treatment.sig.cells <- with(fullAnnots,
                            table(droplevels(Cells[hc %in% boot.ind[sig.res[,cur.treatment]] &
                                                     Treatment==cur.treatment])))
treatment.total.cells <- with(fullAnnots,
                              table(droplevels(Cells[Cells %in% names(sig.cells) &
                                                       hc %in% boot.ind &
                                                       Treatment==cur.treatment])))
sum.sig.cells <- cbind(sig.cells,
                       total.sig.cells,
                       round(100*sig.cells/total.sig.cells),
                       treatment.sig.cells,
                       treatment.total.cells,
                       round(100*treatment.sig.cells/treatment.total.cells))

colnames(sum.sig.cells) <- c('Number of Significant Cells',
                             'Total number of Cells',
                             'Percent of Total',
                             paste('Number of Signigicant Cells in',cur.treatment),
                             paste('Number of Cells in',cur.treatment),
                             paste('Percent of',cur.treatment))
kable(sum.sig.cells)

All the cells that are found significant are B cells, and they represent a large fraction of all B cells in the r cur.treatment sample, even when only taking the bootstrapped indices into account.