First load your data (code hidden).

library(flowCore)
library(pheatmap)
library(cluster)
library(ggplot2)
library(reshape)
library(DirichletReg)
library(pls)
devtools::load_all(".")
myred <- rgb(0.8, 0.1, 0.4, 0.4)
mygreen <- rgb(0, 0.4, 0.25, 0.4)
myblue <- rgb(0.4, 0.4, 0.9, 0.2)
mygrey <- rgb(0.2, 0.2, 0.2, 0.4)
mymauve <- rgb(0.8, 0.1, 0.8, 0.4)


load('data/model.RData')
load('data/D.RData')

sample_names <- c('K1', 'K2', 'M1', 'M2', 'T1', 'T2', paste('HD', 1:4, sep = ''), paste('K', 1:17 + 2, sep = ''), paste('M', 1:8 + 2, sep = ''), paste('T', 1:21 + 2, sep = ''))
sample_names1 <- c('BM', 'BM', 'S', 'S', 'T', 'T', rep('PB', 4), rep('BM', 17), rep('S', 8), rep('T', 21))
sample_names <- sample_names[-c(11, 21, 28)]
sample_names1 <- sample_names1[-c(11, 21, 28)]

Extract table of cell counts in each bin for each donor, and generate a biplot.

#model <- make_bins(D_transform, D$sample_fac, 4, 14)
donor_coords <- get_xtab(model, 0.5, sample_names1)
donor_coords <- logratio_data(donor_coords + 0.00001)

L <- svd(donor_coords)
bp <- make_biplot(L, 1, 3)
plot(bp$F, type = 'n', xlab = 'biplot axis 1', ylab = 'biplot axis 2', cex.lab = 1.5, main = 'biplot of donors')
text(bp$F, labels = sample_names1)

Let's now look at the arrangement of bins in this biplot. Each point is a bin, and their arrangment is informative of how they contribute to the separation of donors.

plot(bp$G, pch = 16, cex = 0.5, main = 'biplot of bins, based on cell counts')

On its own, it is not very informative. We are only viewing information about how cell counts correlate across bins, and nothing about the spectral fingerprint of each bin. We have to do a little more work to get this out. First, get all events that correspond to each bin (combining all the samples). Each row is an event (cell) and each column is a channel. Then get the column/channel medians as a summary of each bin.

data_bins <- get_bins_lineplot(model, w = 0.5, D$D_comp, sample_fac = D$sample_fac, rebin = FALSE)

markers <- c('IgA', 'IgD', 'IgM', 'IgG', 'CD27', 'CD19', 'Dump', 'CD20')
data_bins_median <- mapply(function(x) apply(x, 2, median), data_bins[[1]])
rownames(data_bins_median) <- markers
# pheatmap(t(scale(cbind(bp$G[-1, ], row_col_center(t(data_bins_median))))), cluster_rows = FALSE)

#CL_all <- all_hclust(row_col_center(t(data_bins_median)), 2:30)
#component <- comp_svd(t(data_bins_median))$u
#plot(clustergram(CL_all, row_col_center(t(data_bins_median))[, 4]))
#plot(clustergram(CL_all, component[, 3]))

Now we can visualise this data. First we row and column center the data, as in preparation for SMA. We can make a heatmap

pheatmap(row_col_center(t(data_bins_median)), cluster_cols = FALSE, clustering_method = 'ward.D2')

We can also make a biplot of the data. Note the three major branches of cells.

plot(comp_svd(t(data_bins_median))$u, pch  = 16, cex = 0.5, main = 'biplot of bins, based on spectra')

We can cluster the bins according to spectral similarity using Ward's method.

CL <- hclust(dist(row_col_center(t(data_bins_median))), method = 'ward.D2')
plot(CL)

In general, there's no such thing as the 'best' clustering. Here, we assume that cells jointly have some non-random (non-spherical) probability distribution, but are potentially on a continuum. Therefore, we are interested in how the hierarchy of cell types evolve. Here, we investigate 2 - 10 clusters, defined by successive splits of the dendrogram above. The above biplot is replotted, with bins colour-coded according to cluster membership. Each successive panel corresponds to a finer clustering. We are also interested in how cell types (defined by increasing similarity in spectral fingerprint) are localised in the cell count biplot. We can see that they are localised, yet overlapping (at least in the first 2 components).

cut_CL <- cutree(CL, k = 2:20)
df <- do.call(rbind, lapply(1:9, function(i) cbind(row_col_center(t(data_bins_median)), bp$G[-1, ], cut_CL[, i], i+1)))
colnames(df) <- c(markers, paste('bp', 1:3, sep = ''), 'CL', 'N')
df <- as.data.frame(df)
df$N <- factor(df$N)
df$CL <- factor(df$CL)
p <- ggplot(df, aes(x = bp1, y = bp2, colour = CL))
p + geom_point() + facet_wrap( ~ N)

Let's say we are thinking 'conventionally' in terms of individual markers, we might want to know which markers contributes to the separation between samples. To visualise this, we can plot the first component of the cell count biplot with each markers' expression. With perhaps the exception of IgA, there is no neat relationship. This underlines the need to distinguish between cell types via their spectral fingerprints - cells in two different 'states' can have identical expression of a marker. If you try and reconstruct this by sequential gating you can never quite resolve all these multivariate connections.

df1 <- df[, !(names(df) %in% c('bp2', 'bp3', 'CL', 'N'))]
df1 <- melt(df1, id.vars = 'bp1')
ggplot(df1, aes(x = bp1, y = value)) + geom_point() + facet_wrap(~ variable)

This is an additional plot showing how marker expression is localised in the biplot. There are similarities to the plot above where bins are colour coded according to clusters

df1 <- cbind(bp$G[-1, ], row_col_center(t(data_bins_median)))
colnames(df1) <- c(paste('bp', 1:3, sep = ''), markers)
df1 <- as.data.frame(df1)
df1 <- melt(df1, id.vars = c('bp1', 'bp2', 'bp3'))
ggplot(df1, aes(x = bp1, y = bp2, colour = value)) + geom_point() + facet_wrap(~ variable)

This plot summarises marker expression in the ten clusters of bins. It looks quite specific.

#ggplot(df, aes(x = CL, y = IgG)) + facet_wrap( ~ N) + geom_boxplot()

df_reshape <- melt(df[, !(names(df) %in% c('bp1', 'bp2', 'bp3'))])
p1 <- ggplot(df_reshape, aes(x = variable, y = value)) + geom_boxplot() + facet_wrap( ~ CL)
p1 %+% subset(df_reshape, N == 10)
markers

This can be compared to a coarser level of the cluster hierarchy. The clusters are less specific. However, on the right, the bins are specifically high in the dump channel and for CD27. On the left, they are specifically low in the dump channel and high for CD19 and CD20 (B cells). The middle boxplot is a mixture of cells. Therefore, rather than sequentially gating on individual channels, maybe there is a possibly for multivariate gating?

p1 %+% subset(df_reshape, N == 3)

Here I've tried to display the hierarchy of cell types in a conventional dotplot. I chose the pairs of parameters arbitrarily. It would be hard to replicate this with manual gating as some 'populations' are completely spread out in some channels, yet they are grouped together because of the underlying spectral similarity.

# make fcs files

dotplot_cluster <- lapply(1:6, function(AB) {
tmp_file <- lapply(unique(cut_CL[, AB]), function(J) do.call(rbind, data_bins[[1]][cut_CL[, AB] == J]))
dotplot <- do.call(rbind, lapply(tmp_file, function(.f) .f[sample(nrow(.f), floor(0.1*nrow(.f))), ]))
col <- rep(1:length(tmp_file), sapply(tmp_file, function(.f) floor(0.1*nrow(.f))))
return(list(dotplot = dotplot, col = col))
}
)

for (i in 1:6 ) {
par(mfrow = c(2, 2))
plot(dotplot_cluster[[i]]$dotplot[, c(1:2)], col = dotplot_cluster[[i]]$col, pch = 16, cex = 0.2, xlab = markers[1], ylab = markers[2])
plot(dotplot_cluster[[i]]$dotplot[, c(3:4)], col = dotplot_cluster[[i]]$col, pch = 16, cex = 0.2, xlab = markers[3], ylab = markers[4])
plot(dotplot_cluster[[i]]$dotplot[, c(5:6)], col = dotplot_cluster[[i]]$col, pch = 16, cex = 0.2, xlab = markers[5], ylab = markers[6])
plot(dotplot_cluster[[i]]$dotplot[, c(7:8)], col = dotplot_cluster[[i]]$col, pch = 16, cex = 0.2, xlab = markers[7], ylab = markers[8])
}

It's possible to visualise how cell counts in the clusters differ with respect to the four categories using a boxplot (with counts scaled by square root for convenience).

dc <- get_xtab(model, 0.5, sample_names1)
df2 <- data.frame(sapply(unique(cut_CL[, 9]), function(i)  rowSums(dc[, -1][, cut_CL[, 9] == i])))
df2 <- cbind(df2, factor(sample_names1))
colnames(df2) <- c(paste('N', unique(cut_CL[, 9]), sep = ''), 'samples')
rownames(df2) <- NULL
df2 <- data.frame(df2)                   
df2 <- melt(df2, id = 'samples')

#ggplot(df2, aes(x = samples, y = sqrt(value))) + geom_boxplot() + facet_wrap( ~ variable)
ggplot(df2, aes(x = variable, y = sqrt(value))) + geom_boxplot() + facet_wrap( ~ samples)

Let's try digging a little deeper into these count patterns, and how they interact with marker expression. First we can fit a model to estimate the cell counts for tissue type in each cluster.

counts <- sapply(unique(cut_CL[, 9]), function(i)  rowSums(dc[, -1][, cut_CL[, 9] == i]))
counts <- cbind(dc[, 1], counts)
counts[counts == 0] <- min(counts[counts > 0])/2
dr_df <- data.frame(sample = factor(sample_names1))
dr_df$counts <- DR_data(counts)

model <- DirichReg(counts ~ sample, dr_df)
summary(model)
# this could be used to visualise location estimation above
# counts <- dc[, -1][, cut_CL[, 9] == 1]
# counts[counts == 0] <- min(counts[counts > 0]) / 2
# counts <- logratio_data(counts)
# L <- svd(counts)
# bp <- make_biplot(L, 1, 3)
# plot(bp$F, type = 'n', xlab = 'biplot axis 1', ylab = 'biplot axis 2', cex.lab = 1.5)
# text(bp$F, labels = sample_names1)
# experiments with co-inertia analysis
# library(ade4)
# par(mfrow = c(3, 4))
# for (i in 1:10 ) {
# i=1
# pca1 <- dudi.coa(t(data_bins_median)[cut_CL[, 9] == i, ], scan = FALSE,  nf = 8)
# dc <- get_xtab(model, 0.5, sample_names1)
# dc <- t(donor_coords[, -1])[cut_CL[, 9] == i, ]
# 
# pca1 <- dudi.pca(row_col_center(10^t(data_bins_median)), scan = FALSE, nf = 8, scale = FALSE)
# 
# dc <- get_xtab(model, 0.5, sample_names1)
# dc <- t(dc[, -1])
# dc[dc == 0] <- min(dc[dc > 0])/2
# colnames(dc) <- sample_names
# dc <- row_col_center(dc)
# pca2 <- dudi.pca(dc, scan = FALSE,  nf = 8, scale = FALSE)
# coin <- coinertia(pca1, pca2, scan = FALSE, nf = 8)
# 
# col <- rep(c('red', 'blue'), c(nrow(coin$lX), nrow(coin$lY)))
# pch <- rep(15:16, c(nrow(coin$lX), nrow(coin$lY)))
# plot(rbind(as.matrix(coin$lX[, c(1, 2)]), as.matrix(coin$lY[, c(1, 2)])), pch = pch, col= col, cex = 0.5)
# segments(coin$lX[, 1], coin$lX[, 2], coin$lY[, 1], coin$lY[, 2], lwd = 0.2)


# do PLS individually for each marker
# heatmap of correlation  of count matrix (subgroup of bins according to dendrogram cut_CL), ordered according to PLS predicting marker expression in the bins. 
# pheatmap(cor(donor_coords[order(sample_names1), -1][, cut_CL[, 9] == 3])[order(pls_markers[[1]]$scores[, 1]), order(pls_markers[[1]]$scores[, 1])], colorRampPalette(c(myred, mygrey, myblue), bias = 1)(100), cluster_rows = FALSE, cluster_cols= FALSE, border_color = NA)

# the trend in the above correlation matrix isn't so obvious in this heatmap of the count matrix. This is because individual donors display a lot of variation. In the correlation matrix, the information is pooled across donors, thereby strengthening the information. 
# pheatmap(donor_coords[order(sample_names1), -1][, cut_CL[, 9] == 3][, order(pls_markers[[1]]$scores[, 1])], colorRampPalette(c(myred, mygrey, myblue), bias = 1)(100), cluster_rows = FALSE, cluster_cols= FALSE, border_color = NA)


# heatmap of expression level, matching the heatmap above. ordered according to PLS predicting marker, you will notice that the predicted marker has  a gradient, other markers not necessarily so. 
# pheatmap(data_bins_median[, cut_CL[, 9] == 3][, order(pls_markers[[1]]$scores[, 1])], colorRampPalette(c(myred, mygrey, myblue), bias = 1)(100), cluster_rows = FALSE, cluster_cols= FALSE, border_color = NA)

Within a cluster, is there a gradient of marker expression, and does this influence cell counts? Remember, there may be individual gradients for each marker, but it's another question as to how the marker gradients interact, i.e. what is the correlation between markers? Let's investigate cluster 1 (from 10 clusters). Looking at the boxplot of marker expression in each bin, they t-cells/monocytes (high in Dump and CD27, low in CD19 and CD20)

df <- data.frame(X = I(t(donor_coords[order(sample_names1), -1])[cut_CL[, 9] == 1, ]), Y = I(t(row_col_center(10^data_bins_median))[cut_CL[, 9] == 1, ]))

par(mfrow = c(1, 1))
boxplot(t(row_col_center(10^data_bins_median))[cut_CL[, 9] == 1, ], ylab = 'marker expression')

First of all, fit a partial least squares (PLS) model for cluster 1, using the count matrix to predict the marker expression. Using 4 components (arbitrarily chosen), it is possible to predict quite well the marker expression, meaning there is further sub-structure. Here is a plot of predicted vs. actual marker expression.

pls_model_all <- plsr(Y ~ X, data = df)
plot(pls_model_all, line = TRUE, ncomp = 4)
par(mfrow = c(1, 1))

The sucessive PLS components correspond to the directions in which both marker expression and counts mutually have most variance. Along these directions (axes), each bin has a score. We want to see how bin score correlates with marker expression. Here we inspect the first component. Equivalently, we can also inspect the correlation matrix of the markers.

par(mfrow = c(3, 3))
for (i in 1:8)
plot(df$Y[order(pls_model_all$scores[, 1]), i], xlab = 'rank bin score 1st comp', ylab = 'expression', main = markers[i])

pheatmap(cor(df$Y))

In the same fashion, let's look at median cell count with respect to the rank of the bin score. Just judging by eye, there seem to be a nice relationship with tonsil and spleen, but not bone marrow and p. blood. It means that on average, tonsil cell count increases across the marker gradient whereas spleen decrease, and 'it's complicated' for the other two.

par(mfrow = c(2, 2))
for (i in 1:4) {
d_boxplot <- donor_coords[, -1][, cut_CL[, 9] == 1][, order(pls_model_all$scores[, 1])][rownames(donor_coords) == c('BM', 'T', 'S', 'PB')[i], ]
plot(apply(d_boxplot, 2, median), xlab = 'rank bin score 1st comp', ylab = 'cell count', main = c('BM', 'T', 'S', 'PB')[i])
#boxplot(d_boxplot, xlab = 'rank bin score 1st comp', ylab = 'cell count', main = c('BM', 'T', 'S', 'PB')[i])
}

However, in the second component there is potentially a nice relationship for bone marrow, related to a more complicated combination of markers.

par(mfrow = c(2, 2))
for (i in 1:4) {
d_boxplot <- donor_coords[, -1][, cut_CL[, 9] == 1][, order(pls_model_all$scores[, 2])][rownames(donor_coords) == c('BM', 'T', 'S', 'PB')[i], ]
plot(apply(d_boxplot, 2, median), xlab = 'rank bin score 1st comp', ylab = 'cell count', main = c('BM', 'T', 'S', 'PB')[i])
#boxplot(d_boxplot, xlab = 'rank bin score 1st comp', ylab = 'cell count', main = c('BM', 'T', 'S', 'PB')[i])
}

We can look at the heatmap of the cell count matrix, ordered by this first component to see the gradient in cell counts.

pheatmap(donor_coords[order(sample_names1), -1][, cut_CL[, 9] == 1][order(pls_model_all$loadings[, 1]), order(pls_model_all$scores[, 1])],  cluster_rows = FALSE, cluster_cols = FALSE, border_color = NA, )

how does this relate to marker-count cross-correlation matrix? cross tabulate count clustering and marker clustering

# pca1 <- dudi.pca(row_col_center(10^t(data_bins_median)), scan = FALSE, nf = 8, scale = FALSE)
# 
# dc <- get_xtab(model, 0.5, sample_names1)
# dc <- t(dc[, -1])
# dc[dc == 0] <- min(dc[dc > 0])/2
# colnames(dc) <- sample_names
# dc <- row_col_center(dc)
# pca2 <- dudi.pca(dc, scan = FALSE,  nf = 8, scale = FALSE)
# coin <- coinertia(pca1, pca2, scan = FALSE, nf = 8)
# Multiview clustering

#Try making a graph between count clusters, based on whether they have marker clusters in common (and vice versa). 

# CL_counts <- hclust(dist(t(donor_coords[, -1])), method = 'ward.D2')
# cut_CL_counts <- cutree(CL_counts, k = 2:20)
# 
# xtab_CL <- table(list(cut_CL_counts[, 19], cut_CL[, 19]))
# 
# xtab_df <- data.frame(V1 = paste(rep(1:nrow(xtab_CL), ncol(xtab_CL)), 'A', sep = ''), V2 = paste(rep(1:ncol(xtab_CL), rep(nrow(xtab_CL), ncol(xtab_CL))), 'B', sep = ''), weights = c(xtab_CL))
# xtab_df <- xtab_df[xtab_df$weights > 0, ]
# 
# G <- graph_from_data_frame(xtab_df, directed = FALSE)
# E(G)$weight <- (xtab_df$weights/max(xtab_df$weights))^1.2
# V(G)$color <- rep(c(myred, myblue), c(nrow(xtab_CL), ncol(xtab_CL)))
# #layout <- layout.reingold.tilford(G, circular=T)
# layout <- layout.fruchterman.reingold(G, niter = 10000)
# plot(G, edge.width = E(G)$weight*10, vertex.color = V(G)$color, layout = layout)
# heatmap of cell counts, bins ordered according to clustering of marker expression. 
# D_heatmap <- donor_coords[, -1][order(sample_names1), order(cut_CL[, 14])] 
# colnames(D_heatmap) <- as.character(cut_CL[, 14])[order(cut_CL[, 14])]
# pheatmap(D_heatmap, cluster_rows = FALSE, cluster_cols = FALSE, colorRampPalette(c('blue', 'black',   'red'), bias = 0.7)(1000))


kristenmf/flowR documentation built on May 20, 2019, 6:15 p.m.