Introduction

Flow cytometry data is typically analysed by gating but this is very labour-intensive. The direct algorithmic translation of gating is clustering, but it is well known to be hard to find the 'best' clustering. Furthermore, when analysing a large number of samples, there is the additional obstacle of matching clusters across samples. Such algorithms often turn out to be computationally intensive.

A clustering paradigm also forces cells into a discrete number of containers, however it may be the case that cells are existing 'between' states, i.e. cells are on a continuum. It may be more appropriate to describe cells states in a probabalistic fashion. Clustering yields list-type structures and this obscures the dynamic relationship between markers, especially as it manifests across a large number of samples.

To this end, we propose viewing flow cytometry data such that systematic variation of the markers is emphasised. We devise summary matrices of a large number of samples that emphasise two aspects that are of key biological interest: marker coexpression at the sample level (bulk frequency), and marker coexpression at the single cell level (double positive frequency). There is also the possibility to examine how single cell coexpression depends on sample level coexpression. Some key features of this methodology is that it is computationally cheap, and summarises the data such that it can be readily integrated with other data sources, e.g. clinical covariates.

Data

Eight healthy donor samples are stimulated eight times each by eight different pathogens, yielding 64 samples. The pathogens are candida, E.coli and C.leptum (gut), mite, aspergillus and birchgrass (lung), CMV and influenza (virus). As antigen specific T-cells occur in such low numbers, they are first pre-enriched for CD4. The CD154 distribution is compared to that of an unstimulated sample for that donor, and only events with CD154 greater than the 95th percentile of the unstimlated are included. Among the stimulated cells, the following markers are of interest: IL17, IL22, IFNg, GM-CSF, IL4, IL5 and IL10. T-helper sets are defined as following: Th1 by IFNg, Th2 by IL4 and IL5, Th17 by IL17 and IL22. IL10 (is somehow implicated in) Tregs (regulatory Tcells). Th2 cells are thought/known to be implicated in allergic reactions when they are uncontrolled by Tregs. It is of interest to understand the pathogen specific regulation of coexpression as this points to the underlying transcriptional programmes that regulate cytokine expression.

library(flowCore)
library(pheatmap)
#library(cluster)
library(ggplot2)
#library(reshape)
library(ade4)
#library(ca)
library(vcd)
library(igraph)
library(glasso)
library(plyr)
#library(dunn.text)
#devtools::load_all(".")
library(flowR)

# load data

fn <- list.files('~/2015/2014-12-04_CompareDonorsManuscript/2015-09-24_Rscripts/Tcell/data/')
Data <- lapply(1:length(fn), function(i) read.FCS(paste('~/2015/2014-12-04_CompareDonorsManuscript/2015-09-24_Rscripts/Tcell/data/', fn[i], sep = '')))
metadata <- data.frame(t(mapply(function(x) strsplit(strsplit(x, split = '[.]')[[1]][1], '_')[[1]], fn))[, -1])

Markers <- c('IL10', 'IL22', 'GMCSF', 'IL4', 'IFNg', 'IL5', 'IL17', 'CD154')
markers_index <- c(2, 3, 4, 8, 9, 10, 11, 14)

C <- factor(rep(1:10, 8))

lgcl <- logicleTransform(w = 2, t = 10000, m = 4.5)
L <- transformList(c("b-LP505 525_50-B-A", "b-LP650 675_20-A-A", "r-670_30-C-A", "v-LP555 575_15-B-A", "v-LP600 610_20-A-A", "yg-585_15-E-A", "yg-LP600 610_20-D-A", "yg-LP750 780_60-A-A"), lgcl)
D_comp <- mapply(function(x) compensate(x, description(x)$`$SPILLOVER`), Data, SIMPLIFY = FALSE)
D_comp_lgcl <- mapply(function(x) exprs(transform(x, L)), D_comp, SIMPLIFY = FALSE)



Data <- mapply(exprs, Data)

# only want events with CD154 above background (i.e. compared to unstimulated cells)

q <- lapply(1:8, function(i) quantile(D_comp_lgcl[metadata$X1 == levels(metadata$X1)[i] & metadata$X2 == '1' & metadata$X3 == '2'][[1]][, 14], 0.95))
i = 1
D_comp_cd154 <- lapply(1:8, function(i) {
  tmp <- D_comp_lgcl[metadata$X1 == levels(metadata$X1)[i]][-c(2, 10)]
  tmp <- mapply(function(x) x[x[, 14] > q[i], ], tmp, SIMPLIFY = FALSE)
  return(tmp)
}
)
D_comp_cd154 <- unlist(D_comp_cd154, recursive = FALSE)

D_comp_cd154 <- mapply(function(x) x[, markers_index], D_comp_cd154, SIMPLIFY = FALSE)
for (i in 1:length(D_comp_cd154)) {
  colnames(D_comp_cd154[[i]]) <- Markers
}

D_comp_cd154 <- mapply(as.data.frame, D_comp_cd154, SIMPLIFY = FALSE)

sample_fac <- data.frame(donor = rep(1:8, rep(8, 8)), stim = rep(1:8, 8))
stim <- c('asp', 'can', 'mite', 'bgr', 'cmv', 'ecoli', 'clep', 'inf')
stim_type <- rep(c('lung', 'gut', 'lung', 'lung', 'virus', 'gut', 'gut', 'virus'), 8)
breaks <- make_breaks(mapply(function(x) x[, 1:7], D_comp_cd154, SIMPLIFY = FALSE))


burt <- make_burt(mapply(function(x) x[, 1:7], D_comp_cd154, SIMPLIFY = FALSE), breaks, Markers[1:7])

Basic statistics

Some basic statistics: proportion of at least single positive, at least double positive, at least triple positive, at least quadruple positive. It is most likely a cell will be single positive, and almost impossible to be quadruple positive. In this dataset, the majority of stained cells are single or double positive.

burt_all <- do.call(rbind, burt)
t1 <- apply(burt_all[, 1:7], 2, function(y) by(burt_all$counts, y, sum)[2]/length(burt))
t1


# at least double positive 
t2 <- apply(burt_all[, 1:7], 2, function(x) sum(burt_all$counts[rowSums(burt_all[, 1:7]) > 8 & x == 2])/sum(burt_all$counts))
t2

# at least triple positive 
t3 <- apply(burt_all[, 1:7], 2, function(x) sum(burt_all$counts[rowSums(burt_all[, 1:7]) > 9 & x == 2])/sum(burt_all$counts))
t3

# at least quadruple positive 
t4 <- apply(burt_all[, 1:7], 2, function(x) sum(burt_all$counts[rowSums(burt_all[, 1:7]) > 10 & x == 2])/sum(burt_all$counts))
t4

# library(xtable)
# xtable(cbind(t1, t2, t3, t4), display = rep('e', 5))

Sample level coexpression

Construct a matrix where each row is a donor, and each column corresponds to a marker. Each entry is the logratio of the donor's marker frequency to the average marker frequency. Here, 'marker frequency' corresponds to the proportion of events in a sample that is positive for a marker, and includes single positive events, double positive events, triple positive event etc, and corresponds to a univariate marker histogram. Then it is possible to discern whether a marker is positive more or less frequently than average. By investigating the logratios, the absolute number of events is removed which can be problematic, e.g. when comparing IFNg to IL4/IL5. The final column of the matrix consists of the logratio of frequency of unstained cells in each sample and the average frequency of unstained cells.

The matrix summarises how markers are coexpressed in a sample. It is a bulk measurement that is averaged over all events in a sample. It contains no explicit information about coexpression on the single cell level. However, as the total frequency of each marker is a mixture of single positive and double positive frequencies, it is hidden as a latent process.

The first impression of this bulk matrix is gained by plotting a heatmap of the average logratios within each stimulation.

tot_pos <- apply(burt_all[, 1:7], 2, function(y) by(burt_all$counts, y, sum)[2]/length(burt))

summary_mtx <- 
  t(
sapply(1:8, function(i) {
tmp <- burt[sample_fac$stim == i]
tmp1 <- sapply(tmp, function(.x) apply(.x[, 1:7], 2, function(y) by(.x$counts, y, sum)[2]))
tmp1[is.na(tmp1)] <- 0
rowMeans(tmp1)
}
)
)
colnames(summary_mtx) <- Markers[-8]
rownames(summary_mtx) <- stim
summary_mtx <- log(sweep(summary_mtx, 2, tot_pos, '/'))

pheatmap(summary_mtx)

Between-component analysis on bulk frequency matrix

Perform a between-component analysis (BCA) on the bulk matrix. This is to visualise the relationship between stimulation centres - there could be addtional correlation within each stimulation that makes this relationship less clear. We proceed with a descriptive/exploratory analysis.

sample_coexp <- make_sample_coexp(burt)

L <- log(sample_coexp)
RM <- do.call(rbind, replicate(8,  by(L, sample_fac$donor, colMeans)))
L <- L - RM
colnames(L) <- c(Markers[-8], 'unst')
#rownames(L) <- stim[sample_fac$stim]

pca <- dudi.pca(L, scan = FALSE, nf = 8)
betw <- bca(pca, factor(stim[sample_fac$stim]), scan = FALSE, nf = 6)

# export figure for paper
#pdf('Figures/bulk_bca.pdf', height = 6, width = 10)
par(mfrow = c(1, 2))
s.class(betw$ls, factor(stim[sample_fac$stim]), col = sample_fac$stim)
s.arrow(betw$c1, clabel = 0.6)
#dev.off()

Singular value decomposition of bulk frequency matrix

Dissect pathogen specific contributions in a systematic way using singular value decomposition.

The first split of markers is {IL4, IFNg, IL5, GMCSF, IL10} vs. {unst, IL22, IL17}. These two groups are roughly diametrically opposed to each, i.e when a donor has high counts in one group it is likely (but not certain) to have low counts in the other group.

We can use the ranks of the donor loadings to understand how they contribute to the marker groups. Ecoli and influenza are most opposed. This means, when one pathogen has high counts in one marker group, the other pathogen is likely to have low counts (and vice versa for the other marker group). In this dataset, E.coli overall has low counts in group1 and high counts in group 2. Influenza has high counts in group 1 and low counts in group 2.

Furthermore, CMV joins influenza and candida joins E.coli, making this direction a separation between gut and virus.

L_1 <- scale(L)
rownames(L_1) <- stim[sample_fac$stim]
S <- svd(L_1)
S1 <- svd_residuals(L_1, 1)
S2 <- svd_residuals(L_1, 1:2)

# export figure
# png('Figures/bulk_heatmap.png', height = 400, width = 700)
pheatmap(L_1[order(S$u[, 1]), order(S$v[, 1])], cluster_cols = FALSE, cluster_rows = FALSE, main = 'component 1')
# dev.off()

o <- order(by(S$u[, 1], factor(sample_fac$stim), median))
# export figure
# pdf('Figures/bulk_boxplot_1st.pdf')
par(mfrow = c(1, 1))
boxplot(S$u[, 1] ~ factor(stim[sample_fac$stim], levels = stim[o], ordered = TRUE), ylab = 'loading', main = 'Bulk frequency 1st component' )
# dev.off()

The first singular vectors are only an approximation, now we look for the next strongest patterns. The largest deviation from the first approximation is given by {IFNg, GMCSF} vs. {IL17, IL4, unst, IL5, IL10}. IL22 is indeterminate.

Now the three lung pathogens dominate one end of this direction, by having higher counts in group 2. At the other end, Ecoli and CMV come together, seemingly contradicting the first approximation. But what this means is while CMV and E.coli are strongly separated by IL22 and IL17, and IL4, IL5, GMCSF, they deviate because of IFNg and IL10. In contrast, Influenza and candida are in the middle.

# export figure
# png('Figures/bulkheatmap2nd.png', height = 400, width = 700)
pheatmap(S1[order(S$u[, 2]), order(S$v[, 2])], cluster_cols = FALSE, cluster_rows = FALSE, main = 'Bulk frequency component 2')
# dev.off()

o <- order(by(S$u[, 2], factor(sample_fac$stim), median))
# export figure
# pdf('Figures/bulk_boxplot_2nd.pdf')
par(mfrow = c(1, 1))
boxplot(S$u[, 2] ~ factor(stim[sample_fac$stim], levels = stim[o], ordered = TRUE), main = 'Bulk frequency component 2', ylab = 'loadings')
# dev.off()

Here, gut and lung are opposing each other because of GMCSF and unstained.

pheatmap(S2[order(S$u[, 3]), order(S$v[, 3])], cluster_cols = FALSE, cluster_rows = FALSE, main = 'Bulk frequency component 3')
o <- order(by(S$u[, 3], factor(sample_fac$stim), median))
par(mfrow = c(1, 1))
boxplot(rank(S$u[, 3]) ~ factor(stim[sample_fac$stim], levels = stim[o], ordered = TRUE) ,main = 'Bulk frequency component 3', ylab = 'loadings')
# ## Pathogen centric view
# 
# It is sometimes preferable to take a pathogen-centric view rather than a marker-centric view.
# It can be seen that the gut pathogens are very similar to each other, and birchgrass and mite are very similar. However, aspergillus is close to influenza than the other lung pathogens, and CMV and influenza aren't very close to each other. 
# 
# * IL22, IL17 are most frequently expressed in the gut pathogens. 
# * IL5 is most frequently expressed in the birchgrass/mite. 
# * IFNg is most frequently expressed in CMV. 
# * Influenza, aspergillus 'transition' between CMV and birchgrass/mite. GMCSF is most frequently expressed in influenza, IL4/IL10 are most expressed in aspergillus. 
# 
# 
# 
# 
# pheatmap(cor(t(summary_mtx)))
# 
# pca <- dudi.pca(t(summary_mtx), scannf = FALSE, nf = 4)
# par(mfrow = c(1, 2))
# s.class(pca$li, factor(c(Markers[1:7])))
# s.arrow(pca$c1, clabel = 0.7)

Marker correlation network

The correlation between marker frequencies can alternatively be represented as network if you like that sort of thing. It has basically the same information as the righthand plot above. Orange indicates positive correlation, blue represents negative correlation. The thickness of the line indicates the strength of association.

pc <- glasso(cor(L), rho = 0.1)$wi
colnames(pc) <- colnames(L)
rownames(pc) <- colnames(L)

G <- graph_from_adjacency_matrix(pc, mode = 'undirected', weighted = TRUE, diag = FALSE)
E(G)$col <- 1
E(G)$col[E(G)$weight > 0] <- 2
E(G)$width <- abs(E(G)$weight)
plot(G, edge.color = E(G)$col, edge.width = E(G)$width*15)

Differences between pathogen type: gut, lung and virus

Boxplot and kruskal-wallis test (p value displayed in header of each plot) of each marker with respect to lung, gut and virus stimulation types. Not only does it confirm correlation structure above (which is calculated directly on the eight stimulation types) but it demonstrates a general pattern for the three stimulation types.

The kruskal-wallis test is to determine a difference in medians. IL10 and IL5 don't appear to have any median differences, but it can be observed that the stimulation types have different ranges. For example, for IL10, gut has a greater number of lower observations whereas lung has a greater number of higher observations.

o <- c(8, 1, 6, 4, 3, 5, 2, 7)
par(mfrow = c(2, 4))
for (i in 1:8){
  pval <- signif(kruskal.test(L[, o[i]], factor(stim_type))$p.value, 2)
  boxplot(L[, o[i]] ~ factor(stim_type), main = paste(colnames(L)[o[i]], pval), las = 2)
  #axis(1, at = 1:3, labels = FALSE)
  #text(1:3, par('usr')[3] - 0.2, labels = c('gut', 'lung', 'virus'), srt = 45, pos = 2, xpd = TRUE)
          }

It is also possible to look the BCA of the three categories gut, lung and virus.

pca_gut <- dudi.pca(L[stim_type == 'gut', ], scan = FALSE, nf = 8)
betw_gut <- bca(pca_gut, factor(stim[sample_fac$stim][stim_type == 'gut']), scan = FALSE, nf = 4)
par(mfrow = c(1, 2))
s.class(betw_gut$ls, factor(stim[sample_fac$stim][stim_type == 'gut']))
s.arrow(betw_gut$c1)

par(mfrow = c(2, 4))
for (i in 1:8){
  pval <- signif(kruskal.test(L[stim_type == 'gut', i], sample_fac$stim[stim_type == 'gut'])$p.value, 2)
  boxplot(L[stim_type == 'gut',i] ~ factor(sample_fac$stim[stim_type == 'gut']), main = paste(colnames(L)[i], pval), xaxt = 'n')
  axis(1, 1:3, c('can', 'ecoli', 'clep'), las = 2)
}
pca_lung <- dudi.pca(L[stim_type == 'lung', ], scan = FALSE, nf = 8)
betw_lung <- bca(pca_lung, factor(stim[sample_fac$stim][stim_type == 'lung']), scan = FALSE, nf = 4)
par(mfrow = c(1, 2))
s.class(betw_lung$ls, factor(stim[sample_fac$stim][stim_type == 'lung']))
s.arrow(betw_lung$c1)


par(mfrow = c(2, 4))
for (i in 1:8){
  pval <- signif(kruskal.test(L[stim_type == 'lung', i], sample_fac$stim[stim_type == 'lung'])$p.value, 2)
  boxplot(L[stim_type == 'lung',i] ~ factor(sample_fac$stim[stim_type == 'lung']), main = paste(colnames(L)[i], pval), xaxt = 'n')
  axis(1, 1:3, c('asp', 'mite', 'bgr'), las = 2)
}
pca_virus <- dudi.pca(L[stim_type == 'virus', ], scan = FALSE, nf = 8)
par(mfrow = c(1, 2))
s.class(pca_virus$li, factor(stim[sample_fac$stim][stim_type == 'virus']))
s.arrow(pca_virus$c1)

par(mfrow = c(2, 4))
for (i in 1:8){
  pval <- signif(kruskal.test(L[stim_type == 'virus', i], sample_fac$stim[stim_type == 'virus'])$p.value, 2)
  boxplot(L[stim_type == 'virus',i] ~ factor(sample_fac$stim[stim_type == 'virus']), main = paste(colnames(L)[i], pval), xaxt = 'n')
  axis(1, 1:2, c('cmv', 'inf'), las = 2)
}

Single cell coexpression

Normalised to average number over all samples

A matrix is now constructed with entries consisting of the logratio of the double positive frequency in each sample and the average double positive frequency, for each pair of markers. The double positive frequency potentially includes events which are triple positive, quadruple positive etc. Some combinations are very rare, so if the count is zero in more than 25% of samples, the combination is removed. This matrix is now gives insight to the coexpression at the single cell level.

cell_coexp <- make_cell_coexp(burt, Markers[1:7])
twoD_marginal <- make_twoD_marginal(burt, Markers[1:7])


L2 <- log(cell_coexp)
RM <- do.call(rbind, replicate(8,  by(L2, sample_fac$donor, colMeans)))
L2 <- L2 - RM
#rownames(L2) <- stim[sample_fac$stim]

missing_vals <- apply(L2, 2, function(x) length(which(x == 0)))/length(burt)

Connection between double positive frequencies and bulk frequencies, i.e. cell-level coexpression and sample-level coexpression

A 'null hypothesis' could be that bulk frequencies drive the double positive frequencies. Specifically, it could be expected that markers which are coexpressed at the sample level might also be coexpressed at the single cell level. What are the implications for exceptions to this hypothesis? Under what conditions is a cell subset double positive for two markers, when in general these markers aren't coexpressed in the sample?

To investigate this, for each marker pair, first regress the double positive frequencies on the corresponding pair of bulk frequencies. For example, regress the IL17IL22 double positive frequencies on both IL17 bulk frequencies and IL22 bulk frequencies simultaneously. Then plot the $R^2$ value of this regression against the absolute value of correlation between the bulk frequencies (e.g. correlation between IL17 bulk frequencies and IL22 bulk frequencies). Correlation which is originally negative is plotted in black.

It can be seen that only 3 pairs are well-explained by their bulk frequencies: IL17-IL22, IFNg-GMCSF, and IL4-IL5. These pairs correspond to the Thelper subsets. This is potentially a valuable method for filtering marker combinations when there is a larger number of markers and/or uncharacterised markers.

rsq <- make_rsq(L, L2, Markers[1:7])

# pdf('Figures/rsq_cor.pdf')
par(mfrow = c(1, 1))
plot(abs(rsq$cor)[missing_vals < 0.25], rsq$rsq[missing_vals < 0.25], pch = 16, cex = 1,  type = 'n', xlab = '|correlation coefficient|', ylab = 'R square')
text(abs(rsq$cor)[missing_vals < 0.25], rsq$rsq[missing_vals < 0.25], names(rsq$cor)[missing_vals < 0.25], col = sign(rsq$cor)[missing_vals < 0.25] +2)
abline(lm(rsq$rsq[missing_vals < 0.25] ~ abs(rsq$cor[missing_vals < 0.25])))
# dev.off()

BCA of double positive frequency matrix

Perform a BCA on the double positive frequency matrix. The three categories gut, lung and virus don't necessarily group together, in particular, CMV and influenza have been separated.

Visually inspecting the correlation between all double positive frequencies, there are two strong groups and one weaker group.

L2 <- L2[, missing_vals < 0.25]
pca <- dudi.pca(L2, scan = FALSE, nf = 8)
betw <- bca(pca, factor(stim[sample_fac$stim]), scan = FALSE, nf = 3)
par(mfrow = c(1, 2))
s.class(betw$ls, factor(stim[sample_fac$stim]), col = sample_fac$stim)
s.arrow(betw$c1, clabel = 0.5)

Cross correlation between bulk frequencies and double positive frequencies

It is of interest to know the connection between double positive frequencies and bulk frequencies. In particular, even though there are a large number of possible double positive frequencies, they are constrained to 4 processes in this particular dataset. The double positive groups are also strongly associated with the bulk frequencies in a very constrained way.

#C <- cor(cbind(L, L2))[1:8, 9:22]
C <- make_cross_correlation(L, L2)

S <- svd(C)
S1 <- svd_residuals(C, 1)
# png('Figures/cross_cor_1.png', height = 400, width = 700)
pheatmap(C[order(S$u[, 1]), order(S$v[, 1])], cluster_cols = FALSE, cluster_rows = FALSE, main = 'cross correlation, component 1')
# dev.off()

# png('Figures/cross_cor_2.png', height = 700, width= 1000)
pheatmap(S1[order(S$u[, 2]), order(S$v[, 2])], cluster_cols = FALSE, cluster_rows = FALSE, main = 'cross correlation, component 2')
# dev.off()

These three groups of double positive combinations should be fairly homogenous and therefore very well described by a single principal component - i.e. it should be possible to give the stimulations a straightforward ordering to describe the relative frequencies.

Group 1

L2_1 <- scale(L2[, c(3, 13, 12, 4, 9, 2, 10)])
rownames(L2_1) <- stim[sample_fac$stim]
S <- svd(L2_1)
S1 <- svd_residuals(L2_1, 1)

# pca <- dudi.pca(L2[, c(3, 13, 12, 4, 9, 2, 10)], scan = FALSE, nf = 8)
# betw <- bca(pca, factor(stim[sample_fac$stim]), scan = FALSE, nf = 3)
# par(mfrow = c(1, 2))
# s.class(betw$ls, factor(stim[sample_fac$stim]), col = sample_fac$stim)
# s.arrow(betw$c1, clabel = 0.5)



# png('Figures/doublepos_1_comp1.png', height = 400, width = 700)
pheatmap(L2_1[order(S$u[, 1]), order(S$v[, 1])], cluster_cols = FALSE, cluster_rows = FALSE, main = 'component 1')
# dev.off()

o <- order(by(S$u[, 1], factor(sample_fac$stim), median))
# pdf('Figures/double_1_boxplot_comp1.pdf')
par(mfrow = c(1, 1))
boxplot(S$u[, 1] ~ factor(stim[sample_fac$stim], levels = stim[o], ordered = TRUE) , ylab = 'loadings', main = 'component 1')
# dev.off()


# png('Figures/doublepos_1_comp2.png', height = 400, width = 700)
pheatmap(S1[order(S$u[, 2]), order(S$v[, 2])], cluster_cols = FALSE, cluster_rows = FALSE, main = 'component 2')
# dev.off()

o <- order(by(S$u[, 2], factor(sample_fac$stim), median))
# pdf('Figures/double_1_boxplot_comp2.pdf')
par(mfrow = c(1, 1))
boxplot(S$u[, 2] ~ factor(stim[sample_fac$stim], levels = stim[o], ordered = TRUE) , ylab = 'loadings', main = 'component 2')
# dev.off()

group 2

# pca <- dudi.pca(L2[, c(8, 7, 14)], scan = FALSE, nf = 8)
# betw <- bca(pca, factor(stim[sample_fac$stim]), scan = FALSE, nf = 3)
# par(mfrow = c(1, 2))
# s.class(betw$ls, factor(stim[sample_fac$stim]), col = sample_fac$stim)
# s.arrow(betw$c1, clabel = 0.5)

L2_1 <- scale(L2[,  c(8, 7, 14)])
rownames(L2_1) <- stim[sample_fac$stim]
S <- svd(L2_1)
S1 <- svd_residuals(L2_1, 1)

# png('Figures/doublepos_2_heatmap_comp1.png', height = 400, width = 700)
pheatmap(L2_1[order(S$u[, 1]), order(S$v[, 1])], cluster_cols = FALSE, cluster_rows = FALSE, main = 'component 1')
# dev.off()

o <- order(by(S$u[, 1], factor(sample_fac$stim), median))
# pdf("Figures/doublepos_2_boxplot_comp1.pdf")
par(mfrow = c(1, 1))
boxplot(S$u[, 1] ~ factor(stim[sample_fac$stim], levels = stim[o], ordered = TRUE), main = 'component 1', ylab = 'loadings')
# dev.off()



pheatmap(S1[order(S$u[, 2]), order(S$v[, 2])], cluster_cols = FALSE, cluster_rows = FALSE, main = 'component 2')


o <- order(by(S$u[, 2], factor(sample_fac$stim), median))
par(mfrow = c(1, 1))
boxplot(S$u[, 2] ~ factor(stim[sample_fac$stim], levels = stim[o], ordered = TRUE) , main = 'component 2', ylab = 'loadings')

group 3

# pca <- dudi.pca(L2[, c(11, 6, 5, 1)], scan = FALSE, nf = 8)
# betw <- bca(pca, factor(stim[sample_fac$stim]), scan = FALSE, nf = 3)
# par(mfrow = c(1, 2))
# s.class(betw$ls, factor(stim[sample_fac$stim]), col = sample_fac$stim)
# s.arrow(betw$c1, clabel = 0.5)

L2_1 <- scale(L2[,  c(11, 6, 5, 1)])
rownames(L2_1) <- stim[sample_fac$stim]
S <- svd(L2_1)
S1 <- svd_residuals(L2_1, 1)

#png("Figures/doublepos_3_heatmap_comp1.png", height = 400, width = 700)
pheatmap(L2_1[order(S$u[, 1]), order(S$v[, 1])], cluster_cols = FALSE, cluster_rows = FALSE, main = 'component 1')
#dev.off()

o <- order(by(S$u[, 1], factor(sample_fac$stim), median))
# pdf("Figures/doublepos_3_boxplot_comp1.pdf")
par(mfrow = c(1, 1))
boxplot(S$u[, 1] ~ factor(stim[sample_fac$stim], levels = stim[o], ordered = TRUE), main = 'component 1', ylab = 'loadings' )
# dev.off()


pheatmap(S1[order(S$u[, 2]), order(S$v[, 2])], cluster_cols = FALSE, cluster_rows = FALSE, main = 'component 2')

o <- order(by(S$u[, 2], factor(sample_fac$stim), median))
par(mfrow = c(1, 1))
boxplot(S$u[, 2] ~ factor(stim[sample_fac$stim], levels = stim[o], ordered = TRUE) , main = 'component 2', ylab = 'loadings')
# # Single Cell coexpression
# 
# ## Normalised to expected number given total counts within sample
# 
# We can make another double positive matrix, taking into account the contingency table structure. Suppose in one sample, there are 10% IL17 cells, and 12% IL17 cells. If the cells are double positive at random (i.e. no association between markers), then there would be (0.1)(0.12) = 1.2% cells. In fact, if there are actually 5% cells, then the value for that sample in the IL17-IL22 column would be $log(5/1.2, 2) = 2.06$. This means there are 4 times the number of IL17-IL22 cells than expected, given the total IL17 and IL22 frequencies. 
# 
# First plot BCA. 
# 
# 
# L3 <- log(twoD_marginal)
# RM <- do.call(rbind, replicate(8,  by(L3, sample_fac$donor, colMeans)))
# L3 <- L3 - RM
# L3 <- L3[, missing_vals < 0.25]
# 
# 
# 
# pca <- dudi.pca(L3, scan = FALSE, nf = 8)
# betw <- bca(pca, factor(stim[sample_fac$stim]), scan = FALSE, nf = 3)
# 
# par(mfrow = c(1, 2))
# s.class(betw$ls, factor(stim[sample_fac$stim]), col = sample_fac$stim)
# s.arrow(betw$c1, clabel = 0.5)
# 
# Boxplot and kruskal-wallis test for three stimulation types gut, lung and virus. 
# 
# par(mfrow = c(3, 5))
# for (i in 1:14){
#   pval <- signif(kruskal.test(L3[, i], factor(stim_type))$p.value, 2)
#   boxplot(L3[, i] ~ factor(stim_type), main = paste(colnames(L3)[i], pval), las = 2)
#           }
# Turn straight to the cross-correlation matrix
# 
# * groups are less prominent than in cell coexpression matrix above
# 
# * IL17-IL22 is negatively correlated with IL22, IL17 and unstained. This means, that the fewer IL22, IL17 and unstained cells that are produced, the more preferentially they will be coexpressed (i.e. IL22 and IL17 production is more preferentially directed to being coexpressed and not coexpressed with other markers). Conversely, the more total IL22 and IL17 that is being produced, there is more 'freedom' to combine with other markers. 
# 
# * IFNg-IL19, IL22-IL10 and IL4-IL10 are weakly following the lead of IL17-IL22. 
# 
# * IFNg-GMCSF is negatively correlated with GMCSF and IFNg, and positively correlated with unstained cells. This means, the less total IFNg and GMCSF being produced, the more preferentially they will be coexpressed with each and not with other markers (and vice versa). 
# 
# * IFNg-IL4, IFNg-IL22, GMCSF-IL22, IL17-GMCSF, IL17-IL10, IL17-IFNg are  following the lead of IFNg-GMCSF. 
# 
# * There is another small group: IL5-IL4, IL4-GMCSF, GMCSF-IL10 that is slightly more correlated with IL10, IL4 and IL5 than the group above. 
# 
# # cross correlation matrix
# #C <- cor(cbind(L, L3))[1:8, 9:22]
# C <- make_cross_correlation(L, L3)
# pheatmap(C)
# 
# pheatmap(cor(C))


kristenmf/flowR documentation built on May 20, 2019, 6:15 p.m.