In AllenInstitute/mfishtools: Building Gene Sets and Mapping mFISH Data

This code reads in all of the data for an example mouse SST mFISH experiment and compares it against a reference FACs data set. In this case we have very limited prior knowledge, so I am open to suggestions for how to check whether the results look reasonable..

Workspace set-up

Install the necessary packages. In this case we are using data from tasic2016data and plotting functions from scrattch.vis.

install.packages("devtools")
devtools::install_github("AllenInstitute/tasic2016data") # For our data example

Load libraries.

suppressPackageStartupMessages({
  library(mfishtools)    # This library!
  library(matrixStats)   # For rowMedians function, which is fast
  library(gplots)        # For some of the plots
  library(ggplot2)       # For other plots
  library(tasic2016data) # For the data
})
options(stringsAsFactors = FALSE)  # IMPORTANT
print("Libraries loaded.")

Read in the reference data (in this case we will use the Tasic 2016 data, which includes ~1800 cells from mouse primary visual cortex).

annotations <- tasic_2016_anno
#counts      <- tasic_2016_counts  # uncomment if using CPM below
rpkm        <- tasic_2016_rpkm
annotations <- annotations[match(colnames(rpkm),annotations$sample_name),]  # Put them in the correct order

Read in the mFISH data. An example data set is provided as part of mfishtools, and was loaded with the library. In this case, fishData is a matrix of gene expression levels (e.g., spot counts within a cell) for a given cell, with genes as rows and cells as columns. This is the same format as the RNA-seq data.

dim(fishData)

The metadata variable is a data frame with some specific requirements on column names.

colnames(metadata)

The required column names are as follows (and other column names are perfectly fine):
- area = Area/volume of the cell (or just set to a constant)
- experiment = Name of the experiment or experiments (or just set ot a constant)
- layerData = Numeric call for the layer. Not requred, but useful for plotting an rotating x, y
- x = X coordinate for cell (ideally this is the lateral coordinate)
- y = Y coordinate for cell (ideally this is the laminar coordinate)

Data preparations

This analysis will only be looking at marker genes for GABAergic neurons, so we need to only consider cells mapping to GABAergic types. We also define some convenient cluster info variables here.

clusterType = annotations$broad_type 
includeClas = "GABA-ergic Neuron"  # In this analysis, we are only considering interneurons
excludeClas = sort(setdiff(clusterType,includeClas))
kpSamp      = !is.element(clusterType,excludeClas)
anno        = annotations[kpSamp,]
cl          = annotations$primary_type_label
names(cl)   = annotations$sample_name
kpClust     = sort(unique(cl[kpSamp]))

Convert the data to log2(rpkm). NOTE: we often use counts per million of introns + exons when performing this analysis. Currently, we don't know which method produces more reliable markers. Alternative code for calculating cpm is commented out below.

normDat = log2(rpkm+1)
#sf      = colSums(counts)/10^6
#cpms    = t(t(counts)/sf)
#normDat = log2(cpms+1)
print("Data normalized!")

Calculate proportions, means, and medians. These are all used later for various reasons. One important thing to note is that we are using cl here as a vector of cell type calls for the RNA-seq data (with names corresponding to the column/sample names of the RNA-seq sample data); however, if you wanted to map to a more coarse definition of cell types, or to omit certain cell types, this would be the step to do it. The columns of these summary values calculated here define the cell types that mFISH data will be mapped against for the remainder of the vignette.

exprThresh = 1
medianExpr = do.call("cbind", tapply(names(cl), cl, function(x) rowMedians(normDat[,x]))) 
meanExpr   = do.call("cbind", tapply(names(cl), cl, function(x) rowMeans(normDat[,x]))) 
propExpr   = do.call("cbind", tapply(names(cl), cl, function(x) rowMeans(normDat[,x]>exprThresh))) 
rownames(medianExpr) <- rownames(propExpr) <- rownames(meanExpr) <- genes <- rownames(normDat)  
print("Summary values calculated!")

Consider only genes included in the mFISH experiment that are also present in the RNA-seq reference data set (usually this is all of them).

useGenes <- intersect(rownames(fishData),genes)  # Define genes to be used in the analysis
fishDat  <- fishData[useGenes,]    # Separate out the data from the metadata
print("mFISH data is ready!")

Map the mFISH data!

Do the mapping using the parameters defined below. Currently these are all manually selected and a bit of a guess. I am open to suggestions for how to do this mapping in a more systematic way. The idea behind this method is to attach various filtering and scaling strategies to the mFISH and RNA-seq data sets, and then use correlation-based mapping to find the best fitting cell cluster. My expectation (still to be tested) is that this strategy will be useful for smaller gene panels, but that other more computational-based strategies will be most effective for larger gene panels.

qprob    <- 0.9     # Parameter for scaling mFISH to FACS
thresh   <- 3       # Set counts less than or equal to thresh to 0          1
log2p1   <- function(x) return(log2(x+1))  # log transform function
binarize <- FALSE   # Should the data be binarized?
weights  <- NULL    # Integer weights.  SET TO NULL IF YOU DON'T KNOW WHAT YOU ARE DOING!
#weights  <- round(rowSums(fishDat)/min(rowSums(fishDat)))  # Here is how to weight roughly by average expression level

fishMouse <- fishScaleAndMap(mapDat=fishDat, refSummaryDat=medianExpr, 
  mappingFunction = cellToClusterMapping_byCor, transform = log2p1, noiselevel = thresh, 
  genesToMap = useGenes, metadata = metadata, qprob=qprob, binarize=binarize,
  omitGenes = NULL,integerWeights=weights)
print("Data is mapped.")

View the mapping results

First rotate the X and Y coordinate space to that layer 2 is parallel to the X axis. This won't be perfect since the tissue is not perfectly linear, but it will make for easier viewing of the data. This requires some meta-data variable which marks a subset of genes roughly along align (e.g., cortical layer calls).

rotateAxis <- fishMouse$metadata$layerData==4
flipVector <- fishMouse$metadata$layerData
for (e in unique(fishMouse$metadata$experiment)){
  subset  <- fishMouse$metadata$experiment==e
  fishMouse <- rotateXY(fishMouse,rotateAxis,flipVector,subset)
}

Now plot the location of all of the cells in the tissue section.

sc  <- function(n,...)  return(c("brown","pink","orange","turquoise","blue","green")[1:n])  # Standard colors without yellow
lay <- as.character(fishMouse$metadata$layerData)
plotDistributions(fishMouse,group = "experiment", xlab="Mouse - All cells", ylab="Layer",
                  colors=lay, colormap=sc,maxrow=8,cex=1)

Show the cell distribution across all types in the tissue.

allClusts  <- colnames(medianExpr)
plotDistributions(fishMouse,group = "Class", groups = allClusts, colors = lay, pch=lay, xlab="Mouse", 
                  ylab="Layer",colormap=sc,maxrow=8)

There are some broad observations that suggest we are not too far off. First, nearly all of the cells map to inhibitory types as expected. Second, many of the cell types have layer signatures, with SST/PVALB types more likely to be in deep layers and other inhibitory types more likely to be in upper layers.

Now let's plot the heatmap to see how the data looks when clustered along the tree like this. First, unscaled data, capped at 10 counts.

cap = 10
colorset = c("darkblue", "dodgerblue", "gray80", "orange", "orangered")
heat_colors <- colorRampPalette(colorset)(1001)
plotHeatmap(fishMouse,main="Mouse cells (unscaled, cap=10)",group="Class",groups=allClusts,capValue=cap,
                             colormap=heat_colors,rowsep=NULL,dendrogram="row",Rowv = TRUE,margins = c(8,6))

Next, let's plot the scaled heatmaps for comparison. This is the data that is used in the mapping.

cap = 8
colorset = c("darkblue", "dodgerblue", "gray80", "orange", "orangered")
heat_colors <- colorRampPalette(colorset)(1001)
plotHeatmap(fishMouse,main="Mouse cells (scaled, cap=8)",group="Class",groups=allClusts,capValue=cap,colormap=heat_colors,
                             rowsep=NULL,dendrogram="row",Rowv = TRUE,margins = c(8,6),useScaled=TRUE)

A few genes have quite low expression and probably should be omitted from future experiments. Otherwise, by eye things look reasonable because there do appear to be distinct expression patterns across clusters (although it is really hard to tell by eye, and some of the cells in different blocks seem like they should be grouped together).

Quantitative sanity check

So far we have focused on getting the results and trying to determine agreement with expections based on resulting plots. The results seem mixed so far. We now want to do what I am calling a quantitative sanity check, where we compare results obtained by different computational methods, or between RNA-Seq and mFISH, or taking prior knowledge into consideration. Ideally we can build a mapping alorithm that adjusts parameters to try and optimize the results based on priors (or something to this effect), but for now we want to have quick ways of seeing what looks right and what looks wrong to help us make adjustments.

First, let's see whether the proportion of cells identified in RNA-seq and mFISH agree. Note that we would only expect this to be the case when we have unbiased surveyed in both modalities, which is not the case here.

countFish9 <- table(fishMouse$mappingResults$Class)
countFish9 <- countFish9[countFish9>0]
countSeq  <- table(anno$primary_type_label)
fs <- intersect(names(countFish9),names(countSeq))
plot(as.numeric(countSeq[fs]),as.numeric(countFish9[fs]),main="Sst types",
  pch=19,col="grey",ylab="FISH cell count",xlab="RNA-seq cell count",xlim=c(0,max(as.numeric(countSeq[fs]))*1.1))
text(as.numeric(countSeq[fs]),as.numeric(countFish9[fs]),fs,cex=0.7,srt=0)

There is not great agreement between modalities. What if we wrap up by class?

# Second, broad class
val = list(fishMouse$mappingResults$Class,anno$primary_type_label)
for (i in 1:2){
  val[[i]] <- as.character(lapply(val[[i]], function(x) strsplit(x," ")[[1]][1]))  # Get the class within interneurons
}
countFish <- table(val[[1]])
countSeq  <- table(val[[2]])
fs <- intersect(val[[1]],val[[2]])
plot(as.numeric(countSeq[fs]),as.numeric(countFish[fs]),main="Broad classes",
  pch=19,col="grey",ylab="FISH cell count",xlab="RNA-seq cell count",
  xlim=c(0,max(as.numeric(countSeq[fs]))*1.1),ylim=c(0,max(as.numeric(countFish[fs]))*1.1))
text(as.numeric(countSeq[fs]),as.numeric(countFish[fs]),fs,cex=0.7,srt=0)

Wrapping up by class, there is much better agreement between proportions across modalities. Whiel this is not necessarily something we'd expect to find, it is still useful to note.

One possibility (that I think is likely the case for at least a subset of cells) is that there is some mapping issues. We can try and understand why cells are mapping specific ways by plotting expression levels between RNA-seq and mFISH for each cell type. Let's do that.

meanFish    <- summarizeMatrix(fishMouse$scaleDat,fishMouse$mappingResults$Class,summaryFunction = mean)[useGenes,]
meanFish    <- meanFish[,intersect(colnames(meanFish),colnames(medianExpr))]
medSeq      <- meanExpr[useGenes,colnames(meanFish)]
corFS       <- NULL
par(mfrow=c(4,7))
for (ct in intersect(allClusts,colnames(medSeq))){
  corMR <- signif(cor(medSeq[,ct],meanFish[,ct]),3)
  plot(medSeq[,ct],meanFish[,ct],xlab="RNA-seq",ylab=ct,pch=19,col="white", main=paste("R =",corMR))
  text(medSeq[,ct],meanFish[,ct],rownames(medSeq),cex=0.9)
  corFS <- c(corFS,cor(medSeq[,ct],meanFish[,ct]))
}
names(corFS) <- intersect(allClusts,colnames(medSeq))

Overall, there is quite good agreement for most, but not all, of these cell types. Similarly, this type of plot can give us a sense of which cell types we can be confident in and which ones we can't, although the exact relationship between confidence of mapping and mapping accuracy remains TBD.

Now make a TSNE plot on all genes using the mFISH data. Do the data cluster by color, and show the points also corresponding to cluster (abbreviated).

cap=10
fishPlot <- fishMouse 
fishPlot$mappingResults$Broad <- val[[1]]
fishPlot <- filterCells(fishPlot,fishPlot$mappingResults$Class!="none")
p=plotTsne(fishPlot,main="Mouse cells (scaled)",colorGroup="Broad",labelGroup = "Broad",
           capValue=cap, useScaled=TRUE, perplexity = 10, maxNchar=5)
p

Overall it seems that most cells map reasonably well by broad class calls, although it looks like there are a few errors.

Finally, repeat but with the points corresponding to layer.

p=plotTsne(fishPlot,main="Mouse Sst cells (scaled)",colorGroup="Class",labelGroup = "layerData",
           capValue=cap, useScaled=TRUE, perplexity = 10)
p