Introduction

The MEMA package provides functions to preprocess, QA, analyze and explore data from the National Institute of Health's MEP-LINCS project. MEP-LINCS uses immunofluorescent imaging to interogate cellular responses to different microenvironments and drugs. A microenvironment perturbation or MEP is a combination of a spot of insoluble extracellular matrix proteins with media containing growth factors, ctyokines, ligands and drugs.

This vignette is based on the raw data from the MEP-LINCS experiment that uses Human Epithelial Mammary Cells (HMEC) from a healthy 19 year old woman, cell line HMEC240L. The vignette loads one 8 well plate of raw data and its metadata from the Synapse website, preprocesses them for downstream analysis, summarizes them to the spot level and displays some Quality Assessment figures.

Setup

We start with loading the necessary packages.

library(MEMA)
library(synapseClient)
library(parallel)
library(stringr)
suppressPackageStartupMessages(library(dplyr))
library(ggplot2)

Synpase Login

The first step is to provide login credentials to Synapse from a configuration file previously stored in the default location “~/.synapseConfig”. An alternative is to provide a Synapse username and password as arguments in this command. More information about logging into and using Synapse is available at http://docs.synapse.org/articles/getting_started.html#logging-into-synapse.

suppressMessages(synapseLogin())

Download Metadata

Each plate in the dataset has a unique ID that is encoded in a barcode label attached to the plate. We process each plate independently enabling sequential processing on smaller systems and parallel processing on larger systems.

We next define a query that uses file annotations to locate the plate's metadata on Synapse. We run the query and use the results to download the metadata to a local file.

barcode <- "LI8X00651"
metadataq <- sprintf("select id from syn9612057 WHERE DataType='Metadata' AND Barcode='%s'",barcode)
metadataTable <- synTableQuery(metadataq, verbose=FALSE)@values
metadataFiles <- lapply(metadataTable$id, synGet)
metadataFiles <- list(annotMetadata=getFileLocation(metadataFiles[[1]]))

Next we read and process the metadata.

metadata <- getMetadata(metadataFiles, useAnnotMetadata = TRUE)

The metadata is stored in a data.table where each row is a spot in the plate. The metadata columns are features that describe the microenvironment at each spot, imaging information, position labels and other information that is useful in the downstream analysis. There are r nrow(metadata) spots and r ncol(metadata) features in the metadata. The head of some important columns are shown below.

head(metadata[,.(Barcode, Well, Spot, Ligand, ECMp, CellLine, StainingSet)])


Download Raw Data

In a similar fashion, we create a query to identify the raw data files on Synapse, run the query and use the results to download all of the raw data files. Annotations such as the barcode, data level, the file name and its local path are saved in the dataBWInfo data.frame.

q <- sprintf("select id,name,Barcode,Level,Well,StainingSet,Location from syn9838977 WHERE Level='0' AND Barcode='%s'", barcode)
rawFiles <- synTableQuery(q)
dataBWInfo <- rawFiles@values

# Download raw files, or get from cache if already downloaded
res <- lapply(dataBWInfo$id, synGet)

# Get the path on disk to each file
cellDataFilePaths <- unlist(lapply(res, getFileLocation))
dataBWInfo$Path <- cellDataFilePaths

dtL <- getCPData(dataBWInfo = dataBWInfo, verbose=FALSE)

The raw data is stored in a list of data.tables. Each data.table holds cell level data for one well. Each row is a cell and each column is a feature measured by the image segmentation pipeline or positional information that identifies the cell.

The head of some of the important data from the first well is shown below.

head(dtL[[1]][,.(Nuclei_CP_Intensity_MedianIntensity_Dapi,Nuclei_CP_AreaShape_Area)])

Filter Debris and Clusters

We next filter out rows that are small debris or large multicell clusters that did not segment properly.

dtL <- lapply(dtL, function(dt){
  filterObjects(dt,nuclearAreaThresh = 50, nuclearAreaHiThresh = 4000)})

Add Population Information

Each spot in a MEMA has the potential to grow a population of heterogeneous cells. The following operations add population level data for cell locations, median-centered intensities and neighborhood information that can be used to quantify the populatoin's heterogeneity.

# Add local XY and polar coordinates
dtL <- lapply(dtL, addPolarCoords)

# Add spot level normalizations for median intensities
dtL <- lapply(dtL,spotNormIntensities)

# Add adjacency values
dtL <- lapply(dtL, calcAdjacency)

#Clean up legacy issues in column names and some values
dtL <- lapply(dtL, cleanLegacyIssues)

Merge All Data and Metadata

The raw data is next merged with its metadata and data for all wells in the plate are stored in one cell level data.table.

cDT <- merge(rbindlist(dtL),metadata,by=c("Barcode","Well","Spot"))

Gate Cells

Depending on the staining set, cells can be gated and labeled. For instance, in unperturbed population the total DAPI histogram is bimodal with peaks at the G0/G1 and G2 populations. This signal is autogated by setting a threshold between the peaks and labeling cells as 2n or 4n DNA. If it is in the staining set, the EdU signal is also autogated and the cells above the threshold are labled as EdU positive.

# Gate cells where possible
cDT <- gateCells(cDT)

View DNA Distrbutions

The histogram of the total DAPI signal in the collagen 1 spots is used to assess the quality of the DAPI staining. In unperturbed cells, the disturbution is expected to have its largest peak at the DNA 2n population and a smaller peak at the 4n population.

dt <- cDT[grepl("COL1",cDT$ECMp),]
dt <- dt[,TotalDNANormed := Nuclei_CP_Intensity_IntegratedIntensity_Dapi/median(Nuclei_CP_Intensity_IntegratedIntensity_Dapi[Nuclei_PA_Cycle_State==1]), by="Barcode,Well"]
dt <- dt[,DNAThresh := min(TotalDNANormed[Nuclei_PA_Cycle_State==2]), by="Barcode,Well"]
dt <- dt[,Well_Ligand :=  paste(Well,Ligand, sep="\n")]
binwidth=.04

  p <- ggplot(dt, aes(x=TotalDNANormed))+
    geom_histogram(binwidth = binwidth)+
    coord_cartesian(x=c(0,4))+
    geom_vline(data = dt, aes(xintercept = DNAThresh), colour = "blue")+
    facet_wrap(~Well_Ligand, nrow=2) +
    ylab("Count")+xlab("Total DAPI Intensity Normalized to 2n")+ggtitle(paste("Total DAPI Intensity:",barcode))+
    theme(axis.text.x = element_text(angle = 0, vjust = 0, hjust=0.5, size=rel(.5)), axis.text.y = element_text(angle = 0, vjust = 0.5, hjust=1, size=rel(1)), plot.title = element_text(size = rel(.8)),legend.text=element_text(size = rel(.3)),legend.title=element_text(size = rel(.3)),strip.text.x = element_text(size = rel(.4)))
  suppressWarnings(print(p))

Level 1 Data

We now have the level 1 data where each row in the dataset is a cell. The first few rows of some selected features of the level 1 data are shown in the table below.

head(cDT[,.(ObjectNumber, Barcode, Well, Spot, MEP, Nuclei_CP_Intensity_IntegratedIntensity_Dapi)])

Count Cells at Each Spot

The next steps prepare for summarizing the cell level data to the spot level. First, the cells at each spot are counted.

cDT <- cDT[,Spot_PA_SpotCellCount := .N,by="Barcode,Well,Spot"]

Calculate Gated Populations

Signals can be gated into binary or multi-variate gates. MEMA data always includes DAPI staining of the DNA in the nuclei so this signal is always present. Other signals depend on the staining set. After gating, the proportion of cells in the gates are calculated and added to the data.

#Add proportions for signals with multivariate gating and non-conforming gate values
addSpotProportions(cDT)

#Calculate proportions for binary gated signals
gatedSignals <- grep("Proportion", grep("Positive|High",colnames(cDT), value=TRUE), value=TRUE, invert=TRUE)
if(length(gatedSignals)>0){
  proportions <- cDT[,lapply(.SD, calcProportion),by="Barcode,Well,Spot", .SDcols=gatedSignals]
  setnames(proportions,
           grep("Gated",colnames(proportions),value=TRUE),
           paste0(grep("Gated",colnames(proportions),value=TRUE),"Proportion"))
}

Summarize to Spot Level

The next step is to summarize the cell level data to the spot level. For some of the signals, the standard error of the mean is calculated so that error bars of the replicates can be displayed.

#Define the regex patterns for the signals that get standard error values.
seNames=c("DNA2N","SpotCellCount","EdU","MitoTracker","KRT","Lineage","Fibrillarin")

#median summarize the rest of the signals to the spot level
signals <- summarizeToSpot(cDT, seNames)

if(exists("proportions")) {
  spotDT <- merge(signals,proportions)
} else {
  spotDT <- signals
}

#Set a threshold for the loess well level QA Scores
if(sum(c("ArrayRow","ArrayColumn") %in% colnames(spotDT))==2) spotDT <- QASpotData(spotDT, lthresh = .6)

Level 2 Data

We now have the level 2 data where each row is data from one MEMA spot. The first few rows of some selected features of the level 2 data are shown in the table below.

head(spotDT[,.(Barcode, Well, Spot, MEP, Spot_PA_SpotCellCount)])

Quality Assessment

The variance of the signal in MEMA data comes from biological and technical factors. The technical factors create regions of low cell counts per spot and uneven staining across the array. The goal of the QA pipeline is to quantify the technical factors to identify wells or plates that need to be removed from downstream processing.

The hypothesis for the MEMA QA process is that the biological signal comes from individual spots while the technical variations come from regions of low signal. A bivariate loess model can be used to quantify the number of spots in low signal regions, leading to a MEMA QA score.

The loess model of a MEMA is the mean value of a weighted version of each spot's region or neighborhood. In a 700 spot array, a loess span value of 0.1 sets the size of the neighborhood to be the nearest 70 points (within approximately 5 spots in all directions). The weights are a tricubic function of the euclidean distance between the spot being modeled and the neighborhood spots. These weights vary from 1 to 0 as distances increase from the nearest to the farthest neighbor. In other words, each spot in the model takes on the mean value of its 70 nearest neighbors with the closest neighbors having the largest impact. Therefore, the loess model is dominated by the technical regional factors as opposed to individual biological responses.

A MEMA's QA score is derived from the loess model by calculating the proportion of spots in low signal regions(LSR). A threshold for classifying spots as LSR is based on the median of each well. To have higher scores reflect increasing quality, the MEMA QA score is defined as the proportion of non-LSR spots to total spots. This value will be 1 for MEMAs with no low signal regions and approach 0 as the number of LSR spots increases.

The LSR spots are those to the left of the blue vertical line in the histogram.

#Set a threshold for the loess well level QA Scores
spotDT <- QASpotData(spotDT, lthresh = .6)
dt <- spotDT[,Well_Ligand :=  paste(Well,Ligand, sep="\n")]

plotSCCHeatmapsQAHistograms(dt, barcode,lthresh = 0.6)

We now have spot level data with its metadata. In the MEP-LINCS Preprocessing Pipeline, this data is stored on Synapse or a local server as Level 2 data. The rest of the preprocessing and analysis is in the Preprocess-Level3and4-vignette.html in the MEMA package.

Session Info

This vignette was created with the following session info.

sessionInfo()


MEP-LINCS/MEMA documentation built on May 20, 2019, 2:47 p.m.