knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(ggplot2)
library(readr)
library(stringr)
library(pheatmap)
library(RColorBrewer)
library(cowplot)
library(magrittr)
devtools::load_all()

patchSeqQC

This package includes functions for assessing the quality of single-cell transcriptomic data using cell type-specific marker gene expression, based on our paper, Data-driven approaches for improving the interpretability of patch-seq data. All of the code and data for reproducing the figures from the manuscipt is available at the patchSeqQCManuscript repository.

These functions were initially tailored for data collected using the patch-seq methodology, but are applicable to other single-cell RNAseq datasets. This package is meant as a tutorial to illustrate the approach outlined in our paper, but please contact us with feature requests and bug reports.

Installation

Use devtools to install. Additional github packages needs to be installed for it work.

devtools::install_github('PavlidisLab/patchSeqQC')

This package requires the following packages:

install.packages('ggplot2')
install.packages('pheatmap')
install.packages('dplyr')
install.packages('knitr')

Usage

Example patch-seq data

This package includes sample data from Cadwell et al. 2016, which contains patch-seq data from layer 1 inhibitory interneurons and a small number of pyramidal cells.

cadwell_meta includes basic sample metadata about each single-cell sample

data(cadwell_meta)
knitr::kable(head(cadwell_meta, 3))

sample_id: name of the samples. This needs to correspond to column names in the expression file.

major_type: Primary cell type names provided by Cadwell2016

colors: Unique color for cell type

read_count: Count of number of reads

ercc_count: Count of number of reads mapping to ERCC spike-ins

ercc_pct: Percent of reads mapping to ERCC spike-ins (relative to all sequenced reads)

cadwell_expr includes gene expression data for these samples (stored as TPM + 1, transcripts per million)

data(cadwell_expr)
knitr::kable(head(cadwell_expr[, 1:7], 3))

Visualizing cell type-specific marker expression

The central idea behind our single-cell quality metrics is that a high-quality single cell should strongly express markers of its own cell type and not express cell-type specific markers of other cell types. For example, we would expect that an cortical inhibitory neuron should strongly express GABAergic markers, like Gad2, Gad1, and not express markers of other cell types, such as markers of microglia like Tyrobp or C1qc.

Cell type specific marker genes

To obtain cell type specific markers, we make use of mouse cortex-based single cell reference datasets which are comparable to the cell types profiled by in the Cadwell patch-seq dataset. Here, we provide pre-computed gene markers, markers, based on the Tasic2016 and Zeisel2015 reference datasets.

When defining the markers, we have made the distinction between “on” and “off” markers. "On" markers, are genes that are highly expressed in the cell type of interest with enriched expression relative to other cell types. The second class, "Off" markers, are expected to be expressed at low levels in a given patch-seq cell type. These are genes that are specifically expressed in a single cell type (e.g., microglia) and, if expressed, are an indicator of possible cellular contamination.

data(markers)
lapply(markers,head, 5)

Associating single-cells to standardized cell type names

To make use of these marker genes, we need to associate each single cell from the patch-seq dataset to one of the "On" marker types listed above. Here, we define a column contam_type in the cadwell_meta data frame and assign all of the interneurons from the Cadwell dataset to the cell type "Ndnf", and the excitatory cells to the "Pyramidal" type.

cadwell_meta$contam_type = 'Ndnf'
cadwell_meta[str_detect(cadwell_meta$major_type, 'Exc'), 'contam_type'] = 'Pyramidal'

knitr::kable(head(cadwell_meta %>% dplyr::select(sample_id, major_type, contam_type)))

Visualizing expression of multiple cell type markers

Below, we use the function plotMarkerHeatmap to plot the expression of cell type specific markers of the elongated neurogliaform cells (eNGC) and pyramidal cells from the cadwell dataset. We also define the gene expression data frame, cadwell_engc_df, which stores both metadata and gene expression values for the eNGC cells.

# plot marker expression of cadwell eNGCs

cadwell_engc_samples = cadwell_meta %>% 
  dplyr::filter(major_type == 'eNGC') %>% 
  dplyr::pull(sample_id)  # get sample ids for eNGC cells

cadwell_engc_df = cadwell_expr[, cadwell_engc_samples]

cadwell_engc_df = cbind(cadwell_meta[cadwell_meta$sample_id %in% cadwell_engc_samples, ], 
                        cadwell_engc_df  %>% t()) # combines metadata and expression matrices and transposes it
rownames(cadwell_engc_df) = cadwell_engc_samples

plot_cell_types = c('Ndnf_on', 'Pyramidal', 'Microglia') # define which cell types to plot in heatmap
plot_marker_list = c(markers[plot_cell_types])

cadwell_engc_heatmap = plotMarkerHeatmap(markerlist = plot_marker_list, # named list of lists defining which markers to show
                                         expr_matrix = cadwell_engc_df, # data frame that combines gene expr with metadata, each row is a single-cell
                                         show_legend = T, # show color bar ?
                                         show_cell_labels = T # show sample names in heatmap (defined by rownames of expr_matrix)
                                         )

# plot expression of cadwell pyramidal cells

cadwell_pyr_samples = cadwell_meta %>% dplyr::filter(major_type == 'Exc') %>% dplyr::pull(sample_id) 
cadwell_pyr_df = cadwell_expr[, cadwell_pyr_samples]

cadwell_pyr_df = cbind(cadwell_meta[cadwell_meta$sample_id %in% cadwell_pyr_samples, ], cadwell_pyr_df  %>% t())
rownames(cadwell_pyr_df) = cadwell_pyr_samples

plot_cell_types = c('Pyramidal_on', 'Inhibitory', 'Microglia')
plot_marker_list = c(markers[plot_cell_types])

cadwell_pyramidal_heatmap = plotMarkerHeatmap(plot_marker_list, cadwell_pyr_df, show_legend = T, show_cell_labels = T )

Comparing marker expression in patch-seq data to dissociated cell reference data

Loading cell dissociation-based single cell reference dataset

We've provided a small version of the Tasic2016 single cell cortical reference dataset (with expression of only marker genes), stored in the variable aibsExprData and quantified as TPM + 1. Metadata for these samples is available in aibsExprData.

data(aibsExprData) # need to show that we have a field major types and contam_Type in here as well
knitr::kable(head(aibsExprData[1:5, 1:5]))

data(aibsExprMeta) 
knitr::kable(head(aibsExprMeta, 4))

aibsExprDataDf = cbind(aibsExprMeta, aibsExprData  %>% t()) # combines metadata and expression matrices 
rownames(aibsExprDataDf) = aibsExprMeta$sample_id

sample_id: name of the samples. This needs to correspond to column names in the expression file.

aibs_type: Primary cell type names provided in Tasic2016

norm_sub_types: Cell type subtype names

norm_broad_types: Cell type broad names

Comparing expression of individual marker genes in patch-seq samples to dissociated cells

After loading in the data from the Tasic2016 data, we can compare at the expression of cell type specific markers (for various cell types) to the same markers in patch-seq samples, Below, we use the function plotPatchSeqVsDissocMarkerExpr for this purpose, which on the y-axis shows the average expression of markers across Ndnf cells in the Tasic 2016 data and the x-axis shows expression of markers in a single patch-seq sample (one marker gene = one dot).

cell_types_to_plot = c('Ndnf_on', 'Astrocyte', 'Endothelial', 'Microglia', 'Oligodendrocyte', 'OPC', 'Pyramidal')

plotPatchSeqVsDissocMarkerExpr(patch_seq_df = cadwell_engc_df, # patch-seq data frame that combines gene expr with metadata, each row is a single-cell
                                dissoc_cell_df = aibsExprDataDf, # dissociated cell data frame that combines gene expr with metadata, each row is a single-cell
                                contam_type = 'Ndnf', # cell type to plot
                                plot_cell_sample_ids = c('ERR1156978', 'ERR1156984'), # sample_ids to show
                                plot_marker_types = cell_types_to_plot, # cell types to show markers for
                                markers = markers # named list of lists for marker genes
                               ) 

Summarizing cell type specific marker expression in patch-seq data

For each cell in the patch-seq dataset, we'll use the cell type specific marker genes to calculate summaries of each cell type's marker expression using the function calcContamAllTypes. Each column is named a major cell type (i.e., a major cell type with defined markers in markers) and stores the sum of each cell's expression of each cell type marker (in units log2 (TPM + 1)).

cadwell_contam = calcContamAllTypes(cadwell_engc_df, markers) 
cadwell_contam$colors = "indianred1"
cadwell_contam$major_type = cadwell_engc_df$major_type
cadwell_contam$contam_type = "Ndnf_on"

knitr::kable(head(cadwell_contam[1:8], 3))

Summarizing cell type specific marker expression in reference data

We'll use the same function calcContamAllTypes to calculate summarized marker expression for each cell in the dissociated cell based Tasic2016 reference data, stored in the variable aibsExprDataDf. Please note that it's essential that both the patch-seq data and the reference data are normalized using the same metric, TPM + 1 (transcripts per million) here.

aibs_ndnf_contam = calcContamAllTypes(aibsExprDataDf[str_detect(aibsExprDataDf$aibs_type, 'Ndnf'), ], markers) 
aibs_ndnf_contam$colors = 'indianred1'
aibs_ndnf_contam$contam_type = 'Ndnf_on'

aibs_pyr_contam = calcContamAllTypes(aibsExprDataDf[str_detect(aibsExprDataDf$aibs_type, 'L'), ], markers) 
aibs_pyr_contam$colors = "turquoise"
aibs_pyr_contam$contam_type = 'Pyramidal'

aibs_contam = rbind(aibs_ndnf_contam %>% dplyr::select(Ndnf_on, Pyramidal, colors, contam_type), 
                    aibs_pyr_contam %>% dplyr::select(Ndnf_on, Pyramidal, colors, contam_type)
)

We can use the function plotPatchSeqDissocCellMarkerExpr to plot the summarized marker expression for both the patch-seq and the reference datasets (one dot = one cell).

plotPatchSeqDissocCellMarkerExpr(patch_seq_contam_df = cadwell_contam,
                                  dissoc_cell_contam_df = aibs_contam, 
                                 on_type = 'Ndnf_on', 
                                 off_type = 'Pyramidal')

Move this somewhere else

aibsExprDataDf$contam_type = aibsExprDataDf$norm_broad_type
aibs_contam_all_broad = calcContamAllTypes(aibsExprDataDf, markers)
aibs_contam_all_broad$contam_type = factor(aibsExprDataDf$contam_type)
aibs_med_exprs_broad = aibs_contam_all_broad %>% dplyr::group_by(contam_type) %>% dplyr::summarize_all(median) %>% as.data.frame()
rownames(aibs_med_exprs_broad) = aibs_med_exprs_broad$contam_type

aibsExprDataDf$contam_type = paste0(aibsExprDataDf$norm_sub_type, '_on')
aibs_contam_all_sub = calcContamAllTypes(aibsExprDataDf, markers)
aibs_contam_all_sub$contam_type = factor(aibsExprDataDf$contam_type)
aibs_med_exprs_sub = aibs_contam_all_sub %>% dplyr::group_by(contam_type) %>% dplyr::summarize_all(median) %>% as.data.frame()
rownames(aibs_med_exprs_sub) = aibs_med_exprs_sub$contam_type

aibs_med_exprs = rbind(aibs_med_exprs_broad, aibs_med_exprs_sub[c('Ndnf_on', 'Sncg_on', 'Pvalb_on', 'Pyramidal_on'), ])

compare_cell_types_inh = setdiff(names(markers), c('Ndnf_on', 'Pvalb_on', 'Sncg_on', 'Pyramidal_on', 'Inhibitory')) %>% make.names()
compare_cell_types_exc = setdiff(names(markers), c('Pyramidal_on', 'Pvalb_on', 'Sncg_on', 'Ndnf_on', 'Pyramidal')) %>% make.names()

exc_cell_types = c('CA1-PYR', 'Exc', 'RS-PYR', 'BS-PYR')

Computing single-cell quality metrics based on comparison to reference data

We have developed a number of metrics to evaluate the quality of single cell samples by comparison to dissociated cell reference data that can be calculated using the function calculatePatchSeqQCMetrics.

cadwell_engc_qc_df = calculatePatchSeqQCMetrics(cadwell_engc_df, comparison_dataset = 'aibs', markers)

knitr::kable(head(cadwell_engc_qc_df %>% dplyr::select(sample_id:quality_score)))

sample_id: name of the samples.

major_type: cell type identities (provided by Cadwell2016)

contam_type: cell type identities (normalized to cell type names in markers)

marker_sum: Summed expression of "On" cell type marker genes (with cell type defined by contam_type)

marker_sum_norm: Normalized summed expression of "on"-type marker genes, normalized to median expression of same cell type in dissociated-cell reference data

contam_sum: Contamination score, defined as the sum of normalized expression across all "off" cell types defined in compare_cell_types_inh

quality_score: Quality score, defined as the Spearman correlation of marker expression between markers expressed in single cell with mean expression of markers in dissociated cell reference data

This function also outputs normalized expression of each "off"-cell type (defined in compare_cell_types_inh) and we can use the function plotContamHeatmap to show these (each column is one single cell)

knitr::kable(head(cadwell_engc_qc_df %>% dplyr::select(sample_id, Astrocyte:Pyramidal)))

contam_plot_types = factor(c('Astrocyte', 'Endothelial', 'Microglia', 'Oligodendrocyte', 'OPC', 'Pyramidal'), levels = c('Astrocyte', 'Endothelial', 'Microglia', 'Oligodendrocyte', 'OPC', 'Pyramidal'))
contam_mat = getContamMatrixFromPatchSeq(cadwell_engc_qc_df, 'eNGC', contam_show_types = contam_plot_types)
summarized_contam_heatmap = plotContamHeatmap(contam_mat, show_cell_labels = T)

We plot the marker_sum, contam_sum and quality_score variables below

p1 = ggplot2::ggplot(data = cadwell_engc_qc_df, aes(x = marker_sum, y = contam_sum)) + 
  geom_point() + geom_smooth(method = "lm", se = F)
p2 = ggplot2::ggplot(data = cadwell_engc_qc_df, aes(x = marker_sum, y = quality_score)) + 
  geom_point() + geom_smooth(method = "lm", se = F)

plot_grid(p1, p2, nrow = 1)

Caveats in approach

Our single-cell quality metrics are dependent on the following factors: 1) being able to confidently assign a single-cell's major cell type, ideally without using its transcriptomic data. 2) the availability of high-quality single-cell reference data to use for comparing the expression of observed marker genes to expected marker genes.

We note that there are likely situations where either or both of these do not hold, for example, if the cells profiled have unknown types or if reference data is not available (for example, patch-seq data collected from a brain region without adequate cell-dissociation based single-cell atlases).



PavlidisLab/patchSeqQC documentation built on May 20, 2019, 9:13 p.m.