title: "README for patchSeqQC package" author: "Shreejoy Tripathy" date: "23/04/2018" output: html_document: keep_md: yes
This package includes functions for assessing the quality of single-cell transcriptomic data using cell type-specific marker gene expression, based on our paper, Assessing Transcriptome Quality in Patch-Seq Datasets. All of the code and data for reproducing the figures from the manuscipt is available at the patchSeqQCManuscript repository.
These functions were initially tailored for data collected using the patch-seq methodology, but are applicable to other single-cell RNAseq datasets. This package is meant as a tutorial to illustrate the approach outlined in our paper, but please contact us with feature requests and bug reports.
Use devtools to install. Additional github packages needs to be installed for it work.
devtools::install_github('PavlidisLab/patchSeqQC')
This package requires the following packages:
install.packages('ggplot2')
install.packages('pheatmap')
install.packages('dplyr')
install.packages('knitr')
This package includes sample data from Cadwell et al. 2016, which contains patch-seq data from layer 1 inhibitory interneurons and a small number of pyramidal cells.
cadwell_meta
includes basic sample metadata about each single-cell sample
data(cadwell_meta)
knitr::kable(head(cadwell_meta, 3))
sample_id major_type colors read_count ercc_count ercc_pct
ERR1156951 eNGC #F08080 42149548 383338 0.9094712 ERR1156956 eNGC #F08080 25616840 634201 2.4757191 ERR1156981 eNGC #F08080 30085020 1053275 3.5009948
sample_id: name of the samples. This needs to correspond to column names in the expression file.
major_type: Primary cell type names provided by Cadwell2016
colors: Unique color for cell type
read_count: Count of number of reads
ercc_count: Count of number of reads mapping to ERCC spike-ins
ercc_pct: Percent of reads mapping to ERCC spike-ins (relative to all sequenced reads)
cadwell_expr
includes gene expression data for these samples (stored as TPM + 1, transcripts per million)
data(cadwell_expr)
knitr::kable(head(cadwell_expr[, 1:7], 3))
ERR1156951 ERR1156956 ERR1156981 ERR1156983 ERR1156984 ERR1156986 ERR1156994
Gnai3 4.35 1 3.02 1.55 1.13 1 1 Pbsn 1.00 1 1.00 1.00 1.00 1 1 Cdc45 1.00 1 1.00 1.00 1.00 1 1
The central idea behind our single-cell quality metrics is that a high-quality single cell should strongly express markers of its own cell type and not express cell-type specific markers of other cell types. For example, we would expect that an cortical inhibitory neuron should strongly express GABAergic markers, like Gad2, Gad1, and not express markers of other cell types, such as markers of microglia like Tyrobp or C1qc.
To obtain cell type specific markers, we make use of mouse cortex-based single cell reference datasets which are comparable to the cell types profiled by in the Cadwell patch-seq dataset. Here, we provide pre-computed gene markers, markers
, based on the Tasic2016 and Zeisel2015 reference datasets.
When defining the markers, we have made the distinction between “on” and “off” markers. "On" markers, are genes that are highly expressed in the cell type of interest with enriched expression relative to other cell types. The second class, "Off" markers, are expected to be expressed at low levels in a given patch-seq cell type. These are genes that are specifically expressed in a single cell type (e.g., microglia) and, if expressed, are an indicator of possible cellular contamination.
data(markers)
lapply(markers,head, 5)
## $Astrocyte
## [1] "Gstm1" "F3" "Appl2" "Fxyd1" "Slc1a3"
##
## $Endothelial
## [1] "Crip1" "Tpm4" "Arhgap29" "Igfbp7" "S100a11"
##
## $Microglia
## [1] "Tyrobp" "Ctss" "C1qc" "Rrbp1" "Rnase4"
##
## $Oligodendrocyte
## [1] "Desi1" "Mobp" "Eml1" "Grb14" "Pllp"
##
## $OPC
## [1] "Fosb" "Cdo1" "Gpt2" "Vcan" "Rnf180"
##
## $Pyramidal
## [1] "Baiap2" "Slc17a7" "Ptk2b" "Nrn1" "Fhl2"
##
## $Inhibitory
## [1] "Npy" "Gad1" "Slc6a1" "Gad2" "Pcp4l1"
##
## $Ndnf_on
## [1] "Gad1" "Slc6a1" "Gad2" "Pcp4l1" "Fgf13"
##
## $Sncg_on
## [1] "Cnr1" "Gad1" "Sncg" "Synpr" "Zcchc12"
##
## $Pvalb_on
## [1] "Gad1" "Slc6a1" "Gad2" "Pvalb" "Nxph1"
##
## $Pyramidal_on
## [1] "Nrgn" "Arpp21" "Rgs4" "Ctxn1" "Nrn1"
To make use of these marker genes, we need to associate each single cell from the patch-seq dataset to one of the "On" marker types listed above. Here, we define a column contam_type
in the cadwell_meta
data frame and assign all of the interneurons from the Cadwell dataset to the cell type "Ndnf", and the excitatory cells to the "Pyramidal" type.
cadwell_meta$contam_type = 'Ndnf'
cadwell_meta[str_detect(cadwell_meta$major_type, 'Exc'), 'contam_type'] = 'Pyramidal'
knitr::kable(head(cadwell_meta %>% dplyr::select(sample_id, major_type, contam_type)))
sample_id major_type contam_type
ERR1156951 eNGC Ndnf ERR1156956 eNGC Ndnf ERR1156981 eNGC Ndnf ERR1156983 eNGC Ndnf ERR1156984 eNGC Ndnf ERR1156986 eNGC Ndnf
Below, we use the function plotMarkerHeatmap
to plot the expression of cell type specific markers of the elongated neurogliaform cells (eNGC) and pyramidal cells from the cadwell dataset. We also define the gene expression data frame, cadwell_engc_df
, which stores both metadata and gene expression values for the eNGC cells.
# plot marker expression of cadwell eNGCs
cadwell_engc_samples = cadwell_meta %>%
dplyr::filter(major_type == 'eNGC') %>%
dplyr::pull(sample_id) # get sample ids for eNGC cells
cadwell_engc_df = cadwell_expr[, cadwell_engc_samples]
cadwell_engc_df = cbind(cadwell_meta[cadwell_meta$sample_id %in% cadwell_engc_samples, ],
cadwell_engc_df %>% t()) # combines metadata and expression matrices and transposes it
rownames(cadwell_engc_df) = cadwell_engc_samples
plot_cell_types = c('Ndnf_on', 'Pyramidal', 'Microglia') # define which cell types to plot in heatmap
plot_marker_list = c(markers[plot_cell_types])
cadwell_engc_heatmap = plotMarkerHeatmap(markerlist = plot_marker_list, # named list of lists defining which markers to show
expr_matrix = cadwell_engc_df, # data frame that combines gene expr with metadata, each row is a single-cell
show_legend = T, # show color bar ?
show_cell_labels = T # show sample names in heatmap (defined by rownames of expr_matrix)
)
# plot expression of cadwell pyramidal cells
cadwell_pyr_samples = cadwell_meta %>% dplyr::filter(major_type == 'Exc') %>% dplyr::pull(sample_id)
cadwell_pyr_df = cadwell_expr[, cadwell_pyr_samples]
cadwell_pyr_df = cbind(cadwell_meta[cadwell_meta$sample_id %in% cadwell_pyr_samples, ], cadwell_pyr_df %>% t())
rownames(cadwell_pyr_df) = cadwell_pyr_samples
plot_cell_types = c('Pyramidal_on', 'Inhibitory', 'Microglia')
plot_marker_list = c(markers[plot_cell_types])
cadwell_pyramidal_heatmap = plotMarkerHeatmap(plot_marker_list, cadwell_pyr_df, show_legend = T, show_cell_labels = T )
We've provided a small version of the Tasic2016 single cell cortical reference dataset (with expression of only marker genes), stored in the variable aibsExprData
and quantified as TPM + 1. Metadata for these samples is available in aibsExprData
.
data(aibsExprData) # need to show that we have a field major types and contam_Type in here as well
knitr::kable(head(aibsExprData[1:5, 1:5]))
GSM1839240 GSM1839242 GSM1839239 GSM1839244 GSM1839241
Gstm1 1.00 1.00 3788.38 6690.01 2016.17 F3 1.00 1.00 1465.79 1599.76 1436.23 Appl2 1.00 2.21 1493.16 329.18 24.83 Fxyd1 1.00 8.10 1.00 250.00 45.87 Slc1a3 1.09 1.17 6329.91 4443.95 2992.33
data(aibsExprMeta)
knitr::kable(head(aibsExprMeta, 4))
sample_id aibs_type norm_sub_types norm_broad_types read_count ercc_count ercc_pct colors
GSM1839240 Vip Chat Inhibitory Inhibitory 4101528 139267 0.0339549 magenta GSM1839242 L2/3 Ptgs2 Pyramidal Pyramidal 4393012 104339 0.0237511 magenta GSM1839239 Astro Gja1 Astrocyte Astrocyte 3457441 348669 0.1008460 magenta GSM1839244 Astro Gja1 Astrocyte Astrocyte 3915440 473689 0.1209798 magenta
aibsExprDataDf = cbind(aibsExprMeta, aibsExprData %>% t()) # combines metadata and expression matrices
rownames(aibsExprDataDf) = aibsExprMeta$sample_id
sample_id: name of the samples. This needs to correspond to column names in the expression file.
aibs_type: Primary cell type names provided in Tasic2016
norm_sub_types: Cell type subtype names
norm_broad_types: Cell type broad names
After loading in the data from the Tasic2016 data, we can compare at the expression of cell type specific markers (for various cell types) to the same markers in patch-seq samples, Below, we use the function plotPatchSeqVsDissocMarkerExpr
for this purpose, which on the y-axis shows the average expression of markers across Ndnf
cells in the Tasic 2016 data and the x-axis shows expression of markers in a single patch-seq sample (one marker gene = one dot).
cell_types_to_plot = c('Ndnf_on', 'Astrocyte', 'Endothelial', 'Microglia', 'Oligodendrocyte', 'OPC', 'Pyramidal')
plotPatchSeqVsDissocMarkerExpr(patch_seq_df = cadwell_engc_df, # patch-seq data frame that combines gene expr with metadata, each row is a single-cell
dissoc_cell_df = aibsExprDataDf, # dissociated cell data frame that combines gene expr with metadata, each row is a single-cell
contam_type = 'Ndnf', # cell type to plot
plot_cell_sample_ids = c('ERR1156978', 'ERR1156984'), # sample_ids to show
plot_marker_types = cell_types_to_plot, # cell types to show markers for
markers = markers # named list of lists for marker genes
)
For each cell in the patch-seq dataset, we'll use the cell type specific marker genes to calculate summaries of each cell type's marker expression using the function calcContamAllTypes
. Each column is named a major cell type (i.e., a major cell type with defined markers in markers
) and stores the sum of each cell's expression of each cell type marker (in units log2 (TPM + 1)).
cadwell_contam = calcContamAllTypes(cadwell_engc_df, markers)
cadwell_contam$colors = "indianred1"
cadwell_contam$major_type = cadwell_engc_df$major_type
cadwell_contam$contam_type = "Ndnf_on"
knitr::kable(head(cadwell_contam[1:8], 3))
Astrocyte Endothelial Microglia Oligodendrocyte OPC Pyramidal Inhibitory Ndnf_on
ERR1156951 39.78814 5.392378 24.84645 41.51675 20.316027 20.115628 72.68619 132.2617 ERR1156956 14.04896 12.864828 65.34141 22.77214 10.900621 13.685440 73.92142 142.8065 ERR1156981 14.68443 4.939211 25.67249 11.63907 9.631161 9.782124 68.38971 138.6779
We'll use the same function calcContamAllTypes
to calculate summarized marker expression for each cell in the dissociated cell based Tasic2016 reference data, stored in the variable aibsExprDataDf
. Please note that it's essential that both the patch-seq data and the reference data are normalized using the same metric, TPM + 1 (transcripts per million) here.
aibs_ndnf_contam = calcContamAllTypes(aibsExprDataDf[str_detect(aibsExprDataDf$aibs_type, 'Ndnf'), ], markers)
aibs_ndnf_contam$colors = 'indianred1'
aibs_ndnf_contam$contam_type = 'Ndnf_on'
aibs_pyr_contam = calcContamAllTypes(aibsExprDataDf[str_detect(aibsExprDataDf$aibs_type, 'L'), ], markers)
aibs_pyr_contam$colors = "turquoise"
aibs_pyr_contam$contam_type = 'Pyramidal'
aibs_contam = rbind(aibs_ndnf_contam %>% dplyr::select(Ndnf_on, Pyramidal, colors, contam_type),
aibs_pyr_contam %>% dplyr::select(Ndnf_on, Pyramidal, colors, contam_type)
)
We can use the function plotPatchSeqDissocCellMarkerExpr
to plot the summarized marker expression for both the patch-seq and the reference datasets (one dot = one cell).
plotPatchSeqDissocCellMarkerExpr(patch_seq_contam_df = cadwell_contam,
dissoc_cell_contam_df = aibs_contam,
on_type = 'Ndnf_on',
off_type = 'Pyramidal')
Move this somewhere else
aibsExprDataDf$contam_type = aibsExprDataDf$norm_broad_type
aibs_contam_all_broad = calcContamAllTypes(aibsExprDataDf, markers)
aibs_contam_all_broad$contam_type = factor(aibsExprDataDf$contam_type)
aibs_med_exprs_broad = aibs_contam_all_broad %>% dplyr::group_by(contam_type) %>% dplyr::summarize_all(median) %>% as.data.frame()
rownames(aibs_med_exprs_broad) = aibs_med_exprs_broad$contam_type
aibsExprDataDf$contam_type = paste0(aibsExprDataDf$norm_sub_type, '_on')
aibs_contam_all_sub = calcContamAllTypes(aibsExprDataDf, markers)
aibs_contam_all_sub$contam_type = factor(aibsExprDataDf$contam_type)
aibs_med_exprs_sub = aibs_contam_all_sub %>% dplyr::group_by(contam_type) %>% dplyr::summarize_all(median) %>% as.data.frame()
rownames(aibs_med_exprs_sub) = aibs_med_exprs_sub$contam_type
aibs_med_exprs = rbind(aibs_med_exprs_broad, aibs_med_exprs_sub[c('Ndnf_on', 'Sncg_on', 'Pvalb_on', 'Pyramidal_on'), ])
compare_cell_types_inh = setdiff(names(markers), c('Ndnf_on', 'Pvalb_on', 'Sncg_on', 'Pyramidal_on', 'Inhibitory')) %>% make.names()
compare_cell_types_exc = setdiff(names(markers), c('Pyramidal_on', 'Pvalb_on', 'Sncg_on', 'Ndnf_on', 'Pyramidal')) %>% make.names()
exc_cell_types = c('CA1-PYR', 'Exc', 'RS-PYR', 'BS-PYR')
We have developed a number of metrics to evaluate the quality of single cell samples by comparison to dissociated cell reference data that can be calculated using the function calculatePatchSeqQCMetrics
.
cadwell_engc_qc_df = calculatePatchSeqQCMetrics(cadwell_engc_df, comparison_dataset = 'aibs', markers)
knitr::kable(head(cadwell_engc_qc_df %>% dplyr::select(sample_id:quality_score)))
sample_id major_type contam_type marker_sum marker_sum_norm contam_sum quality_score
1 ERR1156951 eNGC Ndnf 132.26171 0.8447999 1.0949213 0.5212651 2 ERR1156956 eNGC Ndnf 142.80646 0.9121527 1.0306555 0.3574985 11 ERR1156981 eNGC Ndnf 138.67790 0.8857822 0.4750284 0.4342528 12 ERR1156983 eNGC Ndnf 106.10947 0.6777567 2.3589052 0.2623394 13 ERR1156984 eNGC Ndnf 85.17883 0.5440657 2.2399839 0.3908998 14 ERR1156986 eNGC Ndnf 146.43814 0.9353495 0.5143846 0.6244818
sample_id: name of the samples.
major_type: cell type identities (provided by Cadwell2016)
contam_type: cell type identities (normalized to cell type names in markers
)
marker_sum: Summed expression of "On" cell type marker genes (with cell type defined by contam_type
)
marker_sum_norm: Normalized summed expression of "on"-type marker genes, normalized to median expression of same cell type in dissociated-cell reference data
contam_sum: Contamination score, defined as the sum of normalized expression across all "off" cell types defined in compare_cell_types_inh
quality_score: Quality score, defined as the Spearman correlation of marker expression between markers expressed in single cell with mean expression of markers in dissociated cell reference data
This function also outputs normalized expression of each "off"-cell type (defined in compare_cell_types_inh
) and we can use the function plotContamHeatmap
to show these (each column is one single cell)
knitr::kable(head(cadwell_engc_qc_df %>% dplyr::select(sample_id, Astrocyte:Pyramidal)))
sample_id Astrocyte Endothelial Microglia Oligodendrocyte OPC Pyramidal
1 ERR1156951 0.1088640 0.0941299 0.1505082 0.1658869 0.2308079 0.3447244 2 ERR1156956 0.0050728 0.2327905 0.4197299 0.0677605 0.0721286 0.2331731 11 ERR1156981 0.0076353 0.0857208 0.1560000 0.0094799 0.0507342 0.1654582 12 ERR1156983 0.3854968 0.2725112 0.3288887 0.2238120 0.1345609 1.0136356 13 ERR1156984 0.2424618 0.1834264 0.3076201 0.2217693 0.2237593 1.0609470 14 ERR1156986 0.0427454 0.0167817 0.0475493 0.0363846 0.0757241 0.2951996
contam_plot_types = factor(c('Astrocyte', 'Endothelial', 'Microglia', 'Oligodendrocyte', 'OPC', 'Pyramidal'), levels = c('Astrocyte', 'Endothelial', 'Microglia', 'Oligodendrocyte', 'OPC', 'Pyramidal'))
contam_mat = getContamMatrixFromPatchSeq(cadwell_engc_qc_df, 'eNGC', contam_show_types = contam_plot_types)
summarized_contam_heatmap = plotContamHeatmap(contam_mat, show_cell_labels = T)
We plot the marker_sum
, contam_sum
and quality_score
variables below
p1 = ggplot2::ggplot(data = cadwell_engc_qc_df, aes(x = marker_sum, y = contam_sum)) +
geom_point() + geom_smooth(method = "lm", se = F)
p2 = ggplot2::ggplot(data = cadwell_engc_qc_df, aes(x = marker_sum, y = quality_score)) +
geom_point() + geom_smooth(method = "lm", se = F)
plot_grid(p1, p2, nrow = 1)
Our single-cell quality metrics are dependent on the following factors: 1) being able to confidently assign a single-cell's major cell type, ideally without using its transcriptomic data. 2) the availability of high-quality single-cell reference data to use for comparing the expression of observed marker genes to expected marker genes.
We note that there are likely situations where either or both of these do not hold, for example, if the cells profiled have unknown types or if reference data is not available (for example, patch-seq data collected from a brain region without adequate cell-dissociation based single-cell atlases).
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.