knitr::opts_chunk$set(echo = TRUE) library(knitr) library(ggplot2) devtools::load_all()
This package includes functions responsible for marker gene selection and marker gene profile estimation estimation as described in Mancarci et al. 2017. It also includes a copy of mouse brain cell type markers from the neuroExpressoAnalysis package for convenience along with mock data for easy testing.
Use devtools to install. Additional github packages needs to be installed for it work.
devtools::install_github('oganm/markerGeneProfile')
In this document additional packages are used that are not package dependencies
install.packages('ggplot2') install.packages('gplots') install.packages('viridis') install.packages('dplyr') install.packages('knitr')
A list of marker genes specific to or enriched in a cell type is required in order to estimate cell type profiles. In this package, a copy of mouse brain cell type-specific markers from neuroExpressoAnalysis, the package that summarizes the entire analysis performed in Mancarci et al. 2017 is included (mouseMarkerGenes
). If an different marker gene set is needed, steps outlined below can be followed to create one from a new dataset.
This package includes a sample cell type-specific transcriptomic dataset representing expression profiles from multiple purified cell types aimed to demonstrate the minimal information required for the selection of marker genes.
mgp_sampleProfilesMeta
includes the basic metadata required for the cell type specific expression dataset.
data(mgp_sampleProfilesMeta) knitr::kable(head(mgp_sampleProfilesMeta))
sampleName: name of the samples. This needs to correspond to column names in the expression file.
replicate: A vector marking which samples are replicates of each other.
PMID: A vector marking which samples come from the same study. Normally taking PMIDs of the papers is a good idea.
CellType: A vector marking the cell types that the samples represent.
region: The regions samples are extracted from. Only needed if region specific genes are to be selected.
RegionToParent: If region specific genes are to be selected and a region hierarchy is to be used, this column controls whether or not the sample should be included in the parent regions of the indicated region. If not provided it will default to TRUE
. The name of this column is hard coded and should not be changed.
RegionToChildren: Same as above except it controls if the sample should be included in the children regions. If not provided it will default to TRUE
. The name of this column is hard coded and should not be changed.
mgp_sampleProfiles
is a sample expression data. Gene.Symbol column is the gene identifier that should be composed of unique IDs while the rest are sample names that corresponds to the relevant column in the metadata file. Other columns can be present before the sample data but they should not be of class double
.
data(mgp_sampleProfiles) knitr::kable(mgp_sampleProfiles)
mpg_sampleRegionHiearchy
is a sample region hiearchy. It is a nested named list.
data(mpg_sampleRegionHiearchy) mpg_sampleRegionHiearchy nametree(mpg_sampleRegionHiearchy)
In this example Region 1
and Region 2
are subsets of All
region
Marker gene selection is performed using three functions: markerCandidates
, pickMarkers
and rotateSelect
. By default markerCandidates
will return files for each cell type in a region that lists the gene that are above a given silhouette and fold change thresholds. Other variables are briefly explained below but see the package documentation for in depth explanations
markerCandidates(design = mgp_sampleProfilesMeta, # the design file expression = mgp_sampleProfiles, # expression file outLoc = 'README_files/quickSelection', # output directory groupNames = 'CellType', # name of the column with cell types. can be a vector regionNames = 'region', # name of the column with brain regions. leave NULL if no region seperation is desired PMID = 'PMID', # name of the column with study identifiers sampleName = 'sampleName', # name of the column with sample names replicates = 'replicate', # name of the column with replicates foldChangeThresh = 10, # threshold of fold change for gene selection (default is 10) minimumExpression = 8, # minimum expression level that a gene can be considered a marker gene (default is 8) background = 6, # background level of expression (default is 6) regionHierarchy = mpg_sampleRegionHiearchy, # hierarchy of brain regions to be used geneID = 'Gene.Symbol', # column name with with gene idenditifers cores = 8 # number of cores to use in parallelization )
This creates 3 directories in the output directory
list.files('README_files/quickSelection')
The CellType
directory is a list of marker genes that disregards all region specifications (redundant with All_CellType
in this case) while Region 2_CellType
and All_CellType
directories inlcude cell types from the relevant region. Note the absence of Region 1_CellType
since that region only has a single cell type.
read.table('README_files/quickSelection/All_CellType/Cell C') %>% knitr::kable()
This file shows the candidate genes for cell type Cell C
in region All
. The first column is the gene identifier, the second is change in expression in log_2 scale and the third one is the silhouette coefficient. Note that Gene6
is absent since its expression level was below the minimum level allowed. markerCandidates
function does not apply a threshold for silhouette coefficient it also doesn't check to see if a gene satisfies fold change threshold for multiple genes. pickMarkers
function does that.
pickMarkers('README_files/quickSelection/All_CellType/', foldChange = 1, # this is a fixed fold change threshold that ignores some leniency that comes from markerCandidates. setting it to 1 makes it irrelevant silhouette = 0.5)
If all genes for all regions needs to be seen
pickMarkersAll('README_files/quickSelection', foldChange = 1, silhouette = 0.5)
The above method is a quick way to pick markers but it does not handle bimodality in expression distribution well. To ensure robustness of the results it is better to perform multiple selections with permutations. markerCandidates
function has variables to handle permutations for you. rotate
controls what is the percentage of samples that should be removed every time. seed controls the random seed and is there to ensure reproducibility.
for (i in 1:10){ markerCandidates(design = mgp_sampleProfilesMeta, # the design file expression = mgp_sampleProfiles, # expression file outLoc = file.path('README_files/Rotation',i), # output directory groupNames = 'CellType', # name of the column with cell types. can be a vector regionNames = 'region', # name of the column with brain regions. leave NULL if no region seperation is desired PMID = 'PMID', # name of the column with study identifiers sampleName = 'sampleName', # name of the column with sample names replicates = 'replicate', # name of the column with replicates foldChangeThresh = 10, # threshold of fold change for gene selection (default is 10) minimumExpression = 8, # minimum expression level that a gene can be considered a marker gene (default is 8) background = 6, # background level of expression (default is 6) regionHierarchy = mpg_sampleRegionHiearchy, # hierarchy of brain regions to be used geneID = 'Gene.Symbol', # column name with with gene idenditifers cores = 8, # number of cores to use in parallelization rotate = 0.33, seed = i ) }
This creates multiple selection directories. rotateSelect
can be used to count the number of times a gene is selected for each cell type in each region. This creates another directory similar to the output of markerCandidates
. Again, valid markers can be acquired using pickMarkers
rotateSelect(rotationOut='README_files/Rotation', rotSelOut='README_files/RotSel', cores = 8, foldChange = 1 # this is a fixed fold change threshold that ignores some leniency that comes from markerCandidates. setting it to 1 makes it irrelevant )
pickMarkers('README_files/RotSel/All_CellType/',rotationThresh = 0.95) pickMarkersAll('README_files/RotSel',rotationThresh = 0.95)
The package includes mouse brain cell type markers published in Mancarci et al. 2017 and gene expression data from substantia nigra samples from healthy donors and Parkinson's disease patients by Lesnick et al. 2007.
Mouse marker genes is available in mouseMarkerGenes
object as a nested list.
data(mouseMarkerGenes) names(mouseMarkerGenes) lapply(mouseMarkerGenes$Midbrain[1:3],head, 14)
Available Lesnick et al. data is stored in mgp_LesnickParkinsonsExp
and mgp_LesnickParkinsonsMeta
objects
library(dplyr) data(mgp_LesnickParkinsonsExp) mgp_LesnickParkinsonsExp %>% dplyr::select(-GeneNames) %>% head %>% {.[,1:6]}
data(mgp_LesnickParkinsonsMeta) mgp_LesnickParkinsonsMeta %>% head
Before MGP estimation, it is important to filter expression data of low expressed genes and make sure all genes are represented only once in the dataset. While there are many probeset summarization and methods, for this work we chose the most variable probeset and remove all probes with a median expression below the median expression of the dataset.
unfilteredParkinsonsExp = mgp_LesnickParkinsonsExp # keep this for later medExp = mgp_LesnickParkinsonsExp %>% sepExpr() %>% {.[[2]]} %>% unlist %>% median # mostVariable function is part of this package that does probe selection and filtering for you mgp_LesnickParkinsonsExp = mostVariable(mgp_LesnickParkinsonsExp, threshold = medExp, threshFun= median)
MGPs are simply the first principal component of marker gene expression. This method of marker gene profile estimation is similar to the methodology of multiple previous works that aim to estimate relative abundance of cell types in a whole tissue sample (Chikina et al., 2015; Westra et al., 2015; Xu et al., 2013).
Primary function that deals with MGP estimation is mgpEstimate
. This function will take in expression data and marker gene lists to return MGPs. exprData
is the expression matrix which should include gene identifiers as a column. Other columns can be present at the beginning of the data frame but should not be of class double
. genes
is the list of markers. It is assumed to be a list of character vectors, each vector containing gene names for a cell type. geneColName
is the name of the column where gene names are found. geneTransform
is a function that will be applied to the gene names provided in genes
. Note that this by default tranforms mouse gene names to human gene names. Set it to NULL if this is not desired.
Below a basic estimation performed with other important variables briefly explained.
estimations = mgpEstimate(exprData=mgp_LesnickParkinsonsExp, genes=mouseMarkerGenes$Midbrain, geneColName='Gene.Symbol', outlierSampleRemove=F, # should outlier samples removed. This is done using boxplot stats. geneTransform =function(x){homologene::mouse2human(x)$humanGene}, # this is the default option for geneTransform groups=mgp_LesnickParkinsonsMeta$disease, #if there are experimental groups provide them here. if not desired set to NULL seekConsensus = FALSE, # ensures gene rotations are positive in both of the groups removeMinority = TRUE) # removes genes if they are the minority in terms of rotation sign from estimation process
Dopamine rgic cell loss is a known effect of Parkinson's Disease. To see if this effect can be observed we can look at dopaminergic MGPs in healthy donors vs Parkinson's Disease patients
library(ggplot2) ls(estimations) ls(estimations$estimates) dopaminergicFrame = data.frame(`Dopaminergic MGP` = estimations$estimates$Dopaminergic, state = estimations$groups$Dopaminergic, # note that unless outlierSampleRemove is TRUE this will be always the same as the groups input check.names=FALSE) ggplot2::ggplot(dopaminergicFrame, aes(x = state, y = `Dopaminergic MGP`)) + geom_boxvio() + geom_jitter(width = .05) # this is just a convenience function that outputs a list of ggplot elements.
To see if the difference between the two groups is statistically significant we use wilcoxon test (Mann-Whitney test). Wilcoxon-test is a non parametric tests that does not make assumptions about the distribution of the data. We chose to use it because marker gene profiles are not normally distributed.
group1 = estimations$estimates$Dopaminergic[estimations$groups$Dopaminergic %in% "Control"] group2 = estimations$estimates$Dopaminergic[estimations$groups$Dopaminergic %in% "PD"] wilcox.test(group1,group2)
Based on these results we can say that there is indeed a significant difference between doparminergic marker gene profiles between control and parkinson's disease patients.
This difference can be also observed by looking at gene expression of markers
estimations$usedMarkerExpression$Dopaminergic%>% as.matrix %>% gplots::heatmap.2(trace = 'none', scale='row',Rowv = FALSE,Colv = FALSE, dendrogram = 'none', col= viridis::viridis(10),cexRow = 1.5, cexCol = 1, ColSideColors = estimations$groups$Dopaminergic %>% toColor(palette = c('Control' = 'blue', "PD" = "red")) %$% cols , margins = c(7,8))
Indeed in general most marker genes have a high expression is control samples which are shown in green.
It is important to note that not all marker genes are used in the estimation of marker gene profiles. Below we see all genes that are identified as dopaminergic cell markers with human orthologues. Yet not all these genes appear in the heatmap above
mouseHumanGeneTable = mouseMarkerGenes$Midbrain$Dopaminergic %>% homologene::mouse2human() allHumanDopaGenes = mouseHumanGeneTable %$% humanGene mouseHumanGeneTable
Some of these genes are removed because they are not included in our dataset, either because they are not in the platform, or because the gene was filtered in the pre-processing stage due to low expression.
allHumanDopaGenes[!allHumanDopaGenes %in% mgp_LesnickParkinsonsExp$Gene.Symbol] allGenes = allHumanDopaGenes[allHumanDopaGenes %in% mgp_LesnickParkinsonsExp$Gene.Symbol]
Other genes can be removed because mgpEstimate thinks they don't correlate well with the rest of the genes. Details of this process can be found in Mancarci et al. 2017 manuscript
allGenes[!allGenes %in% rownames(estimations$usedMarkerExpression$Dopaminergic)] genesUsed = rownames(estimations$usedMarkerExpression$Dopaminergic)
By looking at the unfiltered expression data, we can see how these two genes were unsuitable for MGP estimation
toPlot = unfilteredParkinsonsExp[unfilteredParkinsonsExp$Gene.Symbol %in% homologene::mouse2human(mouseMarkerGenes$Midbrain$Dopaminergic)$humanGene,] %>% mostVariable(threshold = 0) toPlot %>% mostVariable(threshold = 0) %>% sepExpr() %>% {.[[2]]} %>% as.matrix() %>% apply(1,function(x){scale01(x)}) %>% t %>% {rownames(.) = toPlot$Gene.Symbol[toPlot$Gene.Symbol %in% homologene::mouse2human(mouseMarkerGenes$Midbrain$Dopaminergic)$humanGene];.} %>% reshape2::melt() %>% {colnames(.) = c('Gene','Sample','Expression');.} %>% dplyr::mutate(`Is used?` = rep('used',length(Gene)) %>% {.[Gene %in% 'CHRNA6'] = '(CHRNA6) not expressed';.[Gene %in% 'PRKCG'] = '(PRKCG) not correlated';.}) %>% ggplot(aes(y = Expression, x = Sample, group = Gene, color = `Is used?`)) + geom_line() + cowplot::theme_cowplot() + theme( axis.text.x= element_blank(), axis.title = element_text(size = 20), legend.text = element_text(size = 17), legend.title = element_text(size=22), plot.title = element_text(size = 20) ) + ggtitle('Scaled expression of markers') toPlot %>% mostVariable(threshold = 0) %>% sepExpr() %>% {.[[2]]} %>% as.matrix()%>% {rownames(.) = toPlot$Gene.Symbol[toPlot$Gene.Symbol %in% homologene::mouse2human(mouseMarkerGenes$Midbrain$Dopaminergic)$humanGene];.} %>% reshape2::melt() %>% {colnames(.) = c('Gene','Sample','Expression');.} %>% dplyr::mutate(`Is used?` = rep('used',length(Gene)) %>% {.[Gene %in% 'CHRNA6'] = 'CHRNA6 - not expressed';.[Gene %in% 'PRKCG'] = 'PRKCG - not correlated';.}) %>% ggplot(aes(y = Expression, x = Sample, group = Gene, color = `Is used?`)) + geom_line() + cowplot::theme_cowplot() + theme( axis.text.x= element_blank(), axis.title = element_text(size = 20), legend.text = element_text(size = 17), legend.title = element_text(size=22), plot.title = element_text(size = 20) )+ ggtitle('Nonscaled expression of markers')
Both of these genes seem to be non-expressed though PRKCG managed to be just above the removal threshold. Luckily lack of correlation reveals that it is not behaving as other marker genes.
The ratio of genesUsed
and allGenes
(all markers available in the study) can be used as a confidence metric. If a significant portion of the genes do not correlate well with each other
that may point to presence of non cell type specific signal (regulation or noise). For all cell types this ratio is outputted. You'll see a warning if this
ratio ever exceeds 0.4
estimations$removedMarkerRatios
The ratio for dopaminergic cells is fairly low. We can also look at the amount of varience explained in the first PC of marker gene expression. If this value is low, it can point to higher amounts of confounding noise or regulation.
estimations$trimmedPCAs$Dopaminergic %>% summary()
For instance unlike dopaminergic cells, cholinergic cells seem to have a higher proportion of their marker genes removed. Looking at the variation explained by first PC reveals that first PC only explains 28% of all variance.
estimations$trimmedPCAs$BrainstemCholin %>% summary()
Such a result can occur if
In this tutorial homologene database is used through homologene package. Note that the default version of this database is somewhat old which means some annotations may have been changed. The homologene
package also includes an updated version of the homologene package to make sure you get all the correct matches.
If we look at dopaminergic cell markers for instance we could have gotten an extra gene (SCN2A) if we used the updated version.
homologene::mouse2human(mouseMarkerGenes$Midbrain$Dopaminergic) %$% humanGene homologene::mouse2human(mouseMarkerGenes$Midbrain$Dopaminergic, db = homologene::homologeneData2) %$% humanGene
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.