library(knitr)
library(tispec)
knitr::opts_chunk$set(
    out.width = "600px", # sets image sizes
    fig.align = "center",
    collapse = TRUE,
    comment = ">"
)

wd <- paste(getwd(), '/images/', sep = '') # should work on any machine

Introduction

This package is an adaptable and efficient implementation of the tau specificity algorithm^[Yanai, I., Benjamin, H., Shmoish, M., Chalifa-Caspi, V., Shklar, M., Ophir, R., Bar-Even, A., Horn-Saban, S., Safran, M., Domany, E., Lancet, D., Shmueli, O., 2005. Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics (Oxford, England). 21, 650-659.] benchmarked in 2016 by Kryuchkova-Mostacci & Robinson-Rechavi^[Kryuchkova-Mostacci, N. & Robinson-Rechavi, M., 2016. A benchmark of gene expression tissue-specificity metrics. Briefings in Bioinformatics.]. To utilise this downstream RNA-seq analysis step, the user first runs their own RNA-seq analysis pipeline and normalisation steps to produce a set of counts (e.g. FPKM, RPKM, TPM) for each gene in each tissue according to the user preference. Many algorithms have been developed for measuring tissue specificity, and they all fall into 1 of 2 categories:

The tau specificity algorithm was implemented in this package because:

Using both human and mouse RNA-seq data, the benchmark paper compared 9 algorithms:

The results of the benchmark paper (fig. 1) show that every algorithm produces a bimodal distribution identifying 3 clear groups:

The authors identified the best performing algorithms as tau, gini and counts. However the authors state that counts is only robust if the correct expression threshold/cut-off is chosen and gini doesn’t normalise for expression amplitude between tissues, while tau normalises for amplitude between tissues and also includes the number of tissues as part of the algorithm.

include_graphics(paste(wd,'benchmark.png',sep = ''))

The key conclusions of the benchmark were as follows:

Workflow

include_graphics(paste(wd,'workflow.png',sep = ''))

The workflow (fig. 1) first log transforms the expression values provided by the user. To make cross-tissue comparisons possible, quantile normalizing is performed on the entire dataframe, after which each gene is given a bin value between 0-10 for each tissue. The higher the bin value, the greater the expression. The list of bin values of a gene is then passed to the tau algorithm to calculate it's specificity for any tissue. Finally, the specificity of each gene for each individual tissue is calculated (tau expression fraction, tef).

Input

Users run their preferred RNA-seq analysis pipeline to calculate a single expression value for each gene in each tissue (e.g. mean TPM). Currently the package supports data from mouse, human and macaque. All genes should named by their Ensembl Gene IDs (e.g. ENSMUSG00000066232). This is to allow biomaRt annotation later in the workflow. Take care to annotate all genes using the same ensembl version (all versions from ensembl 79 onward are supported). This is important to ensure no genes are excluded from the pipeline due to discontinued Ensembl Gene IDs. The package takes as input a single dataframe where the first column is the Ensembl Gene ID, each subsequent column header is a tissue name, and the cells contain the user defined expression values of all genes in all tissues.

library(tispec)
head(meanExp[,1:5], n=5)

checkInput(meanExp[,1:5])

The checkInput function evaluates the cells of the dataframe to ensure it is in the correct format i.e. The function returns errors when it finds cells that contain character strings, NAs, negative values or infinite values. In the example above, only the first five tissues are checked. The number of incorrectly formatted cells are returned for each tissue.

Normalisation

For each tissue independently, the input data should be normalised so that differences in expression between 1 gene and another within the tissue can be compared while reducing the effect of outliers on the output. A pseudo count is added to log2 transform the data, and a threshold of log2(0) is set. This means that all genes with input values below 1.0 will be defined as non-expressed. All genes not expressed in any tissue are then removed from the data. Finally, to allow comparison of gene expression across different tissues that may come from different experiments/laboratories, the log2 transformed dataframe is quantile normalised which makes the distribution of data in each tissue statistically identical to each other. Remember, now the expression values are on a log scale, so +1 difference represents a 10x difference in expression.

log2Exp <- log2Tran(meanExp) 
head(log2Exp[,1:5], n=5)
qnExp <- quantNorm(log2Exp)
head(qnExp[,1:5], n=5)

Calculate Tissue Specificity

After normalisation, implementation of the tau specificity algorithm assigns each gene a tau value between 0-1.

The tau value only defines the "specificity" of a gene. To determine which tissue the gene is actually specific for, requires calculation of tau expression fractions (tef). As can be seen in the worked example in figure 2, gene X is a HSG with a tau value of 0.95. However, calculating tef identifies that Gene X is highly specific for tissue B (0.95 tef), with additional but much lower specificity for tissue C (0.24 tef). Targetting gene X in tissue B, may produce off-target effects in tissue C.

tauExp <- calcTau(qnExp) 
head(tauExp[,1:5], n=5)

The output of 'calcTau' now begins with an extra column containing the tau value of each gene, while each tissue column is now populated with tau expression fractions.

Biomart Annotation {#getMart}

Next, using the biomaRt package, the gene name and type are added to each ensembl gene ID.

tauAnno <- getMart(x = 'mouse', y = 79, z = tauExp)
head(tauAnno[,1:5], n=5)

The dataframe output now contains all the information required for easy user interaction.

Additional Useful Functions

Specific gene distribution

plotDensity(tauAnno) 

Notice the similarity between the shape of this plot and tau plotted in Figure 1 with a peak at tau>0.95. The plots are not highly similar because here a much reduced dataset is plotted (~9k genes only). Expect a distribution similar to Figure 1 when all annotated genes are included (~40k genes).

Specific gene counts

asg <- getDist(tauAnno, 1)
head(asg, n = 5)

hsg <- getDist(tauAnno, 0.85)
head(hsg, n = 5)
plotDist(tauAnno)

The distribution plot identifies tissueM and tissueO as the tissues with the highest number of absolutely specific genes (ASGs, tau = 1) and highly tissue specific genes (HSGs, tau >= 0.85)

Study a tissue of interest

Extract your tissue

tissueA <- getTissue('tissueA', qnExp, tauAnno)
head(tissueA[, -2], n = 5)

The function getTissue retrieves the quantile normalised expression and tissue specificity of every gene in a named tissue and uses this information to create a score column. Each gene's score is between 0 and 2, and is the sum of it's tau expression fraction value and it's 0-1 ranged normalised expression value. The gene with both the highest expression and specificity recieves the highest score.

Notice Scml2 is a highly specific gene (tau 0.964) but not for tissueA (frac 0.000) and is not even expressed in tissueA (qn 0.000). To identify for which tissue Scml2 is specific requires the plotGene function. However, organising the data like this allows multiple other useful functions: getOptimum, plotCorr, and getControls.

Get a highly specific, highly expressed, optimal gene set {#getOptimum}

Ranking of genes by both expression and specificity is useful for anyone working on a single tissue wanting to identify a set of genes 1) that are highly specific to the tissue, 2) that are expressed in high enough quantities (facilitates bench work in the laboratory), and 3) with minimal expression in other tissues (limits off target effects)

optimum <- getOptimum(tauAnno, tissueA, 5)
optimum$dataframe[, -2]

optimum$barplot

The plot shows the top 10 optimal genes for tissueA, and their specificity for only the tissues in which they are expressed. Tial1 is absolutely specific for tissueA (frac 1.000) while Col4a3 is highly specific for tissueA (frac 0.943). Both genes are highly expressed in tissueA, but Col4a3 is also minimally expressed in tissueI (Use plotGene to confirm). The plot shows all other tissues in which those genes are expressed so as to facilitate prediction of potential off target affect of any biological experiments (e.g. CRISPR).

Extract the gene expression values to see the high expression in tissueA, and minimal expression elsewhere:

meanExp[c(rownames(optimum$dataframe)),]

Plot correlation between specificity and expression {#plotCorr}

Visualising the relationship between specificity and correlation in a tissue helps us conceptualise the nature of tissue specific genes in a tissue. Also, with this function a gene of interest can be highlighted to see how it compares to the rest of the genes in the tissue in terms of expression and specificity.

corrPlots <- plotCorr(tissueA, c('Col4a3', 'Mboat7'))
corrPlots$tauPlot

By plotting the tau value against expression we see that as the specificity of a gene for ANY tissue increases, the expression of the gene in tissueA decreases. There is a negative correlation (r = -0.58) as reflected in the trendline. However, many tissueA absolutely specific genes (orange) still have relatively high expression in tissueA. We also see how genes of interest passed by the user are highlighted (pink): both are highly expressed and highly specific, although from this plot it is not known for which tissue they are specific because only the tau value was used.

corrPlots$fracPlot

By plotting the tau expression value against expression the trendline now shows a positive correlation (r = 0.52) suggesting that as specificity of a gene for tissueA increases, so does expression of the gene in tissueA. As the linear model may be underfitting, I also used a generalized additive model to plot the trendline (not shown) which revealed an initial strong positive correlation that plateaus at 0.1 specificity suggesting that specificity above 0.1tef is a poor indicator of expression, and vice versa. This is clearly illustrated by the wide range of expression of genes that are absolutely specific (orange) or highly specific (yellow) for tissueA. From this plot we can now see that of both genes passed by the user, only Col4a3 is highly specific to tissueA because the tau expression fractions for tissueA were used. To extract genes that are both highly specific and highly expressed use getOptimum. Try passing the results of getOptimum to this function: in what area of the scatter plot do you think those genes will be plotted?

Get a set of control genes {#getControls}

This function is useful for differentiating between two tissues. For example, to determine if a small sample of suprachiasmtic nucleus brain tissue has been contaminated with cortex brain tissue during tissue dissection. Performing qPCR on the top ranked genes returned by the function will confirm/reject contamination.

controls <- getControls(tissueA, tissueB)

head(controls$tissueA[, -2], n = 1)

head(controls$tissueB[, -2], n = 1)

In the example shown, we are trying to determine if tissueA has been contaminated with tissueB. The 2 top ranked genes returned are specific for, and highly expressed in, their respective tissue but also have less than 0.1 quantile normalised expression in the other tissue. So performing qPCR on tissueA to identify the amount of Arl11 (positive control) and Mboat7 (negative control), will confirm if tissueA has been contaminated with tissueB.

Study a gene of interest {#plotGene}

Quite often a researcher works on a specific gene rather than a tissue and may find this function useful to identify the specificity of a single gene for any and all tissues. This function can also be used to confirm the results from other functions in this package e.g. getOptimum.

plotGene(tauAnno, 'Mboat7') # gene name is case sensitive

This function is simply plotting the gene from the result of getMart:

subset(tauAnno, tauAnno$external_gene_name == 'Mboat7')

Viewing the input expression values of the gene confirms the plot output:

id <- rownames(subset(tauAnno, tauAnno$external_gene_name == 'Mboat7'))
round(meanExp[id, ], digits = 3)


roonysgalbi/tispec documentation built on May 26, 2019, 1:33 a.m.