In roonysgalbi/tispec: Calculates tissue specificity from RNA-seq data

library(knitr)
library(tispec)
knitr::opts_chunk$set(
    out.width = "600px", # sets image sizes
    fig.align = "center",
    collapse = TRUE,
    comment = ">"
)

wd <- paste(getwd(), '/images/', sep = '') # should work on any machine

Introduction

This package is an adaptable and efficient implementation of the tau specificity algorithm^[Yanai, I., Benjamin, H., Shmoish, M., Chalifa-Caspi, V., Shklar, M., Ophir, R., Bar-Even, A., Horn-Saban, S., Safran, M., Domany, E., Lancet, D., Shmueli, O., 2005. Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics (Oxford, England). 21, 650-659.] benchmarked in 2016 by Kryuchkova-Mostacci & Robinson-Rechavi^[Kryuchkova-Mostacci, N. & Robinson-Rechavi, M., 2016. A benchmark of gene expression tissue-specificity metrics. Briefings in Bioinformatics.]. To utilise this downstream RNA-seq analysis step, the user first runs their own RNA-seq analysis pipeline and normalisation steps to produce a set of counts (e.g. FPKM, RPKM, TPM) for each gene in each tissue according to the user preference. Many algorithms have been developed for measuring tissue specificity, and they all fall into 1 of 2 categories:

Algorithms that reduce the tissue specificity of a gene to a single value as a global measure of the gene’s specificity
Algorithms that reduce the tissue specificity of a gene to a set of values, 1 per tissue, as a profile of the specificity of a gene for each tissue

The tau specificity algorithm was implemented in this package because:

Tau is consistently robust
Tau results between different data set sizes show the highest correlation
Tau does not show bias toward finding housekeeping genes
Tau finds the greatest number of tissue specific genes
Enrichment analysis shows that the functions of those tissue specific genes found only by tau are consistent with the functions of the tissues in which they are found

Using both human and mouse RNA-seq data, the benchmark paper compared 9 algorithms:

Raw counts
Tau^[Yanai, I., Benjamin, H., Shmoish, M., Chalifa-Caspi, V., Shklar, M., Ophir, R., Bar-Even, A., Horn-Saban, S., Safran, M., Domany, E., Lancet, D., Shmueli, O., 2005. Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics (Oxford, England). 21, 650-659.]
Gini coefficient^[Lidia Ceriani & Paolo Verme, 2012. "The origins of the Gini index: extracts from Variabilità e Mutabilità (1912) by Corrado Gini," The Journal of Economic Inequality, Springer;Society for the Study of Economic Inequality, vol. 10(3), pages 421-443, September.]
Tissue specificity index (TSI)^[Julien, P., Brawand, D., Soumillon, M., Necsulea, A., Liechti, A., Schutz, F., Daish, T., Grutzner, F., Kaessmann, H., 2012. Mechanisms and evolutionary patterns of mammalian and avian dosage compensation. PLoS Biology. 10, e1001328.]
Shannon entropy (Hg)^[Schug, J., Schuller, W.P., Kappen, C., Salbaum, J.M., Bucan, M., Stoeckert, C.J.,Jr, 2005. Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biology. 6, R33.]
Z-score^[Vandenbon, A. & Nakai, K., 2010. Modeling tissue-specific structural patterns in human and mouse promoters. Nucleic Acids Research. 38, 17-25.]
Specificity measure (SPM)^[Xiao, S.J., Zhang, C., Zou, Q., Ji, Z.L., 2010. TiSGeD: a database for tissue-specific genes. Bioinformatics (Oxford, England). 26, 1273-1275.]
Expression enrichment (EE)^[Yu, X., Lin, J., Zack, D.J., Qian, J., 2006. Computational analysis of tissue-specific combinatorial gene regulation: predicting interaction between transcription factors in human tissues. Nucleic Acids Research. 34, 4925-4936.]
Preferential expression measure (PEM)^[Huminiecki, L., Lloyd, A.T., Wolfe, K.H., 2003. Congruence of tissue expression profiles from Gene Expression Atlas, SAGEmap and TissueInfo databases. BMC Genomics. 4, 31.].

The results of the benchmark paper (fig. 1) show that every algorithm produces a bimodal distribution identifying 3 clear groups:

0-0.2 tissue specificity (non/low specificity, housekeeping genes)
0.2-0.8 tissue specificity (intermediate specificity)
0.8-1.0 tissue specificity (high/absolute specificity)

The authors identified the best performing algorithms as tau, gini and counts. However the authors state that counts is only robust if the correct expression threshold/cut-off is chosen and gini doesn’t normalise for expression amplitude between tissues, while tau normalises for amplitude between tissues and also includes the number of tissues as part of the algorithm.

include_graphics(paste(wd,'benchmark.png',sep = ''))

The key conclusions of the benchmark were as follows:

Tau showed the strongest correlation between results regardless of tissue numbers included
Of the 9 algorithms compared in the benchmark paper, only tau avoided a bias toward classifying most genes as housekeeping and as a possible consequence, tau identified the most tissue specific genes including genes not detected by any other algorithm
Gene Ontology enrichment suggests the extra tissue specific genes found by tau were due to false negatives in other algorithms rather than false positives in tau
Additionally, the authors compared the effect of calculating tissue specificity using RPKM values or log transformed values and found:
- Using RPKM detects more tissue specificity to the extent that the 0-0.2 tissue specificity group (housekeeping genes) may be lost completely
- Using RPKM reduces correlation between both between tissue subsets and between species.

Workflow

include_graphics(paste(wd,'workflow.png',sep = ''))

The workflow (fig. 1) first log transforms the expression values provided by the user. To make cross-tissue comparisons possible, quantile normalizing is performed on the entire dataframe, after which each gene is given a bin value between 0-10 for each tissue. The higher the bin value, the greater the expression. The list of bin values of a gene is then passed to the tau algorithm to calculate it's specificity for any tissue. Finally, the specificity of each gene for each individual tissue is calculated (tau expression fraction, tef).

Input

Users run their preferred RNA-seq analysis pipeline to calculate a single expression value for each gene in each tissue (e.g. mean TPM). Currently the package supports data from mouse, human and macaque. All genes should named by their Ensembl Gene IDs (e.g. ENSMUSG00000066232). This is to allow biomaRt annotation later in the workflow. Take care to annotate all genes using the same ensembl version (all versions from ensembl 79 onward are supported). This is important to ensure no genes are excluded from the pipeline due to discontinued Ensembl Gene IDs. The package takes as input a single dataframe where the first column is the Ensembl Gene ID, each subsequent column header is a tissue name, and the cells contain the user defined expression values of all genes in all tissues.

library(tispec)
head(meanExp[,1:5], n=5)

checkInput(meanExp[,1:5])

The checkInput function evaluates the cells of the dataframe to ensure it is in the correct format i.e. The function returns errors when it finds cells that contain character strings, NAs, negative values or infinite values. In the example above, only the first five tissues are checked. The number of incorrectly formatted cells are returned for each tissue.

Normalisation

For each tissue independently, the input data should be normalised so that differences in expression between 1 gene and another within the tissue can be compared while reducing the effect of outliers on the output. A pseudo count is added to log2 transform the data, and a threshold of log2(0) is set. This means that all genes with input values below 1.0 will be defined as non-expressed. All genes not expressed in any tissue are then removed from the data. Finally, to allow comparison of gene expression across different tissues that may come from different experiments/laboratories, the log2 transformed dataframe is quantile normalised which makes the distribution of data in each tissue statistically identical to each other. Remember, now the expression values are on a log scale, so +1 difference represents a 10x difference in expression.

log2Exp <- log2Tran(meanExp)

head(log2Exp[,1:5], n=5)

qnExp <- quantNorm(log2Exp)

head(qnExp[,1:5], n=5)

Calculate Tissue Specificity

After normalisation, implementation of the tau specificity algorithm assigns each gene a tau value between 0-1.

Absolutely specific genes (ASGs):
- Genes expressed in a single tissue only
- Defined by a tau value of 1
Highly specific genes (HSGs):
- Genes with relatively highly enriched expression in a few tissues
- Defined by a tau value of at least 0.85.
Housekeeping genes:
- Genes expressed in many (or all) tissues with no enriched expression
- Defined by a tau value of at most 0.2.

The tau value only defines the "specificity" of a gene. To determine which tissue the gene is actually specific for, requires calculation of tau expression fractions (tef). As can be seen in the worked example in figure 2, gene X is a HSG with a tau value of 0.95. However, calculating tef identifies that Gene X is highly specific for tissue B (0.95 tef), with additional but much lower specificity for tissue C (0.24 tef). Targetting gene X in tissue B, may produce off-target effects in tissue C.

tauExp <- calcTau(qnExp)

head(tauExp[,1:5], n=5)

The output of 'calcTau' now begins with an extra column containing the tau value of each gene, while each tissue column is now populated with tau expression fractions.

Biomart Annotation {#getMart}

Next, using the biomaRt package, the gene name and type are added to each ensembl gene ID.

tauAnno <- getMart(x = 'mouse', y = 79, z = tauExp)

head(tauAnno[,1:5], n=5)

The dataframe output now contains all the information required for easy user interaction.

Additional Useful Functions

Specific gene distribution

plotDensity(tauAnno)

Notice the similarity between the shape of this plot and tau plotted in Figure 1 with a peak at tau>0.95. The plots are not highly similar because here a much reduced dataset is plotted (~9k genes only). Expect a distribution similar to Figure 1 when all annotated genes are included (~40k genes).

Specific gene counts

asg <- getDist(tauAnno, 1)
head(asg, n = 5)

hsg <- getDist(tauAnno, 0.85)
head(hsg, n = 5)

plotDist(tauAnno)

The distribution plot identifies tissueM and tissueO as the tissues with the highest number of absolutely specific genes (ASGs, tau = 1) and highly tissue specific genes (HSGs, tau >= 0.85)

Study a tissue of interest

Extract your tissue

tissueA <- getTissue('tissueA', qnExp, tauAnno)

head(tissueA[, -2], n = 5)

The function getTissue retrieves the quantile normalised expression and tissue specificity of every gene in a named tissue and uses this information to create a score column. Each gene's score is between 0 and 2, and is the sum of it's tau expression fraction value and it's 0-1 ranged normalised expression value. The gene with both the highest expression and specificity recieves the highest score.

Notice Scml2 is a highly specific gene (tau 0.964) but not for tissueA (frac 0.000) and is not even expressed in tissueA (qn 0.000). To identify for which tissue Scml2 is specific requires the plotGene function. However, organising the data like this allows multiple other useful functions: getOptimum, plotCorr, and getControls.

Get a highly specific, highly expressed, optimal gene set {#getOptimum}

Ranking of genes by both expression and specificity is useful for anyone working on a single tissue wanting to identify a set of genes 1) that are highly specific to the tissue, 2) that are expressed in high enough quantities (facilitates bench work in the laboratory), and 3) with minimal expression in other tissues (limits off target effects)

optimum <- getOptimum(tauAnno, tissueA, 5)
optimum$dataframe[, -2]

optimum$barplot

The plot shows the top 10 optimal genes for tissueA, and their specificity for only the tissues in which they are expressed. Tial1 is absolutely specific for tissueA (frac 1.000) while Col4a3 is highly specific for tissueA (frac 0.943). Both genes are highly expressed in tissueA, but Col4a3 is also minimally expressed in tissueI (Use plotGene to confirm). The plot shows all other tissues in which those genes are expressed so as to facilitate prediction of potential off target affect of any biological experiments (e.g. CRISPR).

Extract the gene expression values to see the high expression in tissueA, and minimal expression elsewhere:

meanExp[c(rownames(optimum$dataframe)),]

Plot correlation between specificity and expression {#plotCorr}

Visualising the relationship between specificity and correlation in a tissue helps us conceptualise the nature of tissue specific genes in a tissue. Also, with this function a gene of interest can be highlighted to see how it compares to the rest of the genes in the tissue in terms of expression and specificity.

corrPlots <- plotCorr(tissueA, c('Col4a3', 'Mboat7'))
corrPlots$tauPlot

By plotting the tau value against expression we see that as the specificity of a gene for ANY tissue increases, the expression of the gene in tissueA decreases. There is a negative correlation (r = -0.58) as reflected in the trendline. However, many tissueA absolutely specific genes (orange) still have relatively high expression in tissueA. We also see how genes of interest passed by the user are highlighted (pink): both are highly expressed and highly specific, although from this plot it is not known for which tissue they are specific because only the tau value was used.

corrPlots$fracPlot

By plotting the tau expression value against expression the trendline now shows a positive correlation (r = 0.52) suggesting that as specificity of a gene for tissueA increases, so does expression of the gene in tissueA. As the linear model may be underfitting, I also used a generalized additive model to plot the trendline (not shown) which revealed an initial strong positive correlation that plateaus at 0.1 specificity suggesting that specificity above 0.1tef is a poor indicator of expression, and vice versa. This is clearly illustrated by the wide range of expression of genes that are absolutely specific (orange) or highly specific (yellow) for tissueA. From this plot we can now see that of both genes passed by the user, only Col4a3 is highly specific to tissueA because the tau expression fractions for tissueA were used. To extract genes that are both highly specific and highly expressed use getOptimum. Try passing the results of getOptimum to this function: in what area of the scatter plot do you think those genes will be plotted?

Get a set of control genes {#getControls}

This function is useful for differentiating between two tissues. For example, to determine if a small sample of suprachiasmtic nucleus brain tissue has been contaminated with cortex brain tissue during tissue dissection. Performing qPCR on the top ranked genes returned by the function will confirm/reject contamination.

controls <- getControls(tissueA, tissueB)

head(controls$tissueA[, -2], n = 1)

head(controls$tissueB[, -2], n = 1)

In the example shown, we are trying to determine if tissueA has been contaminated with tissueB. The 2 top ranked genes returned are specific for, and highly expressed in, their respective tissue but also have less than 0.1 quantile normalised expression in the other tissue. So performing qPCR on tissueA to identify the amount of Arl11 (positive control) and Mboat7 (negative control), will confirm if tissueA has been contaminated with tissueB.

Study a gene of interest {#plotGene}

Quite often a researcher works on a specific gene rather than a tissue and may find this function useful to identify the specificity of a single gene for any and all tissues. This function can also be used to confirm the results from other functions in this package e.g. getOptimum.

plotGene(tauAnno, 'Mboat7') # gene name is case sensitive

This function is simply plotting the gene from the result of getMart:

subset(tauAnno, tauAnno$external_gene_name == 'Mboat7')

Viewing the input expression values of the gene confirms the plot output:

id <- rownames(subset(tauAnno, tauAnno$external_gene_name == 'Mboat7'))
round(meanExp[id, ], digits = 3)

roonysgalbi/tispec documentation built on May 26, 2019, 1:33 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

roonysgalbi/tispec
Calculates tissue specificity from RNA-seq data

In roonysgalbi/tispec: Calculates tissue specificity from RNA-seq data

Introduction

Workflow

Input

Normalisation

Calculate Tissue Specificity

Biomart Annotation {#getMart}

Additional Useful Functions

Specific gene distribution

Specific gene counts

Study a tissue of interest

Extract your tissue

Get a highly specific, highly expressed, optimal gene set {#getOptimum}

Plot correlation between specificity and expression {#plotCorr}

Get a set of control genes {#getControls}

Study a gene of interest {#plotGene}

R Package Documentation

Browse R Packages

We want your feedback!

roonysgalbi/tispec Calculates tissue specificity from RNA-seq data

In roonysgalbi/tispec: Calculates tissue specificity from RNA-seq data

Introduction

Workflow

Input

Normalisation

Calculate Tissue Specificity

Biomart Annotation {#getMart}

Additional Useful Functions

Specific gene distribution

Specific gene counts

Study a tissue of interest

Extract your tissue

Get a highly specific, highly expressed, optimal gene set {#getOptimum}

Plot correlation between specificity and expression {#plotCorr}

Get a set of control genes {#getControls}

Study a gene of interest {#plotGene}

R Package Documentation

Browse R Packages

We want your feedback!

roonysgalbi/tispec
Calculates tissue specificity from RNA-seq data