# Default RMarkdown options
# Changing these will change them for all chunks unless otherwise set
knitr::opts_chunk$set(
                        autodep        = TRUE,
                        cache          = TRUE,
                        cache.comments = TRUE,
                        collapse       = TRUE,
                        comment        = "##",
                        #dev            = "png",
                        echo           = TRUE,
                        error          = FALSE,
                        fig.width      = 7,
                        fig.height     = 5,
                        fig.align      = "center",
                        highlight      = TRUE,
                        include        = TRUE,
                        message        = FALSE,
                        prompt         = FALSE,
                        results        = "markup",
                        size           = "normalsize",
                        strip.white    = TRUE,
                        tidy           = FALSE,
                        tidy.opts      = NULL,
                        warning        = FALSE
               )
![RNAtools](logo.png)

Welcome!

Welcome to RNAtools! Many studies across the life sciences now make use of RNA-seq to investigate gene products. These experiments can provide high-quality measurements of gene expression across a wide dynamic range, and in contrast to microarray experiments, do not require probes to be designed for specific regions.

Perhaps the most common analysis pipeline for RNA-seq data is to perform quality control, align reads to a reference genome, summarise reads against a known annotation and perform differential expression testing. The aim of this process is to identify genes or transcripts that have significantly different activity levels under some treatment condition (disease, drug, knockout, environment...) compared to a control. Further analysis attempts to investigate the biological implications of these changes in regulation.

Multiple methods exist for each of these steps. This package takes a matrix of summarised counts for a group of samples and provides a unified interface for differential expression testing (using edgeR, DESeq, DESeq2 and limma-voom). The aim is allow simultaneous analysis using multiple packages including robust quality control, easily comparable visulisations and combination of results. The flow diagram below shows the basic workflow for RNAtools.

RNAtools workflow

Before we get started let's load the RNAtools package:

library("RNAtools")

Data

For this vignette we are going to make use of data from Sultan et al. available from the HTSFilter package. This dataset contains two biological replicates from a human embryonic kidney and B cell line. The raw read count and phenotype table are available as part of the ReCount online resource.

We first load HTSFilter, then we can start by attaching the data in sultan. Other information is available but we are just interested in the matrix of counts and the groups for each sample:

library("HTSFilter")

data("sultan")

counts <- exprs(sultan)
groups <- pData(sultan)$cell.line

rm(sultan)

groups

head(counts)

The matrix of counts and the factor specifying the group of each sample are going to be the main input for our analysis. The groups factor should always be ordered so that the reference group is the lowest level.

Exploration

Before starting our analysis we should begin by exploring our data. This allows us to spot anything that is obviously wrong and provides a baseline we can make comparisons to later.

Raw Counts

Let's have a look at the raw count density.

countDensity(counts)

Most of the plotting functions in RNAtools return ggplot2 objects, with the exception of those that produce Venn diagrams. These can be easily modified, for example if we prefer the black and white theme and would like to add a title:

countDensity(counts) + ggplot2::theme_bw() + ggplot2::ggtitle("Count Densities")

Here we can see that the distribution is both over a large range and heavily skewed towards low values. This can be corrected by using a transformed verion of the counts.

Transformed Counts

The simplest transformation is to take a log of the counts ($K$), after adding one to avoid log of zero.

$$ \begin{aligned} y = log_2(K + 1) \end{aligned} $$

More sophisted transformations are provided by the differential expression packages such as the Variance Stabilising Transformation (DESeq), regularised-log (DESeq2) and log Counts Per Million (edgeR). RNAtools is capable of performing each of these.

transformed <- transformCounts(counts,
                               methods = c("log", "vst", "rlog", "logCPM"))

names(transformed)

head(transformed$log)

The result of this function is a list of matrices where each is transformed using one of the specifed methods. We can now use the listDensity function to plot each of these. Again the result is a list but now each item is a ggplot2 object containing a density plot. An additional item has been added which holds the combined plot.

densities <- listDensity(transformed)

names(densities)

densities$combined

Boxplots

Similar functions exist for producing boxplots.

boxplots <- listBoxplots(transformed)

boxplots$combined

Both sets of plots can be helpful for exploring a dataset. The boxplots clearly show the expression quartiles and the presence of outlier genes. The density plots give more detail about the distribution, for example showing any bi-modality. It is also possible to see the effect of DESeq2's regularised log transformation as it removes the peak at lower expression levels.

MA Plots

Pairwise MA plots show the difference in expression for genes in two samples compared to the average expression of the two samples.

ma.plots <- listCountMA(transformed)

ma.plots$combined

Here we can see the reproducability between samples and whether normalisation is needed. Genes that are similar in the two samples appear around the $y = 0$ line (blue) and a lowess fit (red) shows trends in bias related to mean expression.

With more samples the numbers of these plot grows quickly and it may be more useful to select a single transformation or some samples of interest.

countMA(transformed$log)

For this dataset the results look pretty good. The plot are centered around $y = 0$ and the samples from the same cell-types have reduced variance.

Heatmaps

To explore the similarites and dissimilarities between samples looking at a clustered heatmap of the between sample distances can be helpful. The countHeatmap function (and the related listHeatmaps) can be used to produce these plots. By providing the vector of groups these can be included as labels on the y-axis. As dendrograms indicating the clustering are included the result is a list of ggplot2 objects instead of a single plot. The showHeatmap function can assemble these into a single image.

heatmap <- countHeatmap(transformed$log, groups = groups)

names(heatmap)

showHeatmap(heatmap)

PCA and MDS

Principal component analysis (PCA) can also be useful for visualising the similarities between samples, in particular identifying outliers and batch effects. PCA reduces the dimensionality of a dataset to identify the directions with greatest variance. Plotting samples onto the reduced dimensions allows quick comparisons.

PCA <- listPCA(transformed, top = 500, groups = groups)

PCA$combined

Since PCA can be problematic with very high-dimensional data we select the top 500 most variable genes, and provide groups to colour the labels. As we are looking for differentially expressed genes we hope to see a clear separation between the two groups, as is the case along PC1.

Multi-dimensional scaling (MDS) provides an alternative to PCA that aims to visually represent the distances amongst a set of samples. Genes can be selected based on "common" deviations across all samples (similar to what was done with PCA) or "pairwise" between each pair of samples.

MDS <- listMDS(transformed, top = 500, group = groups, selection = "pairwise")

MDS$combined

In this simple example the PCA and MDS plots are highly similar but with more complex datasets it may be worthwhile examining both.

Objects

Each differential expression package makes use of its own objects. A function is provided to easily convert a matrix of counts into the appropriate objects for multiple packages, returned as a list.

objects <- counts2Objects(counts, groups, filter = TRUE,
                          methods = c("edgeR", "DESeq", "DESeq2", "voom"))

names(objects)

for(object in objects) {
    print(class(object))
}

NOTE: It is important that the names of this list are not changed as they will be used by following functions to identify which method to apply.

A Note on Filtering

Many genes in any RNA-seq dataset are uninformative simply because they have zero or very few reads. These genes don't affect the results immediately as they will the differential expression tests, however they can affect how other genes are treated. More genes means more tests which affects the procedures used to correct for multiple testing. By removing low-expression genes we reduce the number of tests and possibly increase the number of genes found to be significantly differentially expressed.

The HTSFilter package provides a method for independently filtering genes in datasets with biological replicates. There are several stages of the process where filtering could theoretically be applied. By setting the filter option in counts2objects and related functions we conduct filtering when suggested by the HTSFilter authors. Unfortunately the appropriate stage is different for each method. For example for voom no methods for normalised datasets exist, so the most appropriate stage is when constructing objects from the raw counts.

nrow(counts)

nrow(objects$voom$counts)

HTSFilter can produce a plot that shows how the filtering threshold is chosen. Basically it attempts to find that a value that removes the same number of genes from each group of samples. To avoid cluttering the analysis this plot is suppresed but a message showing the chosen threshold is printed to the console.

If you prefer not to use HTSFilter (for example if you don't have biological replicates) the authors of edgeR suggest an alternative approach where genes are removed if they have CPM less than one in at least the number of samples as the size of the smallest group. In addition DESeq2 has it's own independent filtering procedure. If filter is set to TRUE at the testing stage this will be turned off as HTSFilter is applied at the normalisation stage.

Normalisation

Each package has its own normalisation procedures. Here we use the listNorm function to normalise our list of objects.

normalised <- listNorm(objects)

Please read the documentation for the packages you plan to use to properly undstand what is been done in this process.

Testing

Once the data has been normalised we can test for differential expression. We first need to assign the groups we would like to test as group1, the reference or control, and group2 the treatment. The control group should always be whichever group was defined as the lowest level of the groups factor. Again a list function allows us to perform all methods simultaneously.

group1 <- "HEK293T"
group2 <- "Ramos B cell"

tested <- listTest(normalised, group1, group2, filter = TRUE)

Each method produces results in it's own format. In order to make them easier to combine we convert them to a regular format which can be used by plotting and other functions.

results <- listRegularise(tested)

head(results[[1]])

This format has five columns:

  1. Gene - the gene name or identifier
  2. FoldChange - the estimate of log fold-change
  3. Abundance - the estimate of average expression across samples
  4. Significance - the adjusted p-value
  5. pValue - the unadjusted p-value

P-values

Before we examine the results we should first inspect the (unadjusted) p-values. The distribution of p-values should consist of a uniform distribution according to the null hypothesis and a peak at low p-values for significant results that reject the null. We can easily look at the distribution of our results.

pval.plots <- listPlotPvals(results, alpha = 0.05)

pval.plots$combined

Our results seem to have been computed correctly. We see a consistent rectangle across most p-values with significant peaks at low values suggesting that there are many differently expressed genes. Here the shaded red area shows p-values less than 0.05. If you were to see a different pattern, such as hill or u-shaped that may suggest a problem with the assumed variance of the null distribution. For methods such as DESeq2 and voom that report test statistics as well as p-values this can be corrected using the package fdrtool. From here on we only discuss p-values corrected for multiple testing.

MA Plots

We can now produce MA plots of our results.

results.ma <- listResultsMA(results, alpha = 0.05)

results.ma$combined

Here we plot log average expression level against the log fold-change estimated by each method. The genes with adjusted p-values greater than $alpha$ are coloured in red, the blue line indicates log fold-change equal to zero and the shaded area log fold-change with magnitude less than two.

The individual packages have their own functions for creating these plots. In particular edgeR's plotSmear may be worthwhile examining as it adds some additional features.

genes.de.names <- rownames(tested$edgeR)[tested$edgeR$FDR < 0.05]
edgeR::plotSmear(normalised$edgeR, de.tags = genes.de.names)

Volcano Plots

An alternative to MA plots are volcano plots which show log fold-change against significance. The blue shaded area shows absolute log fold-changes less than two and the red area significance levels less than 0.05. Grey points indicate genes that returned significance of zero (top), or were missing p-values (bottom).

volcanos <- listVolcano(results)

volcanos$combined

Voom often returns much lower significance levels so it can be preferable to look at that plot separately.

volcanos$voom

Combining Results

Once we are satisfied that the results from each method look good we can begin to compare them. The Jaccard index can be used to compare two lists and is calculated as the number of items in the intersection divided by the number in the union.

jaccardTable(results, alpha = 0.05)

From the number of genes identified by the methods we can see that edgeR and voom have called more genes as significantly differentially expressed. The indices suggest that the two versions of DESeq are more similar to each other than the other two methods. Another way to view the lists is via a Venn diagram. Unfortunately ggplot2 is not (easily) capable of these so we make use of another excellent package, VennDiagram, which returns a grid graphics object.

venn <- geneVenn(results, alpha = 0.05)

plot.new()
grid::grid.draw(venn)
text(0.5, 1, "Comparison of Lists", vfont = c("serif", "bold"), cex = 2)
text(0.5, 0, "Number of Differentially Expressed Genes",
     vfont = c("serif", "plain"), cex = 1.5)

We can see that the majority of genes were identifed by all methods and that those identified by DESeq and DESeq2 are almost a subset of those from the other methods.

To extract the lists of genes from each segment of the Venn diagram we can use the vennGenes function. The result is a list of vectors containing gene names. The name of each item in the list describes the segment and the vector contains the genes that are in that segment but not anywhere else. The lengths should be the same as the numbers in the Venn diagram.

venn.genes <- vennGenes(results, alpha = 0.05)

sapply(venn.genes, length)

The segment lists can be combined to get whichever regions are of interest. For example if we wanted the total list of genes.

all.genes <- Reduce(union, venn.genes)

length(all.genes)

We can also summarise the results by gene. This tells us the mean and variance for log fold-change, p-value and adjusted significance as well as the number of methods that found the gene to be differentially expressed at a given threshold. Columns beginning with 'f' are finite summaries, ignoring infinite values produced by some of the methods.

gene.summ <- geneSummary(results, alpha = 0.05)

head(data.frame(gene.summ))

As well as providing a summary this table gives a convenient way to filter out genes of interest. For example we can select those that were identified as differentially expressed by two or more methods and are up-regulated.

gene.summ.de <- gene.summ[gene.summ$DECount >= 2, ]
gene.summ.de.up <- gene.summ.de[gene.summ.de$meanFC >= 0, ]

head(data.frame(gene.summ.de.up))

Gene Sets

Often there are sets of genes which are of particular interest such as KEGG pathways or those known to be associated with a particular process, response or condition. RNAtools provides functions for easily extacting and summarising such sets. Here we give an example using six randomly selected sets. We also need to supply the results from differential expression testing and a vector of selected differentially expressed genes.

set.seed(1)
gene.sets <- list(Set1 = sample(gene.summ$Gene, 23),
                  Set2 = sample(gene.summ$Gene, 47),
                  Set3 = sample(gene.summ$Gene, 86),
                  Set4 = sample(gene.summ$Gene, 63),
                  Set5 = sample(gene.summ$Gene, 111),
                  Set6 = sample(gene.summ$Gene, 59))

setTable(results, de.set = gene.summ.de$Gene, gene.sets = gene.sets)

If we are interested in specifically which genes are up or down-regulated in a set we can examine a plot of fold change-estimates. The estimates from each method are shown with different shapes coloured from purple (lowly significant) to red (highly significant). The mean is shown as orange stars, and the shaded area indicates log fold-change from -2 to 2. Let's look at the genes that are differentially expressed by at least two methods from Set2.

plotFoldChange(data.list = results,
               gene.set = intersect(gene.sets$Set2, gene.summ.de$Gene))

For most genes we can see that the fold-change estimates are similar, however this is not the case at the extreme fold-change levels. Here the spread is wider and DESeq produces infinite estimates. This effect is likely due to these genes having very low or zero counts in one of the conditions. We can confirm this by looking at the counts of the genes with lowest and highest fold-changes.

counts[c("ENSG00000134802", "ENSG00000131094"), ]

Quick Analysis

If you are just in interested in quickly getting results here is an example of the steps using the Pasilla dataset. This section requires you to install the pasilla and Biobase packages.

if (requireNamespace("pasilla", quietly = TRUE) &&
        requireNamespace("Biobase", quietly = TRUE)) {

    data("pasillaGenes", package = "pasilla")

    group1 <- "untreated"
    group2 <- "treated"

    counts <- counts(pasillaGenes)
    groups <- factor(pData(pasillaGenes)[, c("condition")])
    groups <- relevel(groups, ref = "untreated")

    objects <- counts2Objects(counts, groups = groups, 
                              methods = c("edgeR", "DESeq", "DESeq2", "voom"),
                              filter = TRUE)
    normalised <- listNorm(objects)
    tested <- listTest(normalised, group1 = group1, group2 = group2,
                       filter = TRUE)
    results <- listRegularise(tested)

    venn <- geneVenn(results, alpha = 0.05)

    plot.new()
    grid::grid.draw(venn)
    text(0.5, 1, "Comparison of Lists", vfont = c("serif", "bold"), cex = 2)
    text(0.5, 0, "Number of Differentially Expressed Genes",
         vfont = c("serif", "plain"), cex = 1.5)
}

Session Info

sessionInfo()


lazappi/RNAtools documentation built on May 20, 2019, 8:26 p.m.