In this vignette we will use package ExpCube to explore a small dataset of 16 samples. We will aim to learn about how sequencing depth affects performance of differential expression analysis. We will start with raw expression data and finish with a visualization.
We will use a dataset embedded into the package that was prepared as follows. Expression profiles of HAP1 cells were first measured using the QuantSeq protocol in 24 shallow replicates. This data was cleaned, pooled together, and then split into 16 bins to create synthetic samples mimicking sequencing runs of various depth (number of reads). Reads in each bin were aligned using GSNAP and expression quantified using Exp3p.
We first need to load the package (if you need to install the package, uncomment the first two lines)
## library("devtools") ## install_github("tkonopka/ExpCube") library("ExpCube")
We will also need some other packages (from github and CRAN)
library("Rcssplot") library("data.table")
We will need two preliminary tables, which we will load from csv/tsv files stored in the extdata
directory associated with the package. The filenames are
summary.file = system.file("extdata", "PlateSeries.longsummary.csv", package="ExpCube") readcounts.file = system.file("extdata", "PlateSeries.readcounts.tsv", package="ExpCube")
First, let's look at the sample definitions
series.summary = E3LoadPlateConfig(summary.file) samplenames = series.summary[,"Sample"]
Here, the E3LoadPlateConfig
is just a wrapper for read.table
, but it is convenient if you've got multiple configuration files.
Next, let's get an object linking each sample to a sequencing depth (number of reads), and merge this with the previous table,
readcounts = read.table(readcounts.file, header=T, stringsAsFactors=F) series.summary = merge(series.summary, readcounts, all.x =TRUE) rownames(series.summary) = series.summary[,"Sample"] series.summary = series.summary[samplenames,]
Here is a preview of the result
head(series.summary[, c("Sample", "Plate.96well", "Reads")], n=4)
(We will not need the other columns of the table)
To start the analysis, we need to load expression profiles for our samples. These are stored in files in the extdata
directory of the package with filenames starting with the sample name and ending with a suffix exp3p.gene.txt.gz
. We will reach them by specifying a path template,
path.template = dirname(system.file("extdata", "PlateSeries.longsummary.csv", package="ExpCube")) exp.template = paste0(path.template, "/PlateSeries/SAMPLE.exp3p.gene.txt.gz")
The text SAMPLE
is here a placeholder; individual samplenames will be substituted into this template one by one. (A second allowed placeeholder is PLATE
, which is convenient for larger datasets wherein samples are split by library preparation plate. ) We can now load the entire dataset with one command
seriesdata = E3LoadExp3pGeneData(series.summary[,"Sample"], series.summary[,"Plate.96well"], exp.template)
The output object now contains all the expression profiles and expression intervals for the entire dataset. As an illustration, let's look at gene "MDM2",
MDM2.id = which(seriesdata$expression[,"Gene"] == "MDM2") seriesdata$expression[MDM2.id, 1:4]
We can see that the expression of sample 1 and sample 3 differ by a factor of 1.2. Let's keep that in mind as we peak into the low and high expression estimates for the same samples,
seriesdata$expression.high[MDM2.id, 1:4] seriesdata$expression.low[MDM2.id, 1:4] ```` We see that the high estimate for sample 1 is slightly lower than the low estimate for sample 3; the two intervals do not overlap. This is suggestive that perhaps that the difference in expression for these two samples may be significant. ### Adjusting expression intervals The expression intervals in the `seriesdata` object are currently the ones obtained from the raw data files. (We loaded raw data from [exp3p](https://github.com/tkonopka/Exp3p) data files, so these intervals are based on a Poisson model of read counts. If you use data produced by other programs, the interpretation might be different). For various reasons, we might want to adjust these intervals. For example, we might want to widen intervals for genes that we believe are prone to batch effects. In this vignette we will not look at how to select such genes or how to decide how strong such adjustments should be. We will just load a set of adjustments from a data file ```r rescale.file = system.file("extdata", "KiCube.rescaling.factor.discovery.csv", package="ExpCube") rescale.data = read.table(rescale.file, header=T, stringsAsFactors=F) gene.rescaling = rescale.data[, "rescale.factor"] names(gene.rescaling) = rescale.data[,"Gene"] gene.rescaling["MDM2"]
Here, we see the suggested rescaling factor for gene MDM2 is around 5. We can now apply this rescaling to our samples
seriesdata.rescaled = E3RescaleIntervals(seriesdata, samplenames, gene.rescaling)
Let's look at the effect of this operation on MDM2,
seriesdata.rescaled$expression[MDM2.id, 1:4] seriesdata.rescaled$expression.low[MDM2.id, 1:4] seriesdata.rescaled$expression.high[MDM2.id, 1:4]
Note how the base expression values are unchanged, but the interval are much wider than before. There is clear overlap between the intervals for samples 1 and 3. This is suggestive that the difference in expression may not be important/reproducible/significant after all.
Let's use this dataset to learn how sequencing depth affects ability to perform expression analysis. To this end, let's create an auxilary dataset with additional samples. We will create these samples by shifting the profiles of our existing samples by a factor of 2. That is, the new samples will model an over-expression by a factor of 2.
seriesdata.OE = seriesdata.rescaled for (nowsamp in samplenames) { nowvals = seriesdata.OE$expression[,nowsamp] for (k in c("expression", "expression.low", "expression.high")) { seriesdata.OE[[k]][,paste0(nowsamp, "_2")] = seriesdata.OE[[k]][,nowsamp]+nowvals; } rm(nowvals) }
We will need a function to call differential expression. Let's use function E3DEscore
from the package. It takes two primary arguments, a fold-change capturing effect size and a z-score capturing significance. It outputs a value in [-1, 1] scoring differential expression. (The function can be tuned through additional parameters, see documentation). Here are some examples of scores obtained through default settings
E3DEscore(fc=c(2, 2, 1.5, 1.5, 0.5), z=c(2, 1.5, 2, 1.5, -2))
Now, let's compute the number of expressed genes in each sample (number of genes with expression above a threshold). At the same time, let's also compute the number of differentially expressed genes between our samples and the matching auxilary samples (number of genes with DE score above threshold)
series.summary[,"ExpressedGenes"] = 0; series.summary[,"useDE"] = 0; min.expression = 1; min.DEscore = 0.7; for (nowsamp in samplenames) { nowexpressed = seriesdata.OE$expression[,nowsamp]>min.expression; series.summary[nowsamp, "ExpressedGenes"] = sum(nowexpressed) temp = E3GetDEScores(seriesdata.OE, sampleA=paste0(nowsamp,"_2"), refsamples=nowsamp, scorefun=E3DEscore, shiftval=min.expression/100, rankrescale=FALSE) series.summary[nowsamp, "useDE"] = sum(abs(temp)>min.DEscore & nowexpressed) rm(nowexpressed, temp) }
The results are now encoded into new columns of the series.summary
table.
series.summary[,c("Sample", "Reads", "ExpressedGenes", "useDE")]
The number of genes detected as "expressed" is roughly constant, around 12,000, for all samples regardless of sequencing depth. The number of genes that are detected as differential expressed, in contrast, rises steadily with sequencing depth (reads).
As a final component of this vignette, let's visualize the modeling results. Let's make a line chart showing the two series we just computed.
myRcss = Rcss(system.file("extdata", "ExpCube.Rcss", package="ExpCube")) RcssSetDefaultStyle(myRcss) E3PlotLines(series.summary, xcolumn="Reads", ycolumns=c("ExpressedGenes", "useDE"), lineclass=c("EX", "DE"), legend=c("Expressed genes", "Genes detectable at FC=2"), legend.at = c(8,2), xlab="Depth (reads)", ylab="Genes", main="Power of shallow sequencing", Rcss=myRcss, Rcssclass="plotline")
The first group of statements defines a plotting style used by Rcssplot. The last command should display a graph with two series, including smoothed lines.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.