devtools::load_all()
This project documents everything around the project on Alternative Splicing
that I am working on @ FGCZ.
We start with doing feature counts in bam-files using App EzAppFeatureCounts
(file app-featureCounts.R) from package ezrun
. The run method ezMethodFeatureCounts
of reference class EzAppFeatureCounts
basically calls function Rsubread::featureCounts
.
Rsubread::featureCounts
To be able to run a first analysis that counts features in bam-files, we have to understand what input arguments are required by Rsubread::featureCounts
. This function is described in section 3 of the Rsubread
vignette. From the vignette, it seams that all we have to give to function Rsubread::featureCounts
is a bam-file and an annotation file.
For the version of Rsubread::featureCounts
that is wrapped into EzAppFeatureCounts
, we have to specify some more details. Most of the parameters can be inferred by their appearance in the call to Rsubread::featureCounts
and by a trackback into the users guide. Table 3 in Section 6 of that userguide gives an overview over all parameters that can be specified when calling Rsubread::featureCounts
. The ones that we specified are shown in the following table
Arguments <- c('annot.ext', 'gtfFeatureType', 'featureLevel', 'allowMultiOverlap', 'paired', 'strandMode', 'minMapQuality', 'minFeatureOverlap', 'keepMultiHits', 'countPrimaryAlignmentsOnly') Description <- c("Annotation oder feature file", "Type of feature, such as exon. Only features provided in annotation will be counted", "Feature level such as gene or transcript", "Should reads be allowed to be assigned to more than one meta-feature", "Fragments are counted instead of reads", "unstranded, i.e., both = 0, sense = 1, or antisense = 2", "minimum quality from mapping to be counted", "minimum number of bases overlapping between read or fragment and feature", "count also reads with multiple mapping locations", "only primary alignments will be counted") knitr::kable(data.frame(Arguments,Description))
The arguments shown above were all simple to specify in EzAppFeatureCounts
. The only parameter that was difficult was the annotation file. The problem is related to the fact that one cannot just specify the path to the annotation file, but one has to specify the location where the annotation file is inside an S4 object of class "EzRef". Have a look at the help file for "EzRef" using
?EzRef
The example section of that help file gives some hints on how one has to come up with an acceptable S4 "EzRef"-object. For our example this works as follows
sRefFeatureFile <- "/srv/GT/reference/Mus_musculus/Ensembl/GRCm38/Annotation/Version-2014-02-25/Genes/genes.gtf"
Given that the complete path to the annotation file is as shown above, we have to extract the part between the organism name and the Version info into an object called refBuild
.
sOrganismName <- "Mus_musculus" vecFeatPath <- unlist(strsplit(sRefFeatureFile, "/")) nRefBuildStartIdx <- which(vecFeatPath == sOrganismName) nRefBuildEndIdx <- grep("^Version", vecFeatPath) refBuild <- paste(vecFeatPath[nRefBuildStartIdx:nRefBuildEndIdx], collapse = "/") cat(" * RefBuild: ", refBuild, "\n")
The reference feature file is then constructed using
param <- ezParam(list(refBuild=refBuild))
The expression param$ezRef@refFeatureFile
then returns the path to the feature file.
Now that the program produces results, we want to better understand those results. Therefore, we have to have a look at the methods behind Rsubread
. The methods used in Rsubread
is described in its usersguide and also in a paper by @Liao01042014.
In chapter 8 of the usersguide that comes with the Rsubread
package, a case study for comparing RNA-seq data from different tissue in Humans (brain vs. universal reference) is described. In what follows we try to replicate that for our data.
Once that we have our feature counts, we are ready to analyse whether we can quantify any differential usage between different experimental units. This is done using the BioC package DEXSeq
. In what follows, we are trying to understand and to reproduce the material in the vignette of DEXSeq
(see https://bioconductor.org/packages/3.3/bioc/vignettes/DEXSeq/inst/doc/DEXSeq.pdf).
The vignette of DEXSeq
uses the pasilla dataset, an RNA-Seq dataset generated by @Brooks01022011.
Annotation/feature files must be in GTF format and are preprocessed into so-called GFF files. This is done by python scripts outside of R. We try to understand the preprocessing of the annotation files and try to come up with an alternative route how to construct these GFF files.
Feature counting for the dataset used in the vignette of DEXSeq
is also done with python. In our case we have done feature counts with package RSubread
. On more quesiton will be how we have to convert our counts into the formats of the counts that are expected by DEXSeq
.
The rest of the analysis is done in R. The data processed count data is available as package pasilla
on BioC.
NB: On http://fgcz-rstudio.uzh.ch/ I installed the missing package pasilla
to a library that is inside my account.
.libPaths(new = "/home/petervr/lib/R/library") source("https://bioconductor.org/biocLite.R") biocLite("pasilla")
The datafiles that come with the package are
inDir <- system.file("extdata", package="pasilla") list.files(inDir) countFiles <- list.files(inDir, pattern="fb.txt$", full.names=TRUE) (flattenedFile <- list.files(inDir, pattern="gff$", full.names=TRUE))
The first step in the analysis is to create the so-called sample table. This table relates samples to experimental conditions. In the vignette this is constructed on the fly
sampleTable <- data.frame( row.names = c( "treated1", "treated2", "treated3", "untreated1", "untreated2", "untreated3", "untreated4" ), condition = c("knockdown", "knockdown", "knockdown", "control", "control", "control", "control" ), libType = c( "single-end", "paired-end", "paired-end", "single-end", "single-end", "paired-end", "paired-end" ) ) print(sampleTable)
Our sample table has the samples as rownames and the covariates as columns. In the example in the vignette there are two covariates, (1) the condition and (2) the library type, as it seams that the samples are not all generated with the same sequencing method. So far it is important that there must be at least one covariate and its name has to be condition
. This is the main criterion for differentiating the samples.
From the information, we have collected so far, a DEXSeqDataSet object can be constructed using the function DEXSeqDataSetFromHTSeq
as follows
sDxdFn <- file.path(system.file(package = "AlternativeSplicing", "inst", "extdata", "pasilla"), "dxd.Rd") bRunAnalysis <- TRUE if (file.exists(sDxdFn)) { suppressPackageStartupMessages( library( "DEXSeq" ) ) load(file = sDxdFn) bRunAnalysis <- FALSE }
suppressPackageStartupMessages( library( "DEXSeq" ) ) dxd <- DEXSeqDataSetFromHTSeq( countFiles, sampleData=sampleTable, design= ~ sample + exon + condition:exon, flattenedfile=flattenedFile )
if (!file.exists(sDxdFn)) { save(dxd, file = sDxdFn) }
The above function takes four arguments.
In order to keep the runtime of the vignette small, only a subset of the genes is considered
genesForSubset = read.table( file.path(inDir, "geneIDsinsubset.txt"), stringsAsFactors=FALSE)[[1]] dxd = dxd[geneIDs( dxd ) %in% genesForSubset,]
The DEXSeqDataSet
class is derived from the DESeqDataSet
class. Hence all accessor functions for column data and row data are available here. Each row of the DEXSeqDataSet
contains in each column the count data from a given exon labelled with this
as well as the count data from the sum of the other exons belonging to the same gene, labelled with others
. Annotation and column information is specified in colData
.
colData(dxd)
nRowCount <- 5
Accessing the first r nRowCount
rows by doing the following
<<DefRowCount>> head( counts(dxd), nRowCount )
Notice, there are seven samples and 14 columns in the above count data. The first seven columns correspond to counts of the specific exon and the remaining seven are the counts of all other exons in that gene. The mapping of the columns to the counts can be seen using
split( seq_len(ncol(dxd)), colData(dxd)$exon )
Accessing the first r nRowCount
rows of the counts belonging to the exonic region ('this') can be done by
head( featureCounts(dxd), 5 )
To see the details about the counting bins, we can use the rowRanges()
function
head( rowRanges(dxd), 3 )
The sample table can be braught back using function sampleAnnotation()
sampleAnnotation( dxd )
dxd = estimateSizeFactors( dxd )
dxd = estimateDispersions( dxd ) plotDispEsts( dxd )
In order to get started with a DEXSeq analysis with our own data, we need to do two preparatory data transformation.
The second step is done with the python scripts as indicated in the vignette.
The task now is to transform the count data and the existing GTF annotation into a format that is accepted by DEXSeq
. The first step is to get all arguments for the function DEXSeqDataSetFromHTSeq
ready. The two arguments design
which specifies the contrasts and the sampleData
which links exerimental factors to samples, are relatively straight forward to come up with.
The count files and the flattened annotation files need more work to come up with. The count files contain for each sample two columns with the Exon-Ids in the first column and the counts in the second column. In principle RSubread produces such count files, but the problem is that the exon-ids are special in the form that is required by DEXSeq
. DEXSeq
has its own Exon-Id system where an ID is composed of the gene-id and an internal Exon-number. These two tokens are pasted together, separated by a colon (':').
One way of producing such count files with the special DEXSeq
-version of exon-ids is to extend the existing GTF-file by an additional identifier which has the format that is required by DEXSeq
. From that, we can then produce a GFF file as it is specified in Appendix C.
Data preparation is done as described in the DEXSeq
vignette. Here, we follow closely the steps described in secions 2.3ff of the vignette (https://bioconductor.org/packages/3.3/bioc/vignettes/DEXSeq/inst/doc/DEXSeq.pdf). The data preparation is done with Python scripts which are based on the Python package HTSeq
.
DEXSeq
works with annotation files in GFF format. Hence any existing annotation files in GTF format must be converted into the GFF format. This conversion is done using the python script dexseq_prepare_annotation.py
that comes with package DEXSeq
. On the command line, in the directory where the original GTF file is, the conversion is done as follows
python \ /misc/ngseq8/stow/R-3.2.2/lib/R/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py \ genes.gtf \ genes_DEXSeq.gff
Once the GFF file is generated, we can do the counting with the second python script that comes with DEXSeq
. According to the DEXSeq
vignette, the GFF file and the SAM- file with the reads must be in the current working directory. But the counting does also work when creating a link from the current working directory to the GFF file.
The bam files are converted to sam files using samtools as shown below.
samtools view -h -o Colon1.sam Colon1.bam
Creating the link and starting the counting is done as follows
ln -s /srv/GT/reference/Mus_musculus/Ensembl/GRCm38.PatchesSkipped/Annotation/Version-2015-06-25/Genes/genes_DEXSeq.gff.20160219 genes_DEXSeq.gff python /misc/ngseq8/stow/R-3.2.2/lib/R/library/DEXSeq/python_scripts/dexseq_count.py \ genes_DEXSeq.gff \ Colon1.sam \ Colon1.txt
The results from the counting are stored in the file Colon1.txt
. This has the required format of GeneId:InternalExonId Count
.
The conversion of the annotation files from gtf to gff format and the counting using the python-scripts from the DEXSeq package is done in ezApp "EzAppDEXSeqCounting". The run method of the reference class is called "ezMethodDEXSeqCounting". This method does two steps, first the gtf annotation file is converted into gff format, in case there is no gff file available. Once a gff annotation file exists, the counting is done for all samples.
Since we are doing annotation file format conversion and exon counting with the python script that are coming with the DEXSeq package, we have to manage to run python scripts from inside of the ezApp, i.e., from inside of R. While the above mentioned commands run from the command line, they are producing an error when used from inside of R. The error tells that the "HTSeq" package cannot be found which is most likely due to a missing PYTHONPATH. Looking at the global variable files of package "ezRun", we get more insight into how to run python scripts from inside of R. The trick is to define a string which contains the PYTHONPATH definition and the path to the python executable. Surprisingly enough, the assignment of the PYTHONPATH and the path to the python executable must be separated by a space and not by a semi-colon (";"). Hence for our example the required assignment in the global variables file looks as follows
HTSEQ_PREFIX='PYTHONPATH="/usr/local/ngseq/lib/python/:/usr/local/ngseq/lib/python2.7:\ /usr/local/ngseq/lib/python2.7/dist-packages" /usr/local/ngseq/bin/python'
Inside of EzAppDEXSeqCounting, calls to the python scripts are composed of the above defined prefix, the absolute path to the script name which depends on system.file(package = "DEXSeq", "python_scripts")
and the arguments to the scripts, like input read files, annotation files and output files.
Now that we have the count data from the reads, we proceed as shown in the DEXSeq vignette with the pasilla
dataset.
The first step is to create a sample table for the data that we want to analyse. The sample table links samples to covariates. In contrary to the vignette where there were two covariates, in our data here, we just have one covariate which corresponds to the tissue type. Since the minimal requirements for the sample table is to have at least one covariate which is called "condition", the tissue type covariate is assigned to column "condition"
sampleTable <- data.frame( row.names = c("Colon1", "Colon3","Colon5", "SI1", "SI3","SI5"), condition = c("Colon", "Colon","Colon", "SI","SI","SI") ) print(sampleTable)
As soon as the four inputs
are ready, we can construct a DEXSeqDataSet
object which is the primary container of a DEXSeq analysis. A DEXSeqDataSet
is obtained from the function DEXSeqDataSetFromHTSeq()
which takes the above listed four arguments as input.
For our analysis, the command looks as follows
dxd <- DEXSeq::DEXSeqDataSetFromHTSeq( countFiles, sampleData = sampleTable, design = ~ sample + exon + condition:exon, flattenedfile = sRefFeatGff )
The remainder of the analysis is all about extending the DEXSeqDataSet
object.
Different samples are sequenced with different depth. To adjust for such coverage bias, we have to estimate size factors which measure relative sequencing depth. Normalisation in DEXSeq
is done the same way as in DESeq and in DESeq2 using the function estimateSizeFactors()
. The resulting size factors can be seen by
dxd@modelFrameBM$sizeFactor
Doing a histogram of these factors shows that they are the same for all exons from the same sample.
To test for differential exon usage, variability in the data must be estimated. It is important to be able to separate technical and biological variation (noise) from variation that is caused by experimental conditions. Noise is typically assessed from the biological replicate measures in the dataset. In RNA-Seq experiments, the number of replicates is typically to small to estimate variation reliably for each exon. Therefore variance is estimated accross exons and genes in an intensity dependent manner. The approach used in DEXSeq
and in DESeq
/DESeq2
is described in @Love2014.
The function for estimating dispersion is called
dxd = DEXSeq::estimateDispersions( dxd )
As a shrinkage diagnostic which might also be useful for quality control, one might have a look at the plot of per exon dispersion estimates versus the mean normalized counts, the resulting fitted values and the aposteriori shrinked dispersion estimates. This plot is generated by
plotDispEsts( dxd )
The formula specified with the parameter design
when creating the DEXSeqDataSet
object is taken to be the full model. This is compared to a reduced model which is missing the interaction condition:exon
. Using a $\chi^2$ test, we can compare the two models. The command for testing differential expression is
dxd = DEXSeq::testForDEU( dxd )
dxd = DEXSeq::estimateExonFoldChanges( dxd, fitExpToVar="condition")
Results can be extracted without all intermediate results using the function
dxr1 = DEXSeqResults( dxd )
The description of the different result characteristics is available via
mcols(dxr1)$description
NB: Results that are shown below here are again from the pasilla
data not from our own dataset.
From this result object, we can answer questions, like how many genes have a false discovery rate below 0.1
table ( dxr1$padj < 0.1 )
We can also ask how many genes are affected by differntial exon usage
table ( tapply( dxr1$padj < 0.1, dxr1$groupID, any ) )
To see how the power to detect differntial exon usage depends on the number of reads that map to an exon, an MA-plot is useful which shows the log fold change versus the average normalized count per exon.
plotMA( dxr1, cex=0.8 )
Now we are going back and create an initial version of the DEXSeqDataSet and specify more experimental factors. Then we are using the wrapper that runs all the analyses in one command and extract the new results from the return of the wrapper function
sampleTable <- data.frame( row.names = c( "treated1", "treated2", "treated3", "untreated1", "untreated2", "untreated3", "untreated4" ), condition = c("knockdown", "knockdown", "knockdown", "control", "control", "control", "control" ), libType = c( "single-end", "paired-end", "paired-end", "single-end", "single-end", "paired-end", "paired-end" ) ) sDxdFn <- "dxd.Rd" if (file.exists(sDxdFn)) { load(file = sDxdFn) } else { dxd <- DEXSeqDataSetFromHTSeq( countFiles, sampleData=sampleTable, design= ~ sample + exon + condition:exon, flattenedfile=flattenedFile ) } dxd = dxd[geneIDs( dxd ) %in% genesForSubset,]
dxr2 <- DEXSeq::DEXSeq(dxd, fullModel = ~ sample + exon + libType:exon + condition:exon, reducedModel = ~ sample + exon + libType:exon)
Once that new result object is constructed, we are ready to do some plots
The function plotDEXSeq
can be used to visualize results.
DEXSeq::plotDEXSeq( dxr2, "FBgn0010909", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 )
A complete report can be generated using the function
DEXSeq::DEXSeqHTML( dxr2, FDR=0.1, color=c("#FF000080", "#0000FF80") )
From the above, it becomes clear what the minimal path to the resuls is.
dxd <- DEXSeq::DEXSeqDataSetFromHTSeq( countFiles, sampleData=sampleTable, design= ~ sample + exon + condition:exon, flattenedfile=flattenedFile )
DEXSeq
dxr <- DEXSeq:::DEXSeq(dxd)
DEXSeq::DEXSeqHTML(dxr)
The analysis using the minimal number of steps is implemented in a Reference class called EzDEXSeqAnalysis
.
According to the help file given by ?DEXSeq::DEXSeqHTML
, one can specify extra columns to be added to the report when passing a dataframe with the argument extraCols
. As a test, we are just adding the reverse gene name as extra column. Then this can be done as follows
dfExtraAnot <- data.frame(InverseGenename = reverse(unique(geneIDs(dxd)))) rownames(dfExtraAnot) <- unique(geneIDs(dxd)) DEXSeq::DEXSeqHTML(dxr, extraCols = dfExtraAnot )
The result is not so convincing, as the extra-column in the report gets added to the left of the table and not to the right.
As a consequence of the low flexibility when working with the pre-specified html-output given by DEXSeq::DEXSeqHTML()
, it is preferred to generated a text-based output that can further be used and presented in different formats.
\newpage
browseVignettes()
In RStudio on http://fgcz-rstudio.uzh.ch/ the function browseVignettes()
has a problem. When called without any arguments, it lists the table of content of all vignettes, but the links do not work.
When calling the function browseVignettes()
with arguments, e.g., browseVignettes(package = "Rsubread")
, the list of vignettes for the specific package is shown, but the links do not work.
\newpage
sessionInfo()
\newpage
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.