This vignette describes how to run the Keep Me Around (kma) suite to quantify intron retention in RNA-Seq experiments.
The general pipeline is as follows:
(Flowchart of the pipeline: pre-processing in Python, quantification, then post-processing in the kma R package. TODO: Put a more detailed flowchart here.)
Two components are required to perform intron retention quantification. Installing the post-processing tools (the kma R package) also installs the Python pre-processing code, with the exception of its dependencies. The pre-processing step is written in Python and only needs to be run when the annotation changes. The post-processing step is written in R and is where most of your interaction will take place.
If you have installed the R package (see Installing the post-processing tools), then you have successfully installed the pre-processing tools. From R, you can find the path by typing:
system.file("pre-process", package="kma")
The only additional dependencies are the Python packages pyfasta, biopython, and pysam. All three can be installed via PyPI:
pip install pyfasta biopython pysam
The tools needed for quantification are Bowtie 2 (for alignment) and eXpress (for quantification). Please see the corresponding documentation on their respective web pages.
The pre- and post-processing tools are contained in an R package called kma. The development version is always on Github. To install kma and the required R dependencies, run the code below:
required_packages <- c("devtools", "data.table", "reshape2", "dplyr")
install.packages(required_packages)
devtools::install_github("pachterlab/kma")
You can then load the kma R package like you would any other R package:
library("kma")
We recommend organizing your experiments in the following format:
experiment
|____conditionA
| |____conditionA1
| |____conditionA2
| |____conditionA3
|____conditionB
| |____conditionB1
| |____conditionB2
| |____conditionB3
where experiment is the top level for the particular set of experiments, conditionX refers to the condition (e.g. tumor or control), and conditionXY represents the biological replicates. This gives you a structured way to keep track of your data as well as an easy way to load it into R. An example can be seen in the worked example section.
Note: Pre-processing only has to be run when the gene annotation changes. If you use the same annotation for many experiments, you won't be running this tool often.
The path to the pre-processing tools can be found in R by typing system.file("pre-process", package="kma"). Open a terminal and put this path in an environment variable:

PRE=/path/to/kma/pre-process

If you are on a Mac, the path might look like: /Library/Frameworks/R.framework/Versions/3.1/Resources/library/kma/pre-process
We are now ready to generate the introns. A typical command line call to the pre-processing tools looks like:
python $PRE/generate_introns.py --genome seq.fa --gtf trans.gtf --extend N --out out_dir
The inputs are:

- --genome seq.fa: genome sequence in Multi-FASTA format, where contig names correspond to those in the GTF file.
- --gtf trans.gtf: annotation file in GTF format.
- --extend N: optional, though we strongly recommend setting it. If set, N is the number of bases to overlap into the intron. Note, this should be at most read_length - 1, but we suggest making it smaller.
- --out out_dir: a directory to write the outputs. It will be created if it doesn't already exist.

The following outputs will then be put in out_dir:

- intron_to_transcripts.txt
- introns.bed
- introns.fa
Note: Technically, any quantification tool can be used with kma, but currently only support for eXpress is implemented. Please contact me on Github if you're interested in having another quantification tool supported.
Here, we will discuss quantification. After generate_introns.py is run, introns.fa should be combined with the full transcript sequences. This can be done using the Linux command cat:
cat trans.fa introns.fa > trans_and_introns.fa
We assume the file name is trans_and_introns.fa in the following sections, but the file name can be anything.
After you've done this, this section requires the following steps: building a Bowtie 2 index, aligning the reads with Bowtie 2, and quantifying with eXpress.
See the Bowtie 2 manual for more advanced options.
bowtie2-build --offrate 1 trans_and_introns.fa trans_and_introns
This only has to be run once, unless you decide to change the gene annotation.
Once you have a Bowtie 2 index, you can align any number of RNA-Seq experiments to that index. The following arguments are recommended when running Bowtie 2:
bowtie2 -k 200 --rdg 6,5 --rfg 6,5 --score-min L,-.6,-.4 -x trans_and_introns -1 left.fastq -2 right.fastq | samtools view -Sb - > hits.bam
If you don't have many reads, you might consider replacing the -k 200 argument with -a.
Once you've aligned the reads, you can run eXpress against the alignments. See the eXpress website for additional arguments. The general eXpress call is as follows:
express trans_and_introns.fa hits.bam
If you don't have many reads, consider running additional batch iterations using the -B argument.
The first step is to load the quantification data into R. The function read_express takes a list of file names, sample names, and condition names. read_express then returns a list with the attributes:
- tpm - a data.frame with the TPM of all samples
- uniq_counts - a data.frame with the number of unique counts of all samples
- all_data - a list of data.frames from all the eXpress output, sorted by target_id
- sample - a character vector for each sample, describing the sample (e.g. tumor1, tumor2)
- condition - a character vector for each sample, describing the grouping (e.g. tumor)

This data can then be used by newIntronRetention. An example of this can be seen in the worked example.
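As a rough illustration of the returned structure (the file names and labels below are hypothetical placeholders, not files shipped with kma), loading two samples and peeking at the list members might look like:

# Hypothetical file names and labels -- substitute your own.
fnames <- c("ctrl1/results.xprs", "tumor1/results.xprs")
xprs <- read_express(fnames, c("ctrl1", "tumor1"), c("ctrl", "tumor"))
head(xprs$tpm)          # TPM for each target in each sample
head(xprs$uniq_counts)  # unique counts for each target in each sample
xprs$condition          # grouping later passed to newIntronRetention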
We will run through a small example. The data set, including reads, can be found here. If you are not interested in the pre-processing and quantification steps (some might find it simpler to skip them), the results from the pipeline already exist in the R package:
system.file("example", package="kma")
In this case, you can simply jump to the post-processing phase, which runs entirely through the R package itself.
This example contains the reads, annotation, and sequences for the SF3B1 gene in orthochromatic erythroblasts from GSE53635. The dataset is unrealistically small, so the results aren't entirely reliable, but it runs through the pipeline very quickly and serves as an example of how to do so.
Run system.file in R to get the pre-processing directory (see the pre-processing section above for more information):
system.file("pre-process", package = "kma")
Then, store this value in an environment variable in your shell:
PRE=/path/to/kma/pre-process
Extract the data from the downloaded archive, move to the example directory, and run the pre-processing:
tar -xf kma_example.tar.gz
cd kma_example/example
python $PRE/generate_introns.py --genome genome/chr2.fa \
    --gtf annotation/refGene_sf3b1.gtf --extend 25 \
    --out kma_pre-process_out
We chose --extend 25 here because the reads are 51 bases long. An overlap of 25 bases on each side will give us a fair amount of junction reads without resulting in spurious alignments. If your read length is longer, feel free to set this value higher.
The output files will then be stored in kma_pre-process_out, which you can view using ls kma_pre-process_out:
intron_to_transcripts.txt introns.bed introns.fa
The introns.fa file can now be merged with the transcriptome sequences in the quantification section.
Next, we need to merge the transcriptome and intron sequences. This can be done using the command cat:
cat annotation/sf3b1.fa kma_pre-process_out/introns.fa > annotation/sf3b1_with_introns.fa
The file annotation/sf3b1_with_introns.fa now contains the transcriptome sequences and the intron sequences. This FASTA file can now be used to build a Bowtie 2 index for alignment.
We want to align against the introns and transcripts. Thus, we will build a Bowtie 2 index. See the Bowtie 2 website for more information on the parameters:
bowtie2-build --offrate 1 annotation/sf3b1_with_introns.fa annotation/sf3b1_with_introns
This call will result in many Bowtie 2 index files (*.bt2 files).
For alignment, we will only align one sample, as the other samples can be aligned similarly and the process can be automated. Please note that the reads we are aligning are single-end; the command for paired-end reads can be found on the Bowtie 2 website.
bowtie2 -k 200 --rdg 6,5 --rfg 6,5 --score-min L,-.6,-.4 -x annotation/sf3b1_with_introns \
    -U experiment/ortho/ortho1/*.fastq.gz | samtools view -Sb - > experiment/ortho/ortho1/sf3b1.bam
This will result in a BAM file, experiment/ortho/ortho1/sf3b1.bam, which can then be quantified using eXpress.
Now that we have alignment files, we can quantify isoform and intron expression with eXpress:
express -o experiment/ortho/ortho1/xprs_out annotation/sf3b1_with_introns.fa \
experiment/ortho/ortho1/sf3b1.bam
The quantification results will now be in experiment/ortho/ortho1/xprs_out. The other samples can be processed similarly.
Since we organized our data nicely, we can load the data into R quite easily using Sys.glob:
base_dir <- system.file("example", package="kma")
xprs_fnames <- Sys.glob(file.path(base_dir, "experiment/*/*/xprs_out/results.xprs"))
xprs_fnames
Sys.glob does wildcard expansion on a character vector to find file names.
Sample names can be inferred from the file paths by removing the extraneous information using sub and gsub.
sample_names <- sub(file.path(base_dir, "experiment/[a-z]+/"), "", xprs_fnames) %>%
    sub("xprs_out/results.xprs", "", .) %>%
    gsub("/", "", .)
sample_names
Since we labelled replicates according to the condition they are a part of, you can also infer the condition names from the sample names:
condition_names <- sub("[0-9]+", "", sample_names)
condition_names
You now have three R variables:

- xprs_fnames - a character vector of file names pointing to the quantification files
- sample_names - a character vector of identifiers for each sample
- condition_names - a character vector of identifiers of conditions

A user need not use Sys.glob or have their data structured the way we have presented, but if they do, manually typing the sample and condition names is unnecessary.
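If your data is laid out differently, you can simply type these vectors by hand. A minimal sketch, with hypothetical paths and labels:

# Hypothetical paths and labels -- substitute your own.
xprs_fnames <- c("quant/tumor1/results.xprs", "quant/tumor2/results.xprs",
                 "quant/ctrl1/results.xprs", "quant/ctrl2/results.xprs")
sample_names <- c("tumor1", "tumor2", "ctrl1", "ctrl2")
condition_names <- c("tumor", "tumor", "ctrl", "ctrl")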
We can now load all the eXpress data into R:
xprs <- read_express(xprs_fnames, sample_names, condition_names)
This results in a named list with the following members:
names(xprs)
We only need one other file, intron_to_transcripts.txt, which was created during the pre-processing step. We can read it in using read.table or data.table::fread, which is much faster; just be sure to use the data.table = FALSE flag as seen below:
intron_to_trans <- data.table::fread(file.path(base_dir, "kma_pre-process_out",
    "intron_to_transcripts.txt"), data.table = FALSE)
head(intron_to_trans)
To create an IntronRetention object, you can call newIntronRetention with the following command:
ir <- newIntronRetention(xprs$tpm, intron_to_trans, xprs$condition, xprs$uniq_counts)
newIntronRetention takes the TPM table, the intron-to-transcript mapping, the condition grouping, and the unique counts, as in the call above. See help(newIntronRetention) for more details on the arguments.
Printing the IntronRetention object results in a short summary:
print(ir)
The data member flat in the IntronRetention object is where most of the operations happen. Advanced users might be interested in performing direct exploratory analysis on this table, which is similar to a denormalized SQL table (often referred to as a "tidy" table in R).
head(ir$flat)
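As a hedged sketch of such exploration with dplyr (the condition column is assumed here because the condition vector is passed to newIntronRetention; check colnames(ir$flat) for the actual column names):

library("dplyr")
# Count the number of rows in flat per condition (assumes a condition column exists).
ir$flat %>%
    group_by(condition) %>%
    summarise(rows = n())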
Filter functions have the following format: filter_name(ir, options), where ir is an IntronRetention object and options are options for the filter. They return an IntronRetention object with an updated flat member. Filters can be implemented by the user, but should return TRUE if the intron passes the filter and FALSE otherwise. The column name for each filter should begin with f_.
The filters below remove introns whose denominators have expression below 1 TPM, and introns with a PSI of exactly 0 or 1 (due to either no contribution from the intron coverage or coverage coming solely from the intron). filter_low_frags filters out introns that don't have at least N unique reads:
ir <- ir %>%
    filter_low_tpm(1) %>%
    filter_perfect_psi() %>%
    filter_low_frags(3)
colnames(ir$flat)
The zero coverage filter is a special type of filter that has to be computed on the read alignments and corresponding eXpress results.
python $PRE/zeroCoverage.py experiment/ortho/ortho1/xprs_out/results.xprs \
    experiment/ortho/ortho1/sf3b1.bam \
    experiment/ortho/ortho1/zero_coverage.txt
The output is then in experiment/ortho/ortho1/zero_coverage.txt. Zero coverage data can be read in using the function get_batch_intron_zc. Like read_express, it needs file names, sample names, and condition names:
zc_fnames <- Sys.glob(file.path(base_dir, "experiment/*/*/zero_coverage.txt"))
zc_samples <- sub(file.path(base_dir, "experiment/[a-z]+/"), "", zc_fnames) %>%
    sub("zero_coverage.txt", "", .) %>%
    gsub("/", "", .)
zc_conditions <- sub("[0-9]+", "", zc_samples)
all_zc <- get_batch_intron_zc(zc_fnames, zc_samples, zc_conditions)
head(all_zc)
The result from get_batch_intron_zc is a data.frame that can be inspected manually if you'd like. To incorporate it into an IntronRetention object, use summarize_zero_coverage:
ir <- summarize_zero_coverage(ir, all_zc)
This adds a new column to ir$flat called f_zc_*:
colnames(ir$flat)
Hypothesis testing aggregates all filters (by taking the intersection of all filters), computes the null distribution, then returns a data frame summarizing the test. Since there is some randomness involved in generating the permutations, you might consider setting a seed before calling retention_test, as we do below:
set.seed(42)
ir_test <- retention_test(ir)
head(ir_test)
Significant introns can be found using the dplyr function filter:
ir_test %>% filter(qvalue <= 0.10) %>% select(-c(pvalue))
In this case we only see one significant intron. Since the dataset is so small, the null distribution isn't very reliable and the results shouldn't be taken too seriously; with a larger dataset (as in a normal experiment), the results become more trustworthy.
TODO This section is coming soon. It will showcase how to make plots as well as some diagnostics one might do on a complete data set.