This vignette describes how to run the Keep Me Around (kma) suite to quantify intron retention in RNA-Seq experiments.
The general pipeline is as follows:
(Flowchart of the pipeline: pre-processing in Python, quantification, then post-processing in the kma R package. TODO: Put a more detailed flowchart here.)
Two components are required to perform intron retention quantification. Installing the post-processing tools (the kma R package) also installs the Python pre-processing code, with the exception of its dependencies. The pre-processing step is written in Python and only needs to be run when the annotation changes. The post-processing step is written in R and is where most of your interaction will take place.
If you have installed the R package (see Installing the post-processing tools), then you have successfully installed the pre-processing tools. From R, you can find the path by typing:
system.file("pre-process", package="kma")
The only additional dependencies are the Python packages pyfasta, biopython, and pysam. All three can be installed via PyPI:
pip install pyfasta biopython pysam
The tools needed for quantification are Bowtie 2 (for alignment) and eXpress (for quantification). Please see the corresponding documentation on their respective web pages.
The pre- and post-processing tools are contained in an R package called kma. The development version is always on Github. To install kma and the required R dependencies, run the code below:
required_packages <- c("devtools", "data.table", "reshape2", "dplyr")
install.packages(required_packages)
devtools::install_github("pachterlab/kma")
You can then load the kma R package like you would any other R package:
library("kma")
We recommend organizing your experiments in the following format:
experiment
|____conditionA
| |____conditionA1
| |____conditionA2
| |____conditionA3
|____conditionB
| |____conditionB1
| |____conditionB2
| |____conditionB3
where experiment is the top level for the particular set of experiments, conditionX refers to the condition (e.g. tumor or control), and conditionXY represents the biological replicates. This gives you a structured way to keep track of your data as well as an easy way to load it into R. An example can be seen in the worked example section.
Note: Pre-processing only has to be run when the gene annotation changes. If you use the same annotation for many experiments, you won't be running this tool often.
The path to the pre-processing tools can be found in R by typing system.file("pre-process", package="kma"). Open a terminal and put this path in an environment variable:

PRE=/path/to/kma/pre-process

If you are on a Mac, the path might look like: /Library/Frameworks/R.framework/Versions/3.1/Resources/library/kma/pre-process
We are now ready to generate the introns. A typical command line call to the pre-processing tools looks like:
python $PRE/generate_introns.py --genome seq.fa --gtf trans.gtf --extend N --out out_dir
The inputs are:

- --genome seq.fa: genome sequence in Multi-FASTA format, where contig names correspond to those in the GTF file.
- --gtf trans.gtf: annotation file in GTF format.
- --extend N: optional, though we strongly recommend setting it. If set, N is the number of bases to overlap into the intron. Note, this should be at most read_length - 1, but we suggest making it smaller.
- --out out_dir: a directory to write the outputs. It will be created if it doesn't already exist.

The following outputs will then be put in out_dir:

- intron_to_transcripts.txt
- introns.bed
- introns.fa
Note: Technically, any quantification tool can be used with kma, but currently only support for eXpress is implemented. Please contact me on Github if you're interested in having another quantification tool supported.
Here, we will discuss quantification. After generate_introns.py is run, introns.fa should be combined with the full transcript sequences. This can be done using the Linux command cat:
cat trans.fa introns.fa > trans_and_introns.fa
We assume the file name is trans_and_introns.fa in the following sections, but the file name can be anything.
After you've done this, this section requires the following steps: building a Bowtie 2 index, aligning the reads with Bowtie 2, and quantifying with eXpress.
See the Bowtie 2 manual for more advanced options.
bowtie2-build --offrate 1 trans_and_introns.fa trans_and_introns
This only has to be run once, unless you decide to change the gene annotation.
Once you have a Bowtie 2 index, you can align any number of RNA-Seq experiments to that index. The following arguments are recommended when running Bowtie 2:
bowtie2 -k 200 --rdg 6,5 --rfg 6,5 --score-min L,-.6,-.4 -x trans_and_introns -1 left.fastq -2 right.fastq | samtools view -Sb - > hits.bam
If you don't have many reads, you might consider replacing the -k 200 argument with -a.
Once you've aligned the reads, you can run eXpress against the alignments. See the eXpress website for additional arguments. The general eXpress call is as follows:
express trans_and_introns.fa hits.bam
If you don't have many reads, consider running additional batch iterations using the -B argument.
The first step is to load the quantification data into R. The function read_express takes a list of file names, sample names, and condition names. read_express then returns a list with the attributes:
- tpm - a data.frame with the TPM of all samples
- uniq_counts - a data.frame with the number of unique counts of all samples
- all_data - a list of data.frames from all the eXpress output, sorted by target_id
- sample - a character vector for each sample, describing the sample (e.g. tumor1, tumor2)
- condition - a character vector for each sample, describing the grouping (e.g. tumor)

This data can then be used by newIntronRetention. An example of this can be seen in the worked example.
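As a rough illustration of the returned structure (the file names and labels below are hypothetical placeholders, not files shipped with kma), loading two samples and peeking at the list members might look like:

# Hypothetical file names and labels -- substitute your own.
fnames <- c("ctrl1/results.xprs", "tumor1/results.xprs")
xprs <- read_express(fnames, c("ctrl1", "tumor1"), c("ctrl", "tumor"))
head(xprs$tpm)          # TPM for each target in each sample
head(xprs$uniq_counts)  # unique counts for each target in each sample
xprs$condition          # grouping later passed to newIntronRetention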
We will run through a small example. The data set, including reads, can be found here. If you are not interested in the pre-processing and quantification steps (some might find it simpler to skip them), the results from the pipeline already exist in the R package:
system.file("example", package="kma")
In this case, you can simply jump to the post-processing phase, which runs entirely through the R package itself.
This example contains the reads, annotation, and sequences for the SF3B1 gene in orthochromatic erythroblasts from GSE53635. The dataset is unrealistically small, so the results aren't entirely reliable, but it runs through the pipeline very quickly and serves as an example of how to do so.
Run system.file in R to get the pre-processing directory (see the pre-processing section above for more information):
system.file("pre-process", package = "kma")
Then, store this value in an environment variable in your shell:
PRE=/path/to/kma/pre-process
Extract the data from the downloaded archive, move to the example directory, and run the pre-processing:
tar -xf kma_example.tar.gz
cd kma_example/example
python $PRE/generate_introns.py --genome genome/chr2.fa \
    --gtf annotation/refGene_sf3b1.gtf --extend 25 \
    --out kma_pre-process_out
We chose --extend 25 here because the reads are 51 bases long. An overlap of 25 bases on each side will give us a fair amount of junction reads without resulting in spurious alignments. If your read length is longer, feel free to set this value higher.
The output files will then be stored in kma_pre-process_out, which you can view using ls kma_pre-process_out:
intron_to_transcripts.txt introns.bed introns.fa
The introns.fa file can now be merged with the transcriptome sequences in the quantification section.
Next, we need to merge the transcriptome and intron sequences. This can be done using the command cat:
cat annotation/sf3b1.fa kma_pre-process_out/introns.fa > annotation/sf3b1_with_introns.fa
The file annotation/sf3b1_with_introns.fa now contains the transcriptome sequences and the intron sequences. This FASTA file can now be used to build a Bowtie 2 index for alignment.
We want to align against the introns and transcripts. Thus, we will build a Bowtie 2 index. See the Bowtie 2 website for more information on the parameters:
bowtie2-build --offrate 1 annotation/sf3b1_with_introns.fa annotation/sf3b1_with_introns
This call will result in many Bowtie 2 index files (*.bt2 files).
For alignment, we will only align one sample, as the other samples can be aligned similarly and the process can be automated. Please note that the reads we are aligning are single-end; the command for paired-end reads can be found on the Bowtie 2 website.
bowtie2 -k 200 --rdg 6,5 --rfg 6,5 --score-min L,-.6,-.4 -x annotation/sf3b1_with_introns \
    -U experiment/ortho/ortho1/*.fastq.gz | samtools view -Sb - > experiment/ortho/ortho1/sf3b1.bam
This will result in a BAM file, experiment/ortho/ortho1/sf3b1.bam, which can then be quantified using eXpress.
Now that we have alignment files, we can quantify isoform and intron expression with eXpress:
express -o experiment/ortho/ortho1/xprs_out annotation/sf3b1_with_introns.fa \
experiment/ortho/ortho1/sf3b1.bam
The quantification results will now be in experiment/ortho/ortho1/xprs_out. The other samples can be processed similarly.
Since we organized our data nicely, we can load the data into R quite easily using Sys.glob:
base_dir <- system.file("example", package="kma")
xprs_fnames <- Sys.glob(file.path(base_dir, "experiment/*/*/xprs_out/results.xprs"))
xprs_fnames
Sys.glob does wildcard expansion on a character vector to find file names.
Sample names can be inferred from the file paths by removing the extraneous information using sub and gsub.
sample_names <- sub(file.path(base_dir, "experiment/[a-z]+/"), "", xprs_fnames) %>%
    sub("xprs_out/results.xprs", "", .) %>%
    gsub("/", "", .)
sample_names
Since we labelled replicates according to the condition they are a part of, you can also infer the condition names from the sample names:
condition_names <- sub("[0-9]+", "", sample_names)
condition_names
You now have three R variables:

- xprs_fnames - a character vector of file names pointing to the quantification files
- sample_names - a character vector of identifiers for each sample
- condition_names - a character vector of identifiers of conditions

A user need not use Sys.glob or have their data structured the way we have presented, but if they do, manually typing the sample and condition names is unnecessary.
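If your data is laid out differently, you can simply type these vectors by hand. A minimal sketch, with hypothetical paths and labels:

# Hypothetical paths and labels -- substitute your own.
xprs_fnames <- c("quant/tumor1/results.xprs", "quant/tumor2/results.xprs",
                 "quant/ctrl1/results.xprs", "quant/ctrl2/results.xprs")
sample_names <- c("tumor1", "tumor2", "ctrl1", "ctrl2")
condition_names <- c("tumor", "tumor", "ctrl", "ctrl")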
We can now load all the eXpress data into R:
xprs <- read_express(xprs_fnames, sample_names, condition_names)
This results in a named list with the following members:
names(xprs)
We only need one other file, intron_to_transcripts.txt, which was created during the pre-processing step. We can read it in using read.table or data.table::fread, which is much faster; just be sure to use the data.table = FALSE flag as seen below:
intron_to_trans <- data.table::fread(file.path(base_dir, "kma_pre-process_out",
    "intron_to_transcripts.txt"), data.table = FALSE)
head(intron_to_trans)
To create an IntronRetention object, you can call newIntronRetention with the following command:
ir <- newIntronRetention(xprs$tpm, intron_to_trans, xprs$condition, xprs$uniq_counts)
newIntronRetention takes the TPM table, the intron-to-transcript mapping, the condition grouping, and the unique counts, as in the call above. See help(newIntronRetention) for more details on the arguments.
Printing the IntronRetention object results in a short summary:
print(ir)
The data member flat in the IntronRetention object is where most of the operations happen. Advanced users might be interested in performing direct exploratory analysis on this table, which is similar to a denormalized SQL table (often referred to as a "tidy" table in R).
head(ir$flat)
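As a hedged sketch of such exploration with dplyr (the condition column is assumed here because the condition vector is passed to newIntronRetention; check colnames(ir$flat) for the actual column names):

library("dplyr")
# Count the number of rows in flat per condition (assumes a condition column exists).
ir$flat %>%
    group_by(condition) %>%
    summarise(rows = n())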
Filter functions have the following format: filter_name(ir, options), where ir is an IntronRetention object and options are options for the filter. They return an IntronRetention object with an updated flat member. Filters can be implemented by the user, but should return TRUE if the intron passes the filter and FALSE otherwise. The column name for each filter should begin with f_.
The filters below remove introns whose denominators have expression below 1 TPM, and introns with a PSI of exactly 0 or 1 (due to either no contribution from the intron coverage or coverage coming solely from the intron). filter_low_frags filters out introns that don't have at least N unique reads:
ir <- ir %>%
    filter_low_tpm(1) %>%
    filter_perfect_psi() %>%
    filter_low_frags(3)
colnames(ir$flat)
The zero coverage filter is a special type of filter that has to be computed on the read alignments and corresponding eXpress results.
python $PRE/zeroCoverage.py experiment/ortho/ortho1/xprs_out/results.xprs \
    experiment/ortho/ortho1/sf3b1.bam \
    experiment/ortho/ortho1/zero_coverage.txt
The output is then in experiment/ortho/ortho1/zero_coverage.txt. Zero coverage data can be read in using the function get_batch_intron_zc. Like read_express, it needs file names, sample names, and condition names:
zc_fnames <- Sys.glob(file.path(base_dir, "experiment/*/*/zero_coverage.txt"))
zc_samples <- sub(file.path(base_dir, "experiment/[a-z]+/"), "", zc_fnames) %>%
    sub("zero_coverage.txt", "", .) %>%
    gsub("/", "", .)
zc_conditions <- sub("[0-9]+", "", zc_samples)
all_zc <- get_batch_intron_zc(zc_fnames, zc_samples, zc_conditions)
head(all_zc)
The result from get_batch_intron_zc is a data.frame that can be inspected manually if you'd like. To incorporate it into an IntronRetention object, use summarize_zero_coverage:
ir <- summarize_zero_coverage(ir, all_zc)
This adds a new column to ir$flat called f_zc_*:
colnames(ir$flat)
Hypothesis testing aggregates all filters (by taking the intersection of all filters), computes the null distribution, then returns a data frame summarizing the test. Since there is some randomness involved in generating the permutations, you might consider setting a seed before calling retention_test, as we do below:
set.seed(42)
ir_test <- retention_test(ir)
head(ir_test)
Significant introns can be found using the dplyr function filter:
ir_test %>% filter(qvalue <= 0.10) %>% select(-c(pvalue))
In this case we only see one significant intron. Since the dataset is so small, the null distribution isn't very reliable and the results shouldn't be taken too seriously; with a larger dataset (as in a normal experiment), the results become more trustworthy.
TODO This section is coming soon. It will showcase how to make plots as well as some diagnostics one might do on a complete data set.