\tableofcontents \newpage
knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(nearBynding) library(Rsamtools)
## test whether the local computer can run the required programs bedtools<-suppressWarnings(system2("bedtools", "--version", stdout = NULL, stderr = NULL)) == 0 CapR<-suppressWarnings(system2("CapR", stdout = NULL, stderr = NULL)) == 0 stereogene<-suppressWarnings(system2("Stereogene", "-h", stdout = NULL, stderr = NULL)) == 0
nearBynding is a package designed to discern annotated RNA structures proximal to protein binding. nearBynding allows users to annotate RNA structure contexts via CapR or input their own annotations in BED/bedGraph format. It accomodates protein binding information from CLIP-seq experiments as either aligned CLIP-seq reads or peak-called intervals. This vignette will walk you through:
* The external software necessary to support this pipeline * Creating a concatenated transcriptome * Extracting and folding RNA from the transcriptome via CapR * Mapping protein-binding and RNA structure information onto a transcriptome * Running StereoGene to identify RNA structure proximal to protein binding * Visualizing binding results * Determining the distance between binding contexts
Before running any of these examples, it is highly recommended that the user
establishes a new empty directory and uses setwd()
to make certain that all
outputs are deposited there. Some of the functions below create multiple output
files.
if(!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("nearBynding")
Add all dependency directories to your PATH after installation.
bedtools is available for installation here.
Installation instructions will vary by operating system.
Download the zip file from the github repository, unzip the file, and move it to a directory where you want to permanently store the function.
In the command line, access the folder where the unzipped file is stored.
cd CapR-master
make
./CapR
If installation is successful, the final line will output
Error: The number of argument is invalid.
Download the zip file from the github repository, unzip the file, and move it to a directory where you want to permanently store the function.
In the command line, access the folder where the unzipped file is stored.
cd stereogene-master cd src make ./stereogene -h
If installation is successful, the final line will output a menu of argument options.
Although nearBynding is designed to support whole-genome analyses, we will exclusively be evaluating protein-coding genes of chromosomes 4 and 5 through this vignette.
First, a list of transcripts must be identified for analysis. A recommended criterium for selection is that the transcripts be expressed in the cell type used for CLIP-seq experiments. For this vignette, 50 random transcripts have been selected, and the 3'UTR structure of each transcript will be used for analysis, though any region of a transcript such as 5'UTR or CDS could be assessed instead.
This step creates a chain file that will be used to map the selected regions of transcripts end-to-end, excluding the intergenic regions and undesired transcripts that comprise the majority of the genome.
# load transcript list load(system.file("extdata/transcript_list.Rda", package="nearBynding")) # get GTF file gtf<-system.file("extdata/Homo_sapiens.GRCh38.chr4&5.gtf", package="nearBynding") GenomeMappingToChainFile(genome_gtf = gtf, out_chain_name = "test.chain", RNA_fragment = "three_prime_utr", transcript_list = transcript_list, alignment = "hg38")
A file containing the sizes of each concatenated chromosome in the chain file will be required for downstream analysis.
getChainChrSize(chain = "test.chain", out_chr = "chr4and5_3UTR.size")
In order to fold the 3'UTRs, the sequences must first be extracted. This is achieved with the following code:
ExtractTranscriptomeSequence(transcript_list = transcript_list, ref_genome = "Homo_sapiens.GRCh38.dna.primary_assembly.fa", genome_gtf = gtf, RNA_fragment = "three_prime_utr", exome_prefix = "chr4and5_3UTR")
The reference genome can be found through Ensembl, but for users who do not want to download that 3.2GB file for the sake of this vignette, the FASTA output of the above code is available via:
chr4and5_3UTR.fa <- system.file("extdata/chr4and5_3UTR.fa", package="nearBynding")
These sequences can then be submitted to CapR for folding.
runCapR(in_file = chr4and5_3UTR.fa)
Warning: This step can take hours or even days depending on how many transcripts are submitted, how long the RNA fragments are, and the maximum distance between base-paired nucleotides submitted to the CapR algorithm.
The nearBynding pipeline can accomodate either a BAM file of aligned CLIP-seq reads or a BED file of peak intervals. BAM files must be sorted and converted to a bedGraph file first, whereas BED files can be read-in directly.
bam <- system.file("extdata/chr4and5.bam", package="nearBynding") sorted_bam<-sortBam(bam, "chr4and5_sorted") CleanBAMtoBG(in_bam = sorted_bam)
BED or bedGraph files can then be mapped onto the concatenated transcriptome
using the chain file created by GenomeMappingToChainFile()
. This way, only
the protein binding from transcriptomic regions of interest will be considered
in the final protein binding analysis.
liftOverToExomicBG(input = "chr4and5_sorted.bedGraph", chain = "test.chain", chrom_size = "chr4and5_3UTR.size", output_bg = "chr4and5_liftOver.bedGraph")
For BED file inputs, use the additional argument format = "BED"
.
The RNA structure information from the CapR output also needs to mapped onto
the concatenated transcriptome. There are six different binding contexts
calculated by CapR -- stem, hairpin, multibranch, exterior, internal,
and bulge. processCapRout()
parses the CapR output, converts it into six
separate bedGraph files, and then performs liftOverToExomic()
for all the
files.
For this sake of this vignette, the CapR outfile is available:
processCapRout(CapR_outfile = system.file("extdata/chr4and5_3UTR.out", package="nearBynding"), chain = "test.chain", output_prefix = "chr4and5_3UTR", chrom_size = "chr4and5_3UTR.size", genome_gtf = gtf, RNA_fragment = "three_prime_utr")
It is possible for users to input their own RNA structure information rather
than using CapR. The information should be in BED file format and can be input
into liftOverToExomicBG()
to map the RNA structure data to the same
transcriptome as the protein binding data.
This is the process that directly answers the question, "What does RNA structure look like around where the protein is binding?" StereoGene is used to calculate the cross-correlation between RNA structure and protein binding in order to visualize the RNA structure landscape surrounding protein binding.
If CapR is used to determine RNA structure, runStereogeneOnCapR()
initiates
StereoGene for a given protein against all CapR-generated RNA structure
contexts.
For the sake of this vignette, use outfiles()
to pull the StereoGene output
files to your local directory if you do not want to run StereoGene.
runStereogeneOnCapR(protein_file = "chr4and5_liftOver.bedGraph", chrom_size = "chr4and5_3UTR.size", name_config = "chr4and5_3UTR.cfg", input_prefix = "chr4and5_3UTR")
get_outfiles()
If external RNA structure data is being tested, runStereogene()
initiates
analysis for a given protein and a single RNA structure context.
Note: The input track file order matters! The correct order is:
1) RNA structure 2) protein binding
Otherwise, data visualization will be inverted and all downstream analysis will be backwards.
runStereogene(track_files = c("chr4and5_3UTR_stem_liftOver.bedGraph", "chr4and5_liftOver.bedGraph"), name_config = "chr4and5_3UTR.cfg")
The cross-correlation output of StereoGene can be visualized as either a heatmap or a line plot. Examples of both are below.
For CapR-derived RNA structure, all six contexts can be viewed simultaneously.
visualizeCapRStereogene(CapR_prefix = "chr4and5_3UTR", protein_file = "chr4and5_liftOver", heatmap = TRUE, out_file = "all_contexts_heatmap", x_lim = c(-500, 500)) visualizeCapRStereogene(CapR_prefix = "chr4and5_3UTR", protein_file = "chr4and5_liftOver", x_lim = c(-500, 500), out_file = "all_contexts_line", y_lim = c(-18, 22))
knitr::include_graphics("all_contexts_heatmap.jpeg") knitr::include_graphics("all_contexts_line.pdf")
Warning: This step may take up to an hour for a full transcriptome.
Alternatively, a single context can be viewed at a time.
visualizeStereogene(context_file = "chr4and5_3UTR_stem_liftOver", protein_file = "chr4and5_liftOver", out_file = "stem_line", x_lim = c(-500, 500)) visualizeStereogene(context_file = "chr4and5_3UTR_stem_liftOver", protein_file = "chr4and5_liftOver", heatmap = TRUE, out_file = "stem_heatmap", x_lim = c(-500, 500))
knitr::include_graphics("stem_heatmap.jpeg") knitr::include_graphics("stem_line.pdf")
Although this specific, limited example does not provide a particularly visually stimulating image, larger data sets (which provide many more StereoGene windows) result in narrower peaks and less noise.
In order to determine the similarity of two binding contexts, we can calculate the Wasserstein distance between curves. A small value suggests two binding contexts are very similar, whereas larger values suggest substantial differences.
For example, calculate the distance between the stem and hairpin contexts visualized above.
bindingContextDistance(RNA_context = "chr4and5_3UTR_stem_liftOver", protein_file = "chr4and5_liftOver", RNA_context_2 = "chr4and5_3UTR_hairpin_liftOver")
Now compare it to the distance between internal and hairpin contexts.
bindingContextDistance(RNA_context = "chr4and5_3UTR_internal_liftOver", protein_file = "chr4and5_liftOver", RNA_context_2 = "chr4and5_3UTR_hairpin_liftOver")
We can see that the stem context is less similar to the hairpin context than the internal context, and this is reflected in the calculated distances.
Questions? Comments? Please email Veronica Busa at vbusa1@jhmi.edu
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.