Simulating nascent RNA data

The procedure we use for simulating signals of polymerase density uses composites of empirical signal from exemplar regions. To do this, first we must curate a set of genes where virtually all polymerase density in the region can be confidently attributed to a single annotation. Included with this package are a curated set of archetypes that can be loaded as follows.

library(nascentRNASim)
ta <- load_archetype_set(id = "celastrol_dukler2017")

Archetypes can be viewed as follows:

view_archetype(ta, 1)

Because we simulate empirically from known profiles, the length of transcripts we can simulate from are limited to the lengths we already have profiles for. This means that we must simulate transcript annotations based on our archetype set so that the lengths of the transcripts match. However, we do simulate such that we match the properties of an existing template as closely as possible. In order to do this we require a template which can be any existing GRanges object. In this case we use a pre-cached txdb of hg38 and sample a thousand loci for simulation.

cache_set_dir()
# Only download a txdb if you need a list of transcripts
# download_txdb("hsapiens", ensembl_version = "98")
txdb <- suppressMessages(load_txdb("homo_sapiens-grch37.p13-ensembl_genes_99.txdb"))
template <- transcripts(txdb)
sim_anno <- simulate_annotations(ta = ta, n = 100, template = template, seed = 5,
                                 genome = "hg19")

Then we simulate abundances for the annotations. In order to do that you must specify a function that generates a vector of random numbers greater than or equal to zero. It also must have as an argument n which specifies the number of observations to draw (A number of the random number generators from the stats package fits all these requirements; eg. rlnorm, rpois, rgamma, etc.). Additionally, you may design your own custom function to pass here as long as it meets these criteria. Additional arguments to the sampling_dist function can be passed with their keywords. There is also a zero_point_mass argument that sets the fraction of genes with zero abundance.

annotations <- simulate_abundances(annotations = sim_anno, sampling_dist = rgtex)

Then we resample from the archetypes to simulate 5' read density

sim_dat <- simulate_data(ta, annotations = annotations, show_progress = T)

Once we've simulated data, the annotations and data it can be exported to a directory along with the archetype object used to generate it.

tmp_dir <- tempdir()
export_simulation(annotations = annotations, data = sim_dat, ta = ta, 
                    directory = tmp_dir,
                    simulation_id = "test")

Seting up new archetypes

Curate your own list of archetypes

No automated way to do this currently. Generally we recommend filtering a list of transcripts by a crude metric fo per transcript abundance to get only high abundance candidates, then manually curating those that would be good archetypes. Here are the rough criteria we used in curating the celastrol_dukler2017 archetypes:

  1. The transcript is clearly the only one producing the vast majority of the signal (or allowed multiple transcripts producing signal if they all have about the same TSS/TES). This is cross-checked with PRO-cap data from Core & Martins et al. 2014.
  2. The signal between the transcript and adjacent transcripts drops off to approximately background levels between them.
  3. The start of transcription is pretty close the the annotated TSS.
  4. The transcript averages one read per 20bp when quantified on the combined replicates.

Once a set of archetypes has been curated you can store a reduced lightweight bigwig that contains only the archetype regions plus some flanking signal. This makes it easier to package up for other people to use.

# store_reduced_archetype_bw(bed_path, bigwig_plus, bigwig_minus,
# out_dir = "~/Downloads", flank_length = 2e4)

Create a transcript_archetypes object

  id_path <- system.file("extdata", "archetype_sets", "celastrol_dukler2017",
                         package = "nascentRNASim")
  # Load archetype annotations
  bed_path <- file.path(id_path, paste0(id, "_archetypes.bed"))
  bed <- rtracklayer::import.bed(bed_path)

  # Load bigwigs
  bwp_path <- file.path(id_path, paste0(id, "_plus.bw"))
  bwm_path <- file.path(id_path, paste0(id, "_minus.bw"))
  # Create transcript_archetype object, see documentation for details
  ta <- transcript_archetypes(transcripts = bed, bigwig_plus = bwp_path,
                              bigwig_minus = bwm_path,
                              flank = 6e3,
                              abundance_filter = 0.05,
                              mask = c(1e3, 1e3))


CshlSiepelLab/nascentRNASim documentation built on Dec. 17, 2021, 3:08 p.m.