In davetgerrard/GenomicLayers: Simple, sequence-based simulation of epi-genomes.

This will be an introductory vignette on how and why to use GenomicLayers.

Introduction

Rationale...

GenomicLayers simulates epi-genetic changes to DNA sequence or entire genomes. It does this by linking the DNA sequence, which cannot be changed, with one or more equally sized layers, which can be modified by binding factors. The binding factors may recognise specific DNA sequences or motifs or they may recognise existing modifications already applied to the layers. After binding, the binding factors may then make modifications to any or all of the layers. In this way, some regions of the sequence or genomes may change states (e.g. open vs closed) and permit or deny access to further binding factors.

Installation

You may have already done this but if not, you can install the GenomicLayers package direct from GitHub if you already have the devtools package installed and working. For devtools see here: ...

If you already have devtools, then the following commands should install GenomicLayers:-

    library(devtools)
    install_github(repo="davetgerrard/GenomicLayers",build_vignettes = TRUE)

The above may take several minutes and requires several dependencies. If it does not work, or you are in a hurry, leave out the 'build_vignettes'. Hint: make sure devtools is up to the most recent version.

    install_github(repo="davetgerrard/GenomicLayers")

To view the introduction vignette, type

    vignette("GenomicLayersIntroduction")

To view what vignettes are available, type

    vignette(package="GenomicLayers")

Set up the target sequence or genome

Load the GenomicLayers library.

library(GenomicLayers)    # (this command outputs a lot of information, which is hidden in this vignette)

GenomicLayers is designed to work on whole eukaryotic genomes including human and mouse but as these are rather large, for the purpose of demonstration we will use a few chromosomes from the Saccharomyces cerevisiae (Yeast) SacCer3.

library(BSgenome.Scerevisiae.UCSC.sacCer3)
genome <- BSgenome.Scerevisiae.UCSC.sacCer3
seqinfo(genome)

sequences_to_keep <- c("chrI", "chrII", "chrIII")   
# chrIII also contains the mating type loci HMR and HML
genomeSub <- keepBSgenomeSequences(genome, sequences_to_keep)
genomeSub    # this should now still be a useable BSgenome object but with only three  chromosomes.

Create some layers on our genome

This is where we use the sequence and build some layers on it.

sacCer.LS <- createLayerSet.BSgenome(genomeSub,
                                     n.layer=2, 
                                     layer.names=c("recruiter", "promoter"))

Create some binding factors

This is where we specify some binding factors that can bind to the sequence and, optionally, modify the layers (but not the sequence).

If you want to learn about pattern matching against DNA sequences, we recommend that you check out the vignettes in the Biostrings package. Perhaps also have a look at PWMenrich, MOTIV etc.

We first create a binding factor that can recognise a DNA motif for a TATA box, we'll call it bf1.tata and then test it against the layerSet object we've already created. It should create an GRanges or hits object with the locations for all matches to the sequence TATAA. We'll also specify that this binding factor should mark those positions on the layer recruiter. "TATAWAWA"

bf1.tata <- createBindingFactor.DNA_consensus(name="bf1.tata", patternString = "TATAWAWA", fixed =FALSE,  mod.layers="recruiter",mod.marks=1) 
#print(bf1)
matchBindingFactor.BSgenome(layerSet=sacCer.LS, bindingFactor=bf1.tata)

Then create a second binding factor that does not recognise a particular sequence, but which can bind to marks on the recruiter layer . If we test this binding factor alone, it should produce a hits object with no hits because the recruiter layer starts with not marked regions.

bf2.tataBox <- createBindingFactor.layer_region(name="bf2.tataBox", 
                          profile.layers="recruiter",profile.marks=1,                                                   mod.layers="promoter",mod.marks=1,  patternLength = 5) 
#print(bf2)
matchBindingFactor.BSgenome(layerSet=sacCer.LS, bindingFactor=bf2.tataBox)

To use multiple binding factors together in a sequence, we add them into a named list. IMPORTANT: currently, this list must have names, so typically we specify the names when building the list. The list names do not have to agree with the $name property of the individual binding factors (this allows the same binding factor to be re-used without re-specifying the model).

Additionally, create a vector of integer abundances that determine the effective activity of each binding factor. This can be fixed or varied throughout the simulation.

bf_list <- list(bf1.tata=bf1.tata, bf2.tataBox=bf2.tataBox)
#print.bfSet(list(bf1))  # not working in package. FIX this .

bf_abund <- structure(c(500, 500), names=names(bf_list))

Run a sequence of layer binding

Run a simulation and collect the output

sim.sacCer.LS <- runLayerBinding.BSgenome(layerList=sacCer.LS, factorSet=bf_list,
                                          bf.abundances = bf_abund, verbose=TRUE)

Hopefully, the layer called 'promoter' should now be marked in some places.