knitr::opts_chunk$set(
  collapse = TRUE,
  # comment = "#>", 
  tidy = FALSE, 
  cache = FALSE, 
  results = 'markup'
)

Introduction

SingleMoleculeFootprinting is an R package providing functions to analyze Single Molecule Footprinting (SMF) data. Following the workflow exemplified in this vignette, the user will be able to perform basic data analysis of SMF data with minimal coding effort. \

Starting from an aligned bam file, we show how to \

For compatibility with our analysis tools, we recommend the user to perform genomic alignments using the qAlign function from QuasR as exemplified in the chuck below.

cl = makeCluster(40)
prj = QuasR::qAlign(sampleFile = Qinput,
              genome = "BSgenome.Mmusculus.UCSC.mm10",
              aligner = "Rbowtie",
              projectName = "prj", 
              paired = "fr",
              bisulfite = "undir", 
              alignmentParameter = "-e 70 -X 1000 -k 2 --best -strata",
              alignmentsDir = "./", 
              cacheDir = tempdir(),
              clObj = cl)

Installation {.tabset}

GitHub installation

For installation from GitHub, visit our repository.

remotes::install_github(repo = "https://github.com/Krebslabrep/SingleMoleculeFootprinting.git", ref = "main")

Bioconductor installation (NOT AVAILABLE YET)

For Bioconductor installation, visit our page on Bioconductor

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("SingleMoleculeFootprinting")

Loading

library(BSgenome.Mmusculus.UCSC.mm10)
library(SingleMoleculeFootprinting)

Define arguments

SingleMoleculeFootprinting inherits QuasR's philosophy of working with pointer files. Briefly, a pointer is a tab-delimited file with two or three fields indicating the location of the input data files. For more details, please check the qAlign documentation. Here we show how our pointer file looks like.\

N.b.: The execution of the example code in this vignette, as well as some functions of this software package, depend on accessory data that can be downloaded and cached through the data package SingleMoleculeFootprintingData. Under the hood, we download this data for the user and create the QuasR sampleSheet file for the example data at tempdir() under the name NRF1Pair_Qinput.txt.\ We just ask the user to make sure that their default cache directory has enough storage capacity (~1 Gb). The user can check this via ExperimentHub::getExperimentHubOption(arg = "CACHE") and eventually change this value via ExperimentHub::setExperimentHubOption(arg = "CACHE", value = "/your/favourite/location").

# Under the hood:
#       download and cache SingleMoleculeFootprintingData data
#       locate bam file and make QuasR sampleSheet

CacheDir = ExperimentHub::getExperimentHubOption(arg = "CACHE") # <------- grab this as the cache dir. And write somewhere to set a sensible location for the user
eh = ExperimentHub::ExperimentHub()
SMFdata = AnnotationHub::query(eh, "SingleMoleculeFootprintingData")
ExperimentHub::loadResources(hub = SMFdata, package = "SingleMoleculeFootprintingData")

RecordIDs = rownames(AnnotationHub::mcols(x = SMFdata))[1:2]
CachedBamPath = AnnotationHub::fileName(object = SMFdata)[names(AnnotationHub::fileName(object = SMFdata))==RecordIDs[1]][[1]]
CachedBaiPath = AnnotationHub::fileName(object = SMFdata)[names(AnnotationHub::fileName(object = SMFdata))==RecordIDs[2]][[1]]
# Copy example data to tempdir()
file.copy(from = CachedBamPath, to = tempdir())
file.copy(from = CachedBaiPath, to = tempdir())
# Rename
file.rename(from = paste0(tempdir(), "/", rev(rev(base::strsplit(CachedBamPath, "/")[[1]])[1])), to = paste0(tempdir(), "/NRF1pair.bam"))
file.rename(from = paste0(tempdir(), "/", rev(rev(base::strsplit(CachedBaiPath, "/")[[1]])[1])), to = paste0(tempdir(), "/NRF1pair.bam.bai"))
# Make sampleSheet
data.frame(
  FileName = list.files(tempdir(), pattern = ".bam$", full.names = TRUE),
  SampleName = "NRF1pair_DE_"
  ) -> df
readr::write_delim(df, paste0(tempdir(), "/NRF1Pair_Qinput.txt"), delim = "\t")
Qinput = paste0(tempdir(), "/NRF1Pair_Qinput.txt")
suppressMessages(readr::read_delim(Qinput, delim = "\t"))

Library QCs

Before investing in a deep sequencing run for a SMF experiment, it is advisable to first perform a shallow sequencing run and to perform quality controls on the sequencing libraries.\

QuasR QC report

It is always a good idea to produce a QC report after aligning. We can suggest the user to employ the qQCreport function from QuasR.

qQCReport(Qinput, pdfFilename=NULL, chunkSize=1e6L, useSampleNames=FALSE, clObj=NULL)

Bait capture efficiency

If SMF was performed on an large genome (e.g Mouse) it is possible that bait capture was performed to focus the sequencing efforts: here we check how efficient that process was by essentially computing the ratio of genomic alignments inside bait regions over the total ones.

BaitRegions = SingleMoleculeFootprintingData::EnrichmentRegions_mm10.rds()
BaitCaptureEfficiency = BaitCapture(sampleSheet = Qinput, genome = BSgenome.Mmusculus.UCSC.mm10, baits = BaitRegions)
print(BaitCaptureEfficiency)

In this case the capture efficiency equals to r toString(BaitCaptureEfficiency) because the example data was purposefully subset for an interesting region which falls entirely within baits. Under normal circumstances, one expects this value to be lower than 1: for the mouse genome for example we observe values around 0.7.

Conversion rate precision

Here we ask how much of the observed Cytosine methylation falls in the expected contexts (CpG or GpC). Beware, this is a much slower computation than the above.

ConversionRatePrecision = ConversionRate(sampleSheet = Qinput, genome = BSgenome.Mmusculus.UCSC.mm10, chr = 6)
## the code above takes > 10 minutes to run, so we load a pre-computed value here
ConversionRatePrecision = readRDS(file = system.file("extdata", "vignette_ConversionRatePrecision.rds", 
                                                     package = "SingleMoleculeFootprinting"))

For this sample, the observed conversion rate is r toString(ConversionRatePrecision)%. This value happens to be slightly below the expected value of ~95%

### Intersample correlation
Compare methylation fractions across samples. This gives a broad overview of whether the methylation pattern of a cell type or condition reflects what previously observed

Analysis of single site

Methylation extraction

Methylation values at the single molecule level can be extracted for the region of interest from aligned data using the CallContextMethylation function. \ Under the hood, the function performs the following operations: \

knitr::kable(data.frame(ExperimentType = c("Single enzyme SMF", "Double enzyme SMF", "No enzyme (endogenous mCpG only)"), 
                        substring = c("\\_NO_", "\\_DE_", "\\_SS_"), 
                        RelevanContext = c("DGCHN & NWCGW", "GCH + HCG", "CG"), 
                        Notes = c("Methylation info is reported separately for each context", "Methylation information is aggregated across the contexts", "-")))
MySample = suppressMessages(readr::read_delim(Qinput, delim = "\t")[[2]])
Region_of_interest = GRanges(seqnames = "chr6", ranges = IRanges(start = 88106000, end = 88106500), strand = "*")

Methylation = CallContextMethylation(sampleSheet = Qinput, 
                                     sample = MySample, 
                                     genome = BSgenome.Mmusculus.UCSC.mm10, 
                                     range = Region_of_interest, 
                                     coverage = 20, 
                                     ConvRate.thr = 0.2)
Methylation[[1]]
Methylation[[2]][1:10, 1:10]

The following chuck has the sole scope of downsampling reads to make the vignette lighter. The user should skip this.

n = 1000
readsSubset = sample(1:nrow(Methylation[[2]]), n)
Methylation[[2]] = Methylation[[2]][readsSubset,]

Plotting single sites

Already at this stage it is possible to visually inspect results. To that end, we provide the function PlotAvgSMF to plot the average SMF signal, defined as 1 - average methylation, at a genomic locus of interest. Optionally, the user can pass a GRanges object of genomic coordinates for the transcription factor binding sites (TFBSs) of interest to include in the plot, we show an example of such an object.

TFBSs = GenomicRanges::GRanges("chr6", IRanges(c(88106216, 88106253), c(88106226, 88106263)), strand = "-")
elementMetadata(TFBSs)$name = c("NRF1", "NRF1")
names(TFBSs) = c(paste0("TFBS_", c(4305215, 4305216)))
print(TFBSs)

PlotAvgSMF(MethGR = Methylation[[1]], range = Region_of_interest, TFBSs = TFBSs)

Furthermore, the function PlotSM can be used to plot the single molecule SMF information at a given site.

PlotSM(MethSM = Methylation[[2]], range = Region_of_interest)

Optionally, hierarchical clustering can be performed by setting the parameter SortedReads = "HC". This can be useful to aggregate reads visually in order to spot PCR artifacts. N.b. reads will be downsampled to 500.

PlotSM(MethSM = Methylation[[2]], range = Region_of_interest, SortedReads = "HC")

Single Molecule Sorting

Ultimately though, we want to classify reads based on their patterns of molecular occupancy. To that end we provide the functions SortReadsBySingleTF and SortReadsByTFCluster to classify reads based either on the occupancy patterns of one or multiple transcription factors. \

Under the hood, the classification is based on the definition of $n+2$ bins (with $n$ being the number of TFs). The $n$ bins are each centered around one of the TFBSs of interest, while the 2 extra bins are located upstream and downstream of the two outmost TFBSs. \

For SortReadsBySingleTF, the coordinates of the bins relative to the center of the TFBS are [-35;-25], [-15;+15], [+25,+35]. Instead, the function SortReadsByTFCluster draws a bin with coordinates [-7;+7] around the center of each TFBS, a bin with coordinates [-35;-25] relative to center of the most upstream TFBS and a bin with coordinates [+25,+35] relative to the center of the most downstream TFBS. The user can also employ custom coordinates by specifying them explicitly using the function SortReads. \

For each read, the binary methylation status of the cytosines contained in each bin is averaged to either a 0 or a 1 such that each read is eventually represented as sequence of $n+2$ binary digits, for a total of $2^{(n+2)}$ possible sequences. \

Here, we show a usage case for the SortReadsByTFCluster function, as we have already identified the double binding of NRF1 at the genomic site under scrutiny. Usage and exploration of the output is identical for the other function, except for the the format of the TFBSs argument which should consist of a GRanges object of length 1 for SortReadsBySingleTF and of length $>$ 1 for SortReadsByTFCluster.

SortedReads_TFCluster = SortReadsByTFCluster(MethSM = Methylation[[2]], TFBSs = TFBSs)
print(paste0("Number of retrieved states: ", as.character(length(SortedReads_TFCluster))))
print("States population:")
unlist(lapply(SortedReads_TFCluster, length))

N.b. non-populated states are not returned. \

The output of each of these sorting functions can be passed directly as the SortedReads argument of the PlotSM function.

PlotSM(MethSM = Methylation[[2]], range = Region_of_interest, SortedReads = SortedReads_TFCluster)

N.b. despite sorting reads by a TF cluster is in principle possible with clusters of any size, as of now the PlotSM function can only deal with TF pairs.

In order to be quantitative about these observations, the user can employ the StateQuantificationPlot. The function outputs a bar plot annotated with the percentage of reads found in each state. The function takes, as argument, the output of either of the two sorting functions.

StateQuantificationPlot(SortedReads = SortedReads_TFCluster)

Finally, we provide the wrapper function PlotSingleSiteSMF to plot at once the three kinds of information detailed above and to export results as a pdf.

PlotSingleSiteSMF(ContextMethylation = Methylation, 
                  sample = MySample, 
                  range = Region_of_interest, 
                  SortedReads = SortedReads_TFCluster, 
                  TFBSs = TFBSs, 
                  saveAs = NULL)


Krebslabrep/SingleMoleculeFootprinting documentation built on Nov. 19, 2022, 3:56 a.m.