knitr::opts_chunk$set(
  warning = FALSE,
  message = FALSE
)

About the experiment

The experiment we'll be running through is an ATAC-cap-seq experiment of Arabidopsis plant leaves taken from plants exposed to either a mock (water) treatment or infected with a pathogen. We have the bare minimum replicate number, just three independent samples for each of the mock or infected treatments. The reads are paired-end and 50 nt long. The BAM files are sorted using samtools from SAM files generated by BWA but at the start no further processing has been done. The paths to the BAM files and the bait region coordinates on your system are described above.

Preparing input files

The first step of any atacR analysis is to build the input files, these are basically files listing where the important data files are on your system. You'll need a file listing the BAM files and a file listing the bait windows. These are described more fully in the loading vignette.

The atacR package comes with a small set of ATAC-cap-seq data built in. It is installed along with the package, and we'll use that data in this tutorial.

A convenience function is built into atacR that will find the built-in data and build the files you need to follow this tutorial. All you need to do is decide where those files should be written. Below we write them to the Desktop. Two files will appear on the Desktop - bait_regions.gff and sample_treatment_bam_mappings.csv. If you inspect these you'll see the structure.

library(atacr)
input_files <- make_tutorial_data("~/Desktop/")
input_files

The object input_files holds the paths to the input files we made, so we'll use that to get going.

Generating read counts

Once we have the files ready, we can begin analysis. We can extract information from the files and make the counts we're interested in with the make_counts() function. In this step we'll set the read filter parameters to decide which reads and alignments in the BAM file are of sufficiently good quality to be counted.

Depending on whether you have ATAC-cap-seq or RNA-cap-seq this function does slightly different things. If you have ATAC-cap-seq, this function divides the bait regions in the genome into sub-windows of a fixed width. If you have RNA-cap-seq the whole bait region is considered to be a single window.

Below we will use the default window sizes (50nt, non-overlapping) and read filters (described in the loading vignette ). As this is ATAC-cap-seq data we need to specify that too.

counts <- make_counts(input_files$bait_regions_file, 
                      input_files$mapping_file,
                      is_rnaseq = FALSE
           )

The resulting object counts has a few slots containing information. The most important are bait_windows which describes the windows in the bait regions and non_bait_windows which describes all the spaces in between the bait_windows. By defauly all functions will work on bait_windows but you can change the subset using the which parameter (see the atacr which vignette for more information. )

Summarising data

Once everthing is loaded, it is a good idea to check the counts object is as you expect. The summary() function does this.

Summary statistics

summary(counts)

The summary is very long but worthwhile. A feature of atacR is that it keeps counts in non-bait region windows. Non-bait region windows are those outside the bait regions. The non-bait regions are not the same size as the bait window regions - A single non-bait window covers all the space between the last window of one bait region and the first window of the next.

As we can see the coverage in this small sample is relatively low - that's an artefact of small files to keep the tutorial running quite quickly. But most windows have an average of ~ 10 counts and the off-target reads are very low. < 1 %.

Summary and QC plots

The atacR package has a range of summary and QC plots for visualising different aspects of the data.

The samples can be inspected through plots. The standard plot function creates a few summary style plots enabling you to view coverage distribution and region density. As it summarises windows, the more windows you have, the slower it runs!

plot(counts)

A coverage threshold plot can reveal the number of windows that have coverage lower than a given value.

windows_below_coverage_threshold_plot(counts)

Here we can see that mock_rep1 and mock_rep3 have fewer windows below the coverage threshold, so are generally better covered.

We can see from all these that although the read mapping and filtering is specific to the quite a lot of windows (~ 2000 in each sample) have counts of 0, which indicates that some of the DNA in the sequence regions was not sampled. You may wish to play with window size settings to see how robust this phenomenon is to window size. Increasing the window size will likely reduce the zero count windows number by merging counts from adjacent windows. Decreasing the window size will likely increase zero count windows. The level of granularity you use will be study dependent and if you intend to conclude absence of counts (e.g for detection of closed chromatin) then you'll want to be very careful with comparison to specific control windows to make that comparison.

Specific window counts

You can examine specific window counts quite easily. The internal object holding the data is of class SummarizedExperiment, which is part of BioConductor, so you can use functions in standard BioConductor packages to interrogate them. Here's how you might get information on specific window counts.

First you must create a region of interest. Use (GenomicRanges)[https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html] package to do this.

roi <- GenomicRanges::GRanges(seqnames = "Chr1", ranges = 245951:246250)

Next, subset the window set of interest with (IRanges)[https://bioconductor.org/packages/release/bioc/html/IRanges.html]

small_section <- IRanges::subsetByOverlaps(counts$bait_windows, roi)

The resulting object is a SummarizedExperiment which doesn't print literally, as it can be quite big. To get at the actual count matrix, use the assay function.

SummarizedExperiment::assay(small_section)

Sample reproducibility

A PCA plot can be used to examine the similarity between the different samples. Here we can see that two of the infected replicates are way off from each other and the other more similar samples. You can use these plots to identify any samples that are extremely different from the others. As all of the infected samples are quite different in different ways, we may be seeing just a large amount of experimental variability in our results, which can be important too. So we'll proceed with the data, keeping in mind that variability may be large and for particular treatments, we may need to gather more replicates.

sample_pca_plot(counts)

An MA plot can show you eccentricities in each sample (See the wiki page for more information)[https://en.wikipedia.org/wiki/MA_plot]. In the atacR MA plot a common reference is used, the median value for a windows as a common denominator for sample.

ma_plot(counts)

In this MA plot we see some structure in the data, the strong lines in each subplot indicate the points with zero for the count. The infected overall show higer counts than the mock. The usual assumption of most windows not changing between samples may not hold, here as the clouds of points seem quite shifted between mock and infected.

Normalisation

The normalisation step helps us to reduce systematic between-sample variability. Sequence data are hard to normalise, and cannot be normalised well by simple scaling. For RNASeq data there are numerous methods such as FPKM etc that sort of normalise. The best approaches with ATAC-cap-seq data are to find the least varying windows, then calculate factors and use those to scale the rest of the data with.

atacR provides three types of normalisation. These are

  1. Library size
  2. Scale factor
  3. Goodness of Fit

The best of these is 3. Goodness of Fit. It is fast, automatically finds the least varying and best features in the data to normalise with and does a reasonable job of between-sample normalisation. It is usually the best one to choose. It is particularly useful when you don't know whether many windows will be changing or just a few will be, as it should perform the same regardless.

The Library size normalisation is the most basic and the one that most studies seem to use for normalisation - the basis of this is that each count is divided by the mean count for all samples in that treatment the sample. For most ATAC purposes this will be underpowered, because the low number of windows or high proportions of changing windows will cause skew between samples. This method useful when you have reasonably high counts (> 20 mean) and you are certain few windows (< 10%) will display differential counts.

The Scale factor normalisation is provided to allow interaction with other normalisation from other packages. With this you provide a number for each sample and the counts in each sample are divided by the respective number. It is only useful when you have some other method that generates factors that you wish to use to scale counts.

Check out the normalisations vignette for further information.

Goodness of Fit normalisation

Here we'll run Goodness of Fit (GoF) on the sample data. First step is to run the GoF code and find the most stable windows across the samples to use to normalise.

auto_controls <- find_controls_by_GoF(counts)

We can use these to check the selected control windows have lower GoF than the non-selected windows using the plot_GoF() function

plot_GoF(counts, controls = auto_controls)

They are better. They have a lower, spikier mean Goodness of Fit. The Non-control data has a long tail distribution so the difference is quite pronounced. So we can use now generate the normalisation factors and apply them. We'll save the resulting information to a new slot in the counts object. Then we'll plot the pre- and post- normalised data to see the effects of the normalisation

gof_norm_factors <- get_GoF_factors(counts)

gof_normalised_counts <- scale_factor_normalise(counts, 
                                          scaling_factors = gof_norm_factors)

counts$normalised_counts <- gof_normalised_counts


plot_counts(counts)
plot_counts(counts, which = "normalised_counts")
ma_plot(counts)
ma_plot(counts, which = "normalised_counts")

We can see that the distributions get a little closer to each other and that the spread in the data in MA plots is reduced a little. The variability in these data are quite high though. See the normalisations vignette for further discussion.

Differential window counts

Once you are happy with the normalisation, you can try to estimate which windows have differential counts. atacR gives you three methods.

  1. edgeR exact test - this is a wrapper around the edgeR method for single factor designs, using the estimateDispersion method. This method was designed for genome wide studies so works best when only a few of the (~5 %) of the windows are expected to have differential counts. It is the most sensitive in this situation though.
  2. bootstrap t test - this is a brute force method that uses resampling of each windows sample counts and recalculating of the Student's t statistic to come up with a background distribution of t. If the observed t is at the edges of this distribution, differential counts are called. This method is useful when any number of the windows may show differential counts.
  3. Bayes Factor test, this calculates the Bayes factor for each window. The ratio of the Bayes factor for control and test is returned on a window by window basis. If the ratio is over a given number (4 by default) a differential count is called. This method is useful when any number of the windows may show differential counts.

See the differential windows vignette for further discussion

We can perform differential analysis in the following ways, we'll use the which argument (see which vignette ) to make sure we analyse the normalised counts.

 edgeRexact_result <-  edgeR_exact(counts,
                which = "bait_windows",
                treatment_a =  "infected",
               treatment_b = "mock",
                remove_zeros = TRUE)

bootstrap_result <- estimate_fdr(counts,
              which = "normalised_counts",
              treatment_a =  "infected",
              treatment_b = "mock",
              iterations = 10
)

bayesfactor_result <- estimate_bayes_factor(counts,
              treatment_a = "infected",
              treatment_b =  "mock",
              which = "normalised_counts"
              )

The resulting dataframe holds the result of these calculations

head(bootstrap_result)

head(bayesfactor_result)

head(edgeRexact_result)

Each of these methods works on single factor designs, there is a multiclass variant that works on common control designs. See the differential windows vignette for further discussion of these.

bf_multi <- estimate_bayes_factor_multiclass(counts, "mock",
              factor = 0.5,
              which = "normalised_counts"
              )

head(bf_multi)

fdr_multi <- estimate_fdr_multiclass(counts, "mock",
             fdr_level = 0.05,
              which = "normalised_counts"
             )

head(fdr_multi)

The edgeR_multiclass() function does not return a dataframe, instead it returns the native DGELRT objects (see the DGELRT manual for more information) from each comparison in a list() object with names as per the treatment used.

edgeR_multiclass(counts,"mock", 
  remove_zeros = TRUE, 
  which = "bait_windows")


TeamMacLean/atacr documentation built on May 9, 2019, 4:24 p.m.