The nullranges package contains functions for generation of sets of genomic ranges, represented as GRanges objects, for exploring various null hypotheses. For example, one may want to assess the significance of the overlap of two sets of ranges, by generating statistics of what would be expected under the null distribution of no relationship between these sets. We note that many other test statistics are supported via the flexible framework described here, combining nullranges with plyranges. The nullranges package contains a number of vignettes describing different functionality, from basic to more advanced usage. For a listing of all the vignettes in the package, one can type:

vignette(package="nullranges")

Choice of methods

The nullranges package has two distinct branches of functionality: matching or bootstrapping to generate null sets of genomic ranges.

In the vignette sections below we describe these two branches and give more formal definitions, where each branch has associated vignettes and man page sections (see Reference tab from the package website). To give a high level overview, we first provide a decision tree that helps to indicate the considerations when choosing between these branches of functionality.

Define features as a set of genomic locations of interest, these are minimally represented with a chromosome, start and width, and optionally strand and metadata (genomic ranges). Given a set of features, suppose that we wish to create an alternate set that represents a null feature set, sometimes also referred to as "background" or a "control set". For example, we may see that our primary features are close to transcription start sites, but are they closer than we would expect compared to a reasonable choice of null features?

Additionally, define pool as a much larger set of genomic locations compared to the primary features, and covariates as pieces of metadata attached to all of the considered features (stored as mcols(<GRanges>), which may be continuous, integer, factor, etc.).

Our choice of methods is informed by the following:

The following decision tree then informs what methods to choose:

DiagrammeR::grViz("digraph {
  graph [layout = dot, rankdir = TB]

  node [shape = rectangle]        
  source [label = 'Features defined with respect to a pool']
  poolyes [label = 'Yes']
  poolno [label =  'No']
  cov [label = 'Potential confounding covariates']
  covyes [label = 'Yes']
  covno [label = 'No']
  diffcov [label = 'Different covariate distribution']
  diffcovyes [label = 'Yes']
  diffcovno [label =  'No']
  match [label = 'matchRanges']
  random [label = 'Random sample']
  context [label = 'Local genomic context matters']
  contextyes [label = 'Yes']
  contextno [label = 'No']
  cluster [label = 'Features cluster in genome']
  random2 [label = 'Random sample']
  clusteryes [label = 'Yes']
  clusterno [label = 'No']
  boot [label = 'bootRanges']
  shuffle [label = 'Shuffle + exclusion']

  source -> poolyes
  source -> poolno
  poolyes -> cov
  cov -> covyes
  cov -> covno
  covyes -> diffcov
  covno -> random
  diffcov -> diffcovyes
  diffcov -> diffcovno
  diffcovyes -> match
  diffcovno -> random
  poolno -> context
  context -> contextyes
  context -> contextno
  contextyes -> cluster
  contextno -> random2
  cluster -> clusteryes
  cluster -> clusterno
  clusteryes -> boot
  clusterno -> shuffle
}", height = 500)

In summary, while matchRanges does not control for genomic distribution and clustering of features, bootRanges does not directly control for feature covariates independent of proximity and local context.

Related work

For general considerations of generation of null feature sets or segmentation for enrichment or colocalization analysis, consider the papers of @de_2014, @haiminen_2007, @huen_2010, @ferkingstad2015, @dozmorov2017, @kanduri_2019 (with links in references below). Other Bioconductor packages that offer randomization techniques for enrichment analysis include LOLA [@LOLA] and regioneR [@regioneR]. Methods implemented outside of Bioconductor include GAT [@GAT], GSC [@bickel_2010], GREAT [@GREAT], GenometriCorr [@GenometriCorr], ChIP-Enrich [@ChIP-Enrich], and OLOGRAM [@ologram].

We note that our block bootstrapping approach closely follows that of GSC, while offering additional features/visualizations, and is re-implemented within R/Bioconductor with efficient vectorized code for working with GRanges objects [@granges]. In addition, regioneR and OLOGRAM also offer randomization schemes that can preserve e.g. inter-feature distances.

In the context of these related works, a design choice of nullranges is its modularity: it solely generates null or background feature sets as Bioconductor objects, such that these can be used in workflows that compute and perform inference on arbitrary statistics within the plyranges [@Lee2019] framework. We have referred to such workflows as "fluent genomics" [@fluent], or "tidy genomics", as these workflows build up complex operations from commonly recognized verbs (filter, mutate, join, group by, summarize, etc.) strung together with pipes (%>%).

Further description of matching and bootstrapping

Suppose we want to examine the significance of overlaps of genomic sets of features $x$ and $y$. To test the significance of this overlap, we calculate the overlap expected under the null by generating a null feature set $y'$ (potentially many times). The null features in $y'$ may be characterized by:

  1. Drawing from a larger pool $z$ ($y' \subset z$), such that $y$ and $y'$ have a similar distribution over one or more covariates. This is the "matching" case. Note that the features in $y'$ are original features, just drawn from a different pool than y. The matchRanges method is described in @matchRanges doi: 10.1101/2022.08.05.502985.
  2. Generating a new set of genomic features $y'$, constructing them from the original set $y$ by selecting blocks of the genome with replacement, i.e. such that features can be sampled more than once. This is the "bootstrapping" case. Note that, in this case, $y'$ is an artificial feature set, although the re-sampled features can retain covariates such as score from the original feature set $y$. The bootRanges method is described in @bootRanges doi: 10.1101/2022.09.02.506382.

In other words

  1. Matching -- drawing from a pool of features but controlling for certain characteristics, or covariates
  2. Bootstrapping -- placing a number of artificial features in the genome but controlling for the local dependence among features (e.g. features clustering in the genome and/or having correlated metadata)

Options and features

We provide a number of vignettes to describe the different matching and bootstrapping use cases. In the matching case, we have implemented a number of options, including nearest neighbor matching or rejection sampling based matching. In the bootstrapping case, we have implemented options for bootstrapping across or within chromosomes, and bootstrapping only within states of a segmented genome. We also provide a function to segment the genome by density of features. For example, supposing that $x$ is a subset of genes, we may want to generate $y'$ from $y$ such that features are re-sampled in blocks from segments across the genome with similar gene density. In both cases, we provide a number of functions for performing quality control via visual inspection of diagnostic plots.

Consideration of excluded regions

Finally, we recommend to incorporate list of regions where artificial features should not be placed, including the ENCODE Exclusion List [@encode_exclude]. This and other excluded ranges are made available in the excluderanges Bioconductor package by Mikhail Dozmorov et al. [@excluderanges]. Use of excluded ranges is demonstrated in the segmented block bootstrap vignette.

References



nullranges/nullranges documentation built on Aug. 29, 2023, 12:13 a.m.