knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The GeneNeighborhood package provides statistical and graphical tools to analyze the direct neighborhood of sets of genes (called the "focus genes"). Here, "direct neighborhood" means the first upstream and downstream gene neighbors. We're interested in evaluating if these neighbors are enriched for a specific orientation/configuration and/or if their distance to the focus genes is shorter or larger than expected by chance.

The main questions that this package addresses are, for a given set of focus genes:

Let's start

Load libraries

Load the library:

library(GeneNeighborhood)

Throughout this vignette, we will also use other packages:

library(GenomeInfoDb)
library(knitr)
library(dplyr)
library(reshape2)
library(GenomicRanges)

Get some data

The GeneNeighborhood package provides a toy dataset named Genegr that is used in the README file on Github and throughout the examples in the documentation of the functions.

For a more realistic example, we will work on the Arabidopsis thaliana gene annotations. These are available as a package in Bioconducor:

library(TxDb.Athaliana.BioMart.plantsmart25)

If you can't find a Txdb package for your organism in Bioconductor or AnnotationHub, it is relatively easy to create your own package, using for example the GenomicFeatures or the ensembldb package.

We get the gene annotations from the TxDb package using:

gn <- genes(TxDb.Athaliana.BioMart.plantsmart25)

In order to limit the computation time but still work on a realistic example we keep the first 2 chromosomes:

gn <- GenomeInfoDb::keepSeqlevels(gn, 1:2, pruning.mode = "coarse")

This represents r length(gn) genes.

Obtain information on the upstream/downstream genes

Neighborhood definition

The first step in our analysis consists in extracting from the GRanges object the information about the neighbors. While this might seem trivial (e.g. using appropriately the precede and follow functions from the GenomicRanges), the presence of overlapping annotations creates a variety of cases that can be encountered throughout the genome that makes things quickly complicated.
For now, the GeneNeighborhood package only analyzes genes that either don't overlap with any other genes (the easy case) or that overlap with only one other gene, as illustrated here:

Obtaining gene neighborhoods

For each feature/gene, we extract information (orientation and distance, potential overlaps) about their upstream/downstream neighbors with:

GeneNeighbors <- getGeneNeighborhood(gn)

Here, we evaluate the neighborhood of all genes but in some situations we may want to filter the gn object before extracting genes neighborhoods. For example, if we are interested in potential local interactions between expressed genes we may filter gn to keep only genes that are expressed / transcribed.
Also, we focus on genes here but the functions of this package may be used to study the neighborhood of any other type of annotations (e.g. promoters, motifs, transposable elements, etc.), as long as they are oriented.

Analyze the orientation of the neighbors

If we have a set of genes, defined by any other mean, we may ask if their upstream and downstream neighbors tend to be in a specific orientation more than expected by chance.

We illustrate this with a random set of 400 genes:

set.seed(123) # for reproducibility
randGenes <- sample(names(gn), 400)

We extract the statistics about the orientation of their neighbors using:

## Neighbors Orientation Statistics:
NOS <- analyzeNeighborsOrientation(randGenes, 
                                   GeneNeighborhood = GeneNeighbors)

By default all genes are used as a universe and an Fisher exact test is performed to evaluate the enrichment in a specific orientation.
By default, the function also analyzes the "other" orientation but an enrichment in this category is generally not directly interpretable. We can remove this orientation from the analysis using:

NOS <- analyzeNeighborsOrientation(randGenes, 
                                   GeneNeighborhood = GeneNeighbors,
                                   keepOther = FALSE)

We obtain the following table:

NOS  %>% 
  dplyr::mutate(p.value = signif(p.value, 2)) %>%
  knitr::kable(digits = c(1,1,1,2,1,2,300), 
               format = "markdown",
               caption="100 random genes")

We can plot the corresponding percentages using:

plotNeighborsOrientation(NOS)

The analysis of neighbor genes'orientations allows to detect an enrichment for a specific orientation ("SameStrand" or "OppositeStrand") in the absence of overlap, as well as the situations where the upstream or downstream gene overlaps with the focus genes ("SameOverlap"" or "OppositeOverlap"" orientations).

Analyze the proximity of the neighbors

Whether or not there is a significant enrichment for a given orientation, we may ask if the gene neighbors are located at a specific distance from the focus genes, i.e. closer or farther than expected by chance.

Extracting the distances to the neighbor genes:

We can extract the distances to both the upstream and downstream genes with:

randDist <- dist2Neighbors(GeneNeighborhood = GeneNeighbors,
                           geneset = randGenes,
                           genesetName = "RandomGenes")

The function automatically filters out genes with undefined upstream or downstream genes (e.g. when a gene is located at a chromosome border).

The resulting table looks like this (top 6 rows only):

head(randDist, n = 6) %>%
  knitr::kable(format = "markdown",
               caption="Distances to gene neighbors")

The function also be used with a named list containing different gene sets:

myGSdist <- dist2Neighbors(GeneNeighborhood = GeneNeighbors,
                           geneset = list("RandomGenes" = randGenes, 
                                          "AllGenes" = names(gn)))

For a single gene set and a single side (upstream or downstream) the distances can be extracted with the underlying function:

randDist_Upstream <- getDistSide(GeneNeighbors, randGenes, "Upstream") 

which gives the following table (top 6 rows):

head(randDist_Upstream, n = 6) %>%
  knitr::kable(format = "markdown",
               caption="Distances to upstream gene neighbors")

Create a new gene set with close upstream neighbors

To illustrate how a specific proximity pattern can be detected and represented, we will use the intergenic distances that we can now extract to define a subset of genes that are enriched for genes with short upstream intergenic distances, whatever the orientation (same or opposite strand).

Let's extract all upstream distances:

allUpstrimDist <- getDistSide(GeneNeighbors, names(gn), "Upstream")

Then we define a probability of selecting a gene that is inversely proportional to its upstream distance:

#We replace extreme values by the 99th percentile
updistQ <- pmin(allUpstrimDist$Distance,
                quantile(allUpstrimDist$Distance,0.99)) 
updistQ[updistQ<1e3] <- floor(updistQ[updistQ<1e3]/2) #we give more weight to distances below 1Kb
updistQ[updistQ>1e3] <- pmax(updistQ[updistQ>1e3]*1.5, max(updistQ)) #and less weight to distances above 1Kb
probs <- (max(updistQ) - updistQ) / 
         sum(max(updistQ) - updistQ)

Then we select 800 genes using these probabilities:

set.seed(123)
closeUpstreamNeighbors <- sample(allUpstrimDist$GeneName, 
                                 800, 
                                 prob=probs)

We now have 2 sets of genes:

  1. A set of 400 genes that were randomly selected (randGenes)
  2. A set of 800 genes that are enriched for close upstream neighbors (closeUpstreamNeighbors)

Let's assemble these sets together with the "universe" (i.e. all genes studied) in a list:

MyGeneSets <- list("RandomGenes" = randGenes,
                   "CloseUpstreamNeighbors" = closeUpstreamNeighbors,
                   "AllGenes" = GeneNeighbors$GeneName)

and extract the distances to the gene neighbors for all these genes:

Dist2GeneSets <- dist2Neighbors(GeneNeighbors, MyGeneSets)

Obtain descriptive statistics on the intergenic distances

We can obtain descriptive statistics for each side (upstream/downstream), each orientation (same or opposite strand) and each gene set with the distStats function:

MyDistStats <- distStats(Dist2GeneSets, 
                         confLevel = 0.95,
                         nboot = 500, 
                         CItype="perc",
                         ncores = 2)

The function also calculates bootstrap confidence intervals for the mean and median using the boot package.
Note: To reduce computation time, we use only 500 boostrap samples to compute the confidence intervals of the mean and median but a value at least equal to the default 1e4 should be preferred. The function also supports parallel computing (not available on Windows) via the ncores argument

We obtain the following table of statistics:

MyDistStats  %>% 
  knitr::kable(digits = 0,
               format = "markdown",
               caption="Descriptive statistics on intergenic distances")

Test for differences in intergenic distances

Evaluating if a set of genes has intergenic distances with their neighbors that are shorter or longer than expected by chance can be formulated in different ways.

Here too, the distances between the genes and their neighbors can be compared between those for the selected set of genes (GeneList) and thos for the entire universe (which defaults to all genes provided in the GeneNeighborhood argument but can be adjusted with the GeneUniverse argument). We implement 3 statistical tests to compare the results obtained for the gene set vs those for the universe:

All tests are performed in their bilateral version (i.e. with a two-sided alternative hypothesis)

MyDistTests <- distTests(Dist2GeneSets,
                         Universe = "AllGenes",
                         MedianResample = TRUE,
                         R = 1e3)

Table:

MyDistTests  %>% 
  dplyr::mutate(KS.pvalue = signif(KS.pvalue, 2),
                Wilcox.pvalue = signif(Wilcox.pvalue, 2),
                Indep.pvalue = signif(Indep.pvalue, 2),
                Median_resample.pvalue = signif(Median_resample.pvalue, 2)) %>%
  knitr::kable(digits = c(0,0,0,300,300,300,300), 
               format = "markdown",
               caption="Test statistics on intergenic distances")

Plotting the results:

We can plot the distribution of intergenic distances for the different sets of genes:

plotDistanceDistrib(Dist2GeneSets)

Or using violin plots:

plotDistanceDistrib(Dist2GeneSets,
                    type = "violin")

When relatively few genes are analyzed, it is worth looking at individual points. They can be plotted, along with a boxplot, under the density plots, using:

plotDistanceDistrib(Dist2GeneSets,
                    type = "jitterbox")

Metagene profiles

Another way to study the neighborhood of a set of genes is to produce an average profile representing the coverage of gene annotations in regions surrounding the borders of genes in this set.
Such a representation provides information on both the orientation and the distance of the neighbors. It also allows to compare different groups of genes.

The steps to obtain these profiles are:

When gene annotations overlap on the same strand, we can decide to keep this information (i.e. we can get coverage values >1) or to only focus on presence/absence of annotations (i.e. strand-specific coverage is 0 or 1) using the argument usePercent = TRUE (in this case, the average of 0/1 values for aligned signals is a percentage of bases covered with annotations).

To extract +/-2Kb of signal on each side of all genes:

Prof <- annotationCoverageAroundFeatures(gn,
                                         sidedist = 2000,
                                         usePercent = TRUE,
                                         nbins=10)
#Prof may be pre-compiled to avoid long computation?
# For alternative option, see 
#https://community.rstudio.com/t/precompiling-vignette-with-devtools/1583/2
#https://community.rstudio.com/t/vignettes-that-require-locally-stored-data/2392/2

Prof <- readRDS("Prof.rds")

The Prof object is a list with elements:

cat(paste(names(Prof), collapse="\n"))

Each element is an RleList containing the strand-specific coverage of annotations on the gene body (Feature), in the region (-2kb to the start of the gene) located upstream of the genes (UpstreamBorder) and in the region (end of the gene to +2kb) located downstream of the genes (DownstreamBorder).
The Sense strand is the strand on which the focus gene (Feature) is annotated and the Antisense strand is the opposite strand.

These elements are now assembled into gene-level vectors containing the upstream region, gene body and downstream region:

Prof <- assembleProfiles(Prof)

Now we define groups of genes on which we want to calculate an average profile and its associated confidence interval:

GeneGroups <- list("AllGenes" = names(gn),
                   "RandomGenes" = randGenes,
                   "CloseNeighbors" = closeUpstreamNeighbors)

For each group, we then calculate the average profile and 95% confidence interval:

avgProf <- list()
for (i in 1:length(GeneGroups)) {
  avgProf[[i]] <- list()
  avgProf[[i]]$sense <- getAvgProfileWithCI(Prof$Profiles_Sense,
                                            selFeatures = GeneGroups[[i]],
                                            pos = c(-2000:0, 1:10, 0:2000))
  avgProf[[i]]$antisense <- getAvgProfileWithCI(Prof$Profiles_Antisense,
                                                selFeatures = GeneGroups[[i]],
                                                pos = c(-2000:0, 1:10, 0:2000))
}
names(avgProf) <- names(GeneGroups)

Before plotting, we need to assemble these profiles in a single table. We must provide the x-coordinates for these "metagene" profiles. The x-coordinates should be in [0,5] with the upstream region covering [0,2[, the TSS having x-coordinate 2, the gene body covering ]2,3[, the TSS having coordinate 3 and the downstream region covering ]3,5].

xcoord = c(seq(0, 2, length.out = 2001),
           seq(2, 3, length.out = 12)[2:11],
           seq(3, 5, length.out = 2001))

And we assemble the metagene profiles in a data.frame with:

avgProf_df <- reshape2::melt(avgProf,
                             measure.vars = "Profile", value.name = "Profile") %>%
                  dplyr::rename("Strand" = "L2",
                                "GeneSet" = "L1") %>%
                  dplyr::mutate(Xcoord=rep(xcoord, 2*length(GeneGroups)))

Now these profiles can be plotted:

plotMetageneAnnotProfile(avgProf_df, usePercent = TRUE)

By default, the plot goes all the way up to 100% coverage, which is the value on the gene body for the sense strand. To change this behavior, we can use:

plotMetageneAnnotProfile(avgProf_df, usePercent = TRUE) +
    ggplot2::coord_cartesian(ylim=c(0,50))

Distance to single-point features

One limitation of these average profiles is that all the length of the neighborhing genes/features is taken into account to produce the profiles. For example, we see the enrichment of close upstream neighbors in the average profiles ("CloseNeighbors" group), but we cannot directly answer a question such as: "What is the fraction of genes from this (and other) set that have a neighboring TES, on the same strand, at less than 600bp".

In order to do this, we first extract the TES, which are "single-point features" located at the (3') end of the genes.

tes <- getTES(gn)

If we were interested in TSS (transcription start site), we could get their coordinates using:

tss <- GenomicRanges::promoters(gn, upstream = 0, downstream = 1)

Then we extract the coverage of TES around the genes (here +/- 1kb) using (we only use 3 bins for each gene as we are not interested to look at gene bodies here) :

tescov <- annotationCoverageAroundFeatures(annot = tes, 
                                           features = gn, 
                                           sidedist = 2000, 
                                           usePercent = TRUE,
                                           nbins = 3)

Now, in order to obtain cumulative presence of TES as we move away from the border of our genes of interest, we first extend the "presence" (coverage=1) value of each TES to any distance larger than the location of each TES. For example, if a TES is found at 100bp upstream of a gene, then we count it as "present" (coverage =1) for any distance greater or equal to 100bp.

extTEScov <- extendPointPresence(tescov, sidedist = 2000) 

For different groups of genes we now obtain the cumulative percentage of genes that have a TES at less than a given distance with:

cumulTES <- getCumulPercentProfiles(extTEScov,
                                    genesets = GeneGroups)

and plot the results:

plotCumulPercentProfile(cumulTES)

Case studies



pgpmartin/GeneNeighborhood documentation built on Sept. 2, 2021, 6:37 a.m.