knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
options("scipen"=10, "digits"=5)

Introduction

Hi-C is a technique that involves chromosome conformation capture followed by high-throughput sequencing. Unlike 3C, 4C or 5C, which target specific regions, Hi-C can provide genome-wide information about the spatial proximity of regions. The raw data takes the form of paired-end reads connecting restriction fragments. The resolution of a HiC experiment is limited by the number of paired-end sequencing reads produced and by the sizes of restriction fragments. To increase the power to distinguish real interactions from random noise, HiC data is commonly analysed in bins ranging from 20kb - 1Mb in size. There are a variety of tools available for binning the data, controlling for noise (e.g., self-ligations of restriction fragments), and finding significant interactions.

This vignette shows you how r Biocpkg("fugi") can be used to investigate significant interactions in Hi-C data. The data we are using comes from the @seitan2013cohesin study and can be downloaded from GEO. It involves Hi-C data from wild-type double positive murine thymocytes. The experiment was carried out using the HindIII restriction enzyme. The paired-end reads were aligned to the mm9 mouse genome assembly and the HOMER software [@heinz2010simple] was used to filter reads and detect significant interactions at a resolution of 100 kbp. For the purposes of this vignette, we will consider only data from chromosomes 14 and 15.

Making a GenomicInteractions object

Load the data by specifying the file location and experiment type.
You can also include an experiment name and description.

library(fugi)
hic_file <- system.file("extdata", 
    "Seitan2013_WT_100kb_interactions.txt", package="fugi")
hic_data <- makeGenomicInteractionsFromFile(hic_file, 
    type="homer", 
    experiment_name = "HiC 100kb", 
    description = "HiC 100kb resolution")
seqlengths(hic_data) <- c(chr15 = 103494974, chr14 = 125194864)
hic_data

The GenomicInteractions class provides a convenient representation of the genomic regions involved in pairwise interactions. Each interaction is considered to have two "anchor" regions, each of which is represented as a GRanges object. Readers are referred to the r Biocpkg("GenomicInteractions") vignettes for more details.

# Accessing anchors.
first(hic_data) 
second(hic_data)

# Accessing the common regions that are
# re-used across multiple interactions.
regions(hic_data, 1)
regions(hic_data, 2) # happens to be the same as regions(, 1).

Metadata for each interaction (e.g., p-values, FDR) is stored as a DataFrame accessed by mcols(), similar to the metadata of a simple GRanges. You can also access single metadata columns using $.

mcols(hic_data)
head(hic_data$LogP)

We can check that the first and second anchors are of the expected size (100 kbp). Some anchors are shorter than expected due to the bin being at the end of a chromosome.

summary(width(regions(hic_data, 1)))

There are r length(hic_data) interactions in total, with a total of r sum(interactionCounts(hic_data)) reads supporting them. To calculate the average number of reads per interaction, use interactionCounts() to get the number of reads per interaction:

head(interactionCounts(hic_data))
mean(interactionCounts(hic_data))

However, since we have FDRs and p-values, it is probably more informative to use these to find interactions of interest. Note that the FDR column in the dataset will be named differently depending on the number of interactions in your data. For simplicity in this document we will rename it!

hic_data$p.value <- exp(hic_data$LogP)
plot(density(hic_data$p.value))

hic_data$fdr <- hic_data$FDR.Benjamini..based.on.3.68e.08.total.tests.
plot(density(hic_data$fdr))

Summary statistics

r Biocpkg("fugi") provides some functions to plot summary statistics of your data that may be of interest, such as the percentage of interactions that are between regions on the same chromosome (cis-interactions), the percentage of interactions on different chromosomes (trans-interactions), or the number of reads supporting each interaction.

plotCisTrans(hic_data)
plotCounts(hic_data, cut=30)

These plots can be used to assess the level of noise in your dataset - for example, the presence of many interactions with high FDRs or low read counts suggests that the data may be noisy and contain a lot of false positive interactions. You can then subset the GenomicInteractions object by FDR or by number of reads to remove noisy interactions.

sum(hic_data$fdr < 0.1)
hic_data_subset <- hic_data[hic_data$fdr < 0.1]

plotCisTrans(hic_data_subset)
plotCounts(hic_data_subset, cut=30)

Subsetting by FDR will tend to remove interactions that are supported by fewer reads. Trans interactions tend to have fewer reads than cis interactions, so the percentage of trans interactions often decreases.

Annotation

Obtaining regions of integerst

One of the most powerful features of r Biocpkg("fugi") is that it allows you to annotate interactions by whether the anchors overlap genomic features of interest, such as promoters or enhancers. Genome annotation data can be obtained from, for example, UCSC databases using the r Biocpkg("GenomicFeatures") package. The code below obtains the promoters of Refseq genes extended to a width of 5 kbp:

## Not run
library(GenomicFeatures)
mm9.refseq.db <- makeTxDbFromUCSC(genome="mm9", table="refGene")
refseq.genes = genes(mm9.refseq.db)
refseq.transcripts = transcriptsBy(mm9.refseq.db, by="gene")
refseq.transcripts = refseq.transcripts[ names(refseq.transcripts) %in% unlist(refseq.genes$gene_id) ] 
mm9_refseq_promoters <- promoters(refseq.transcripts, 2500,2500)
mm9_refseq_promoters <- unlist(mm9_refseq_promoters[seqnames(mm9_refseq_promoters) %in% c("chr14", "chr15")])
mm9_refseq_promoters <- unique(mm9_refseq_promoters) # some duplicate promoters from different transcript isoforms

#get gene symbols
mart = useMart("ensembl", dataset="mmusculus_gene_ensembl")
genes <- getBM(attributes = c("mgi_symbol", "refseq_mrna"), filter = "refseq_mrna",
               values = mm9_refseq_promoters$tx_name, mart = mart)
mm9_refseq_promoters$geneSymbol <- genes$mgi_symbol[match(mm9_refseq_promoters$tx_name, genes$refseq_mrna)]

names(mm9_refseq_promoters) <- mm9_refseq_promoters$geneSymbol
na.symbol <- is.na(names(mm9_refseq_promoters))
names(mm9_refseq_promoters)[na.symbol] <- mm9_refseq_promoters$tx_name[na.symbol]

... while this next chunk obtains putative enhancers from http://chromosome.sdsc.edu/mouse/download.html, defined by @shen2012map using mouse ENCODE data.

#Not run
download.file("http://chromosome.sdsc.edu/mouse/download/thymus.zip", "thymus.zip")
unzip("thymus.zip")
thymus_enh <- read.table("thymus/thymus.enhancer.txt", sep="\t", stringsAsFactors = FALSE)
thymus_enh <- GRanges(seqnames=thymus_enh$V1, ranges=IRanges(thymus_enh$V2, width=1))
thymus_enh <- resize(thymus_enh, fix="center", width=500)
thymus_enh <- thymus_enh[seqnames(thymus_enh) %in% c("chr14", "chr15")]
names(thymus_enh) <- paste("ENH", as.character(thymus_enh), sep = "_")

Downloading all the data can be a slow process, so the data for chromosomes 14 and 15 are provided with this package.

data("mm9_refseq_promoters")
head(mm9_refseq_promoters)
data("thymus_enhancers")
head(thymus_enh)

Annotating by node classes

annotateInteractions takes a list of features in GRanges or GRangesList format, and annotates the interaction anchors based on overlap with these features. The list of annotation features should have descriptive names, as these names are stored in the annotated GenomicInteractions object and used to assign anchor (node) classes.

annotation.features <- list(promoter = mm9_refseq_promoters, enhancer = thymus_enh)
annotateInteractions(hic_data_subset, annotation.features)

In addition, the features themselves should have names or IDs. These can be the names() of the feature object, or an "id" metadata column (note lowercase). These names or IDs for each feature are stored in the metadata columns of the regions of the GenomicInteractions object. Each anchor may overlap multiple features of each type, so the columns containing feature names or IDs are stored as lists.

head(regions(hic_data_subset, 1))
head(regions(hic_data_subset, 1)$promoter.id)

Node classes (or anchor classes) are assigned to each anchor based on overlap with annotation features. Classes are assigned based on the order of those features within the list passed to the annotation function, whereby features earlier in the list take priority. For example, if the supplied list is list(promoter=..., transcript=...), an anchor that overlaps both a promoter and a transcript will be given the node class "promoter". Any anchors which are not annotated with any of the given features will be assigned the class "distal".

In this case, anchors can be "promoter", "enhancer", or "distal". As the anchors are large, most of them overlap at least one promoter or enhancer.

table(regions(hic_data_subset, 1)$node.class)

Interaction types

Interaction types are determined by the classes of the interacting nodes. As we only have three node classes, we have six possible interaction classes, summarised in the plot below. Most of the interactions are between promoters.

plotInteractionAnnotations(hic_data_subset, legend = TRUE)

We can subset the data to look at interaction types that are of particular interest. Distal regions interacting with a promoter may contain regulatory elements such as enhancers or insulators. To get all promoter-distal interactions:

length(hic_data_subset[isInteractionType(hic_data_subset, "promoter", "distal")])

This is more succinctly done using the is.pd wrapper function. Similar functions are available for other node class combinations, as well as for identifying cis or trans interactions.

length(hic_data_subset[is.pd(hic_data_subset)])
sum(is.trans(hic_data_subset))

In this case, we have annotated the anchors with known enhancer positions, so we can subset the data to get just enhancer-promoter interactions.

hic_data_ep <- hic_data_subset[isInteractionType(hic_data_subset, "promoter", "enhancer")]

To find the strongest promoter-enhancer interactions, we would do:

max(interactionCounts(hic_data_ep))
most_counts <- hic_data_ep[which.max(interactionCounts(hic_data_ep))]
most_counts

Or the most significant promoter-enhancer interaction:

min(hic_data_ep$p.value)
min_pval <- hic_data_ep[which.min(hic_data_ep$p.value)]
min_pval

The distance between these interacting regions, or any interacting regions, can be found using calculateDistances. For trans interactions, the distance will be NA.

calculateDistances(most_counts, method="midpoint")
calculateDistances(min_pval,method="midpoint")
summary(calculateDistances(hic_data_subset,method="midpoint"))

Visualising interactions of interest

The interaction with the highest number of counts in this dataset is between an anchor containing the promoter of a gene called Trib1, and an adjacent region containing more than ten putative enhancers.

anchorOne(most_counts)$promoter.id
anchorTwo(most_counts)$enhancer.id

r Biocpkg("fugi") provides methods to visualise interactions using the r Biocpkg("Gviz") package. For example, we can view interactions in the region around the Trib1 promoter by creating an InteractionTrack object:

library(Gviz)
Trib1_region <- resize(mm9_refseq_promoters["Trib1"], fix = "center", width = 1000000)
interaction_track <- InteractionTrack(hic_data_subset, name = "HiC", chromosome = "chr15")
plotTracks(interaction_track, chromosome="chr15", 
    from=start(Trib1_region), to=end(Trib1_region))

We can add more data to the plot to visualise features in this region and customise how this data is displayed. Here, interactions within the region of interest are coloured red, and interactions with other regions of chr15 are shown in blue. The height of the arcs representing the interactions is scaled to the number of counts supporting them.

promoterTrack <- AnnotationTrack(mm9_refseq_promoters, genome="mm9", name="Promoters",
    id=names(mm9_refseq_promoters),  featureAnnotation="id")
enhTrack <- AnnotationTrack(thymus_enh, genome="mm9", name="Enhancers", stacking = "dense")

displayPars(promoterTrack) <- list(fill = "deepskyblue", col = NA, 
    fontcolor.feature = "black", fontsize=8,
    just.group="below")
displayPars(enhTrack) <- list(fill = "black", col = NA)
displayPars(interaction_track) = list(col.interactions="red", 
    col.anchors.fill ="blue",
    col.anchors.line = "black",
    interaction.dimension="height", 
    interaction.measure ="counts",
    plot.trans=FALSE,
    plot.outside = TRUE, 
    col.outside="lightblue", 
    anchor.height = 0.1)

plotTracks(list(interaction_track, promoterTrack, enhTrack),
    chromosome="chr15", from=start(Trib1_region), to=end(Trib1_region), 
    sizes=c(0.6, 0.2, 0.2))

You can see what customisation options are available for a r Biocpkg("Gviz") track using availableDisplayPars(), and find more information about this and other track types in the r Biocpkg("Gviz") vignette.

Export to BED12 format

Interactions stored in a GenomicInteractions object can be exported to BED12 format for viewing in a genome browser. Anchors are visualised as thick blocks connected by thinner interactions.

## Not run
export.bed12(hic_data_subset, fn="hic_data_FDR0.1.bed", drop.trans = TRUE)

References



LTLA/fugi documentation built on June 22, 2019, 8:50 p.m.