In AnnikaGable/chromium: Analyze and visualize high throughput chromatin interaction data, such as Hi-C or ChIA-PET

This package was designed to analyze genome-wide chromatin conformation data. We use the InteractionSet class of objects to represent the data. We implement a cosine similarity based measurement of spatial proximity between chromatin regions. In addition, we provide functionalities for data binning, normalization, and visualization.

In a typical workflow we:

read in the data
bin and count the contacts between the bins to generate ligation frequency matrices (LFM)
normalize the LFM with an iterative proportional fitting (IPF) algorithm
apply cosine similarity to the normalized LFM
create whole chromosome interaction maps
visualize interaction matrices along with chromatin modifications and genomic annotatation tracks such as gene models and other customizable annotation tracks
call significant interactions

Import

There are two possible input formats for chromium: You can either provide the interaction data as InteractionSet objects, or in the simple file format generated by the preprocessing scripts provided in the Appendix.

Using input data of the `InteractionSet` class

Our tool uses the InteractionSet class to represent chromatin conformation data. For more details on the InteractionSet class please refer to the InteractionSet package vignette. In essence, an InteractionSet object stores information about both the number of interactions between user defined genomic intervals and the annotation of the genomic intervals.

Using input data generated by the pre-processing scripts from the Appendix

In order to use the files generated with the Python pre-processing pipeline we provide in the Appendix, we need to transform them to the InteractionSet class. This can be done with our import function. import requires the name of the .bed file containing the genomic annotation of the restriction fragments generated by the enzyme used for the chromatin conformation experiment. The next argument is the path to all input files (defaulting to the current working directory), and a list of .raf file names. If no .raf file names are given, all .raf files from the directory will be used as input and appended to each other.

As an example, we can import the tethered conformation capture (TCC) data for chromosome 2 from Pekowska et al., 2016 and transform it into an InteractionSet object. While these are intrachromosomal data, you may also use a whole genome matrix with our functions.

devtools::load_all()
library(Matrix)
library(InteractionSet)
library(chromium)

#chr2 <- import(bed='RFanno_HindIII.bed', raflist = list('chr2.raf'), workDir = "~/chromium/inst/extdata")
chr2 <- import_chrom(bed = 'RFanno_HindIII.bed', raflist = list('chr2.raf'), workDir = file.path(system.file("inst/extdata", package = "chromium")))

Using toy example data files

This package also includes very small "toy" example files, a genome annotation file (exRFanno.bed), and several restriction fragment pair files (*.raf). Load the example:

toyIset_imported <- import_chrom(bed = 'exRFanno.bed', raflist = list('exRFpairs.raf'), workDir = file.path(system.file("inst/extdata", package = "chromium")))

Inspecting the InteractionSet object

But back to the chromosome 2 data: The resulting InteractionSet object is also included in the package and can be directly loaded using:

data(chr2)

The genomic annotation is now saved as a GenomicRanges object:

regions(chr2)

And the interactions are a GInteractions object:

interactions(chr2)

We can see here that the data actually contains not only intrachromosomal interactions from chromosome 2, but also interactions between chromosome 2 and other chromosomes. In the Subsetting section we will see how to limit both the interactions as well as the annotation to chromosome 2 only.

The metadata of this InteractionSet are empty so far:

metadata(chr2)

Subsetting the InteractionSet

If you choose to use only certain chromosomes or regions for your analysis, we provide a possibility to easily subset your data. The ´from´ and ´to´ parameters signify the genomic positions of the start and end of your subset, and they are optional. Subsetting will work in the same way for binned data, without the need to change the to and ´from´ values.

chr2_subset <- subset_chrom(chr2, chr=2)

print(object.size(chr2), units='auto') # the full data for chr2 and the annotation for all chromosomes
print(object.size(chr2[1:1]), units='auto') # just one interaction, but the complete annotation for all chromosomes
print(object.size(chr2_subset), units='auto') # chr2 with one million interactions, and only the corresponding annotations.

The resulting ´InteractionSet´ object contains only those interactions within the selected genomic region, and only the genomic annotation for this region (for chr2 in this case). This significantly reduces the object size in comparison to the standard method of subsetting included in the ´InteractionSet´ package.

Binning

Having imported the chromatin conformation data, the next step consists in the choice of the binning approach. Ideally, data would be analyzed at its technical resolution i.e. the restriction fragment resolution. However, this is difficult to obtain and usually data are binned into bins containing several fragments to increase the signal to noise ratio.

The binning is performed using bin_chrom function. It requires two arguments, the InteractionSet object along with the preferred bin size. The bin size can be set to any size larger than the restriction fragment size. A larger bin size will result in lower matrix resolution but faster calculations.

Let's bin our chromosome 2 contact matrix into 10kb bins:

chr2_binned = bin_chrom(chr2, binSize = 100000)

The result is still an InteractionSet object, with the bin size stored in the metadata.

head(regions(chr2_binned))
head(interactions(chr2_binned))
metadata(chr2_binned)

Normalization

The next step in the analysis is the matrix balancing step resulting in the normalized contact matrices.

The number of interactions between two genomic regions in the contact matrix reflects their crosslinkability, and thus their spatial proximity. This value is however influenced by the overall 'visibility' of the two interacting partners. Differences in the overall visibility of genomic regions may be caused by sequence composition (CpG rich versus CpG poor), restriction enzyme accessibility, mappability, restriction fragment length and other unknown sources. Therefore there is a need to remove these effects from the data to infer the true contact strength between regions in the matrix. As suggested by Imakaev et al., one of the most powerful methods to normalize the LFMs is to perform the iterative proportional fitting. In this approach, for each bin of the interaction matrix connecting row $i$ with column $j$, the interaction signal $x_(ij) > 0$ is divided by the product of the row sum of $i$ and the column sum of $j$. This normalization process is repeated for the number of iterations chosen. In this way we correct the observed interaction strength between row $i$ and column $j$ by the signal we would expect given only the overall visibility of these regions. All row sums and column sums of the resulting normalized interaction matrix will have a value of 1. We implement the IPF following Pukelsheim and Simeone [@Pukelsheim2009] as this algorithm ensures either matrix convergence or it results in an oscillation of the matrix around the converged state.

To allow the user to assess the normalization quality and choose the number of iterations to perform, we compute a loss function which reflects the distance between the corrected and fully converged matrix at each iteration.

chr2_bin_norm = normalize_chrom(chr2_binned, numberOfIterations = 20)

Visualization

We provide a visualization function tailored to generate publication-quality interaction heatmaps along with genomic annotation tracks such as gene models, CpG island annotation or local interaction zone annotation. It is also possible to include chromatin modification data (ChIP-seq tracks) and gene expression (RNA-seq tracks). The arguments of our function visualize are the InteractionSet, the genomic coordinates of the region to be displayed, the data tracks (RNA-seq and/or ChIP-seq data, optional), custom annotation track data (optional), and smoothing parameters smoothing, boxSize and sigma. To generate the gene models there are three possibilities: The user can provide either an R object containing the gene models prepared by themself or the organism identifier which will be used to load the most recent gene models from biomart. The gene models for the two most recent human and mouse assemblies are included in the package for reproducibility and to . To choose these already prepared gene models, give an assembly identifier instead of the organism identifier i.e. "mm9", "mm10", "hg19", or "hg38".

Smoothing of the interaction matrices

We provide an option to display smoothened interaction profiles using our visualisation function. To this end, we implemented two alternative approaches. In the first approach, the user provides the size of the box used for smoothing along with the smoothing strength, i.e. the relative contributions of the central pixel and the surrounding pixels to the final smoothened signal. In the second approach, particularly useful for more aggressive smoothing, the user provdes the size of the box along with the sigma parameter for a gaussian smoothing function.

Appendix

We include here the necessary steps to obtain filtered aligned reads from the fastq files. The filtering includes: - trimming of the reads and removal of the 5 prime end of the read if the restriction enzyme sequence has been found in the read - read alignemnt - removal of poorly mapped reads - annotation of the reads to the closest restriction fragment in the genome - removal of reads mapping to mitochondrial genome - removal of the reads which display abnormal orientation (reference: Jin et al. 2013) - removal of reads which map within the sonication size - removal of the PCR duplicates which are reads with the same start and end positions.

The generated interaction files will be ready to use with the package.

Preprocessing of raw sequencing reads with Python

In this section we describe the preprocessing of the raw sequence files. To run the preprocessing you need to install the Python package HTSeq. Information on how to install this package can be found here: http://www-huber.embl.de/users/anders/HTSeq/doc/install.html#install For our analyis we used Python 2.7 (r27:82500, May 19 2011, 14:00:50) and HTSeq 0.5.3p9ur1.

Trimming at ligation site

Due to the HiC protocol reads from short restriction fragments could contain sequence regions from two restriction fragments. In such cases, the first regions corresponded to first restriction fragment and the second region to the interacting fragment that also can be found in the second read mate. The sequence of such a religation event is “AAGCTAGCTT”. To prevent mapping ambiguities we searched for reads containing this sequence and trimmed them at the religation site keeping the first five bases (“AAGCT”). The following shell script does this for all files containing C1HW8ACXX E15 N14 13s002413*S in the rawInput folder calling the trim at ligationsite.py python script, which can be found in the appendix.

```{bash, eval=FALSE}

!/bin/bash

cd /g/huber/projects/TCC/data/fastq if [ ! -d trimmed ]; then mkdir trimmedNeural lineage induction reveals multi-scale dynamics of 3D chromatin organisation fi for f in rawInput/C1HW8ACXX_E15_N14_13s002413S.fastq.gz; do echo python2.7 trim_at_ligationsite.py --fastq $f --out trimmed done

### Alignment
The trimmed sequences were aligned to the mm9 genome
(ftp://ftp.ensembl.org/pub/release-67/fasta/mus musculus/dna/Mus musculus.NCBIM37.67.dna.toplevel.fa.gz)
using Bowtie2 with the default settings (version 2.0.0-beta7, sensitive end-to-end alignment, 0 missmatches in seed
alignment, 22 nt seed substring, best alignment reported).
The following shell script does this for all files containing C1HW8ACXX E15 N14 13s002413*S in the trimmed folder
calling the bowtie2.

```{bash, eval=FALSE}
#!/bin/bash
cd /g/huber/projects/TCC/data/
for f in trimmed/C1HW8ACXX_E15_N14_13s002413*.fastq.gz; do
echo bowtie2 -x /path/to/bowtie2-index/NCBIM37.67 -U $f -S sam/bowtie2/ basename $f .fastq.gz .sam
done

Creation of fam files

The mm9 reference genome was cut in-silico using the restriction enzyme cutting sequences to generate a reference set of predicted restriction fragments. The reads were then mapped to these fragments. The following shell script does this for all files containing C1HW8ACXX E15 N14 13s002413*S in the sam/bowtie2 folder folder calling the createFAM.py python script. Description of the createFAM.py script It is a script that creates a FAM file after merging and filtering alignment data from two SAM files (ends of a read).

Usage:

python2.7-with-htseq createFAM.py filename 1.sam filename 2.sam [genomeName.fasta | genomeNameCuttingSeq.pckl] file 1.sam, file 2.sam - alignment output files; their filenames have to include the “ 1” and “ 2” strings (direction of sequencing) genomeName.fasta - FASTA file containing the genome sequence (with arbitrary filename extension) genomeNameCuttingSeq.pckl - PCKL file containing genome sequences cut with the restriction sequence. If not provided, the PCKL file will be created based on the above ‘genomeName.fasta’ file containing the genome sequence

Output:

The result of the createFam script is a FAM file whose each row contains information about one read pair, such as: read name (“rname”), first fragment’s chromosome name (“chr1”), first fragment’s alignment position (“apos1”), first fragment’s strand direction (“astrand1”), first fragment’s start position (“fstart1”), first fragment’s end position (“fend1”), second fragment’s chromosome name (“chr2”), second fragment’s alignment position (“apos2”), second fragment’s strand direction (“astrand2”), second fragment’s start position (“fstart2”), second fragment’s end position (“fend2”). The aforementioned labels in brackets are FAM file column titles included in its header.

Result:

The output FAM file (as well as genomeNameCuttingSeq.pckl if needed) is created in the same directory where createFAM.py exists. Progress and information how many reads were omitted are reported. A read is omitted if it is: Neural lineage induction reveals multi-scale dynamics of 3D chromatin organisation • not-aligned • badly aligned (score < 10) • overlapping restriction site (when overlap >6 bases)

Further preprocessing in R

set.seed(777)
options(digits=3, width=100)
options( prompt="> ", continue="+" )

## adjust this line to your own environment
test.dir = file.path("/g/huber/users/klaus/Data/Aleks/Test_Workflows/testVignette")
data.path = test.dir
setwd(test.dir)
source("functions.R")
library("qvalue")
library("vioplot")
library("DESeq")
library("IRanges")
library("locfit")
library("Matrix")
library("Biobase")
library("biomaRt")
library("RColorBrewer")
library("vioplot")
library("geneplotter")
library("genefilter")
library("Rsamtools")

References

AnnikaGable/chromium documentation built on May 5, 2019, 6:04 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

AnnikaGable/chromium
Analyze and visualize high throughput chromatin interaction data, such as Hi-C or ChIA-PET

In AnnikaGable/chromium: Analyze and visualize high throughput chromatin interaction data, such as Hi-C or ChIA-PET

Import

Using input data of the `InteractionSet` class

Using input data generated by the pre-processing scripts from the Appendix

Using toy example data files

Inspecting the InteractionSet object

Subsetting the InteractionSet

Binning

Normalization

Visualization

Smoothing of the interaction matrices

Appendix

Preprocessing of raw sequencing reads with Python

Trimming at ligation site

!/bin/bash

Creation of fam files

Usage:

Output:

Result:

Further preprocessing in R

References

R Package Documentation

Browse R Packages

We want your feedback!

AnnikaGable/chromium Analyze and visualize high throughput chromatin interaction data, such as Hi-C or ChIA-PET

In AnnikaGable/chromium: Analyze and visualize high throughput chromatin interaction data, such as Hi-C or ChIA-PET

Import

Using input data of the InteractionSet class

Using input data generated by the pre-processing scripts from the Appendix

Using toy example data files

Inspecting the InteractionSet object

Subsetting the InteractionSet

Binning

Normalization

Visualization

Smoothing of the interaction matrices

Appendix

Preprocessing of raw sequencing reads with Python

Trimming at ligation site

!/bin/bash

Creation of fam files

Usage:

Output:

Result:

Further preprocessing in R

References

R Package Documentation

Browse R Packages

We want your feedback!

AnnikaGable/chromium
Analyze and visualize high throughput chromatin interaction data, such as Hi-C or ChIA-PET

Using input data of the `InteractionSet` class