knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

Many statistical methods and tools have been developed to analyze ChIA-PET, PLAC-seq and Hi-ChIP data over the past decade. While each new method typically claims superiority than existing ones, accessment of fairness in such studies is not easy without common benchmarking data sets.

The goal of this package is to provide a mean to integrate benchomarking data sets for fair comarison among studies. In this package, simulated data are generated by imitating biological experimental procedures. The simulation algorithm is designed by mimicking the major steps of the ChIA-PET experiment. Therefore, objective data sets without bias for any of the methods can be generated and used to compare the performance of methods.

Both ChIA-PET and Hi-ChIP are designed to detect chromatin interactions, specifically enhancer-promoter interactions, where a promoter is marked by a transcription starting site (TSS) and a enhancer region is marked by a transcrition factor binding site (TFBS). Although these two technologies differ in efficiency of cell usage (Mumbach et al, 2016), the resulted data contain the same information. Hence, we use ChIA-PET as an example, although the data generated may be considered as an in silico counterpart of either experimental protocol. Specifically, our in silico procedure consists of the following five steps.

Step 1: Crosslinking. To imitate this step in ChIA-PET, TFBS and TSS sites are randomly selected from the database with the numbers of sites allocated to each chromosome set to be the same proportion as that in the data base. To induce random collision, we also randomly selected sites across the whole genome that neither are actual TFBS or TSS, i.e., non-specific, with the number of sites selected in each chromosome also proportional to chromosome size. Then these selected TFBS sites (regardless of known or non-specific) are paired with selected TSS sites (regardless of known or non-specific). Depends on users' interest, different proteins can be chosen such that selected TFBS are associated with the protein of choice. Those pairs with both end from actual TFBS or TSS are labeled as true pair, and labeled as false pairs other wise.

Step 2: Restriction enzyme digestion. Find all segment pattern of certain restriction enzyme, use these segments as cut sites to break chromosome into pieces.

Step 3: Ligation. These pieces are paired with one TFBS piece matching one TSS piece and forming a loop by connecting the ends.

Step 4: Sonication. Poisson process is applied to the loops sampled in the previous steps, and the loops are cut into segments; and the segments containing the connecting sites are kept.

Step 5: Paired-end Sequencing: Reads of 50 bps from the two ends of the kept segments are obtained and then mapped to the reference genome.

This package contains the following two functions:

Installation

ChIASim can be installed from GitHub and installed by devtools::install_git("https://github.com/sl-lin/ChIASim")

ChIASim depends on other packages, i.e., parallel, BSgenome.Hsapiens.UCSC.hg19, and GenomicFeatures. They need to be installed, one may use the commands below, in advance to running ChIASim.

if (!requireNamespace("parallel", quietly = TRUE)) install.packages("parallel")

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")

if (!requireNamespace("BSgenome.Hsapiens.UCSC.hg19", quietly = TRUE)) BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")

if (!requireNamespace("GenomicFeatures", quietly = TRUE)) BiocManager::install("GenomicFeatures")

ChIASim for data generation

$chiaSim(TFBSfile="POL2", TSSfile=NULL, mean.frag.length = 250, n.cells = 1E5,lp = 0.8, seqlength = 50, N.E=400, N.P=400, N.nE=400, N.nP=400, REp = "AAGCTT", mc.cores.user = 24, beta = 1, gmclass = 1, outputformat= "base", seed=NULL)$

Input data format and other parameters for chiaSim

$\textbf{TFBSfile}$ is a dataframe containing TFBS information for a specific protein of interest. The first three columns provide the chromosome, start site, and end site of each TFBS for this protein, respectively. Package has two buil-in two TFBS datasets, "ERa" and "POL2". If the user would like to use different protein, then a user-supplied TFBS file with the required three-column format and infomation are needed.

Here is the build-in TFBS file for POL2:

library(ChIASim)
data("POL2")
head(POL2)

$\textbf{TSSfile}$ is a dataframe containing the transcription start site (TSS) information. Users do not need to provide TSS file as the package have the TSS information of hg19 that obtained from Refseq (Pruitt et al, 2014). However if users do provide a TSS file, it should follow the format as show in the following example.

data("TSS")
head(TSS)

In this example, the first column is the gene ID for each TSS. The second column denotes the chromosome in which the gene is located. The third column is strand orientation: + denote forward and - denote reverse. The forth column and the fifth column are the TSS for the forward and reverse strands, respectively. So if a read is from a forward strand, then the TSS is the number in the forth column; otherwise, the TSS is the number in the fifth column. The sixth column is the gene name.

$\textbf{mean.frag.length}$ is the mean chromosome fragment length for sonication simulation. To simulate the sonication step in a real ChIA-PET experiment that cuts choromosomes into pieces, we need to specify the average length of fragments. The default is set to be 250bp.

$\textbf{n.cells}$ is the simulated number of cells which may be interpreted as the sequencing depth. By adjusting n.cells, different numbers of pairs can result. The default is set to be 10^5.

$\textbf{lp}$ is the ligation probability. The probability of two segments to ligate. It is set to 0.8 by default.

$\textbf{seqlength}$ is the length of the fragments to be seqenced. ChIP-PET seqences the two end segments of an interacting pair. The ideal length of PET seqencing is 50-100bp (Fullwood et al, 2009), hence it is set to be 50bp for the default seqencing length.

$\textbf{N.E}$ is the numbers of TFBS sites randomly selected ($N_E$) from the provided TFBS file for forming true pairs for simulation. The selected TFBS sites are allocated to each choromosome proportional to their lengths. It is set to 400 by default.

$\textbf{N.P}$ is the numbers of TSS sites randomly selected ($N_P$) from the provided TSS file for forming true pairs for simulation. The selected TSS sites are allocated to each choromosome proportional to its lengths. It is set to 400 by default.

$\textbf{N.nE}$ is the numbers of TFBS sites randomly selected ($N_{\overline{E}}$) across the whole genome for forming the false pairs for simulation. The selected TFBS sites are allocated to each choromosome proportional to their lengths. It is set to 400 by default.

$\textbf{N.nP}$ is the numbers of TSS sites randomly selected ($N_{\overline{P}}$) across the whole genome for forming the false pairs for simulation. The selected TSS sites are allocated to each choromosome proportional to their lengths. It is set to 400 by default.

$\textbf{REp}$ is the pattern of restricted enzyme. It is set to AAGCTT by default, which is the pattern of HindIII.

$\textbf{mc.cores.user}$ is the number of cores used to do parallel computing when multi cores are available at the user's end.

$\textbf{beta}$ is the power parameter. Power parameter is the power based on the widely believed power law between interaction frequency and 1/log(genomic distance). It is usually set to 1 for human (Fudenberg and Mirny, 2012) and set as the default value.

$\textbf{gmclass}$ is a flagging variable on whether to identify self-ligation pairs. When it is set to 2, all resulting interaction pairs will be clusted into 2 groups according to their genomic distances; the cluster with smaller genomic distance will be treated as self-ligation and removed. When self-ligation is not a major issue, it is set to 1 which is also the default.

$\textbf{outputformat}$ is the format to output the simulated data. Users can choose either "base" or "BEDPE". We provide these two formats becasue they are the two most popular data formats for real data. For example, a K562 CTCF ChIA-PET data with accession number GSM970216 uses the BEDPE format, while a K562 POL2 ChIA-PET data with accession number GSM832465 uses a 7-column base format. Defualt set to be base format. We provide two examples in the following, one with the base format and the other with the BEDPE format.

$\textbf{seed}$ The seed number. It may be set for reproducibility purpose. Otherwise, it should be set as null.

#The parameters are set to be smaller than default for a quick illustration.
#base format
output.base<- chiaSim(n.cells = 500, N.E=100, N.P=100, N.nE=100, N.nP=100,  mc.cores.user = 1)
head(output.base)

The first seven columns are the 7-column base format, which provide the location information for two anchors of an interacting pair plus the interaction count. LT column is the ligation type, which tells the category of each pair, TL (true high), TL (true low) and F (false). MC column is the margincal counts which is the sum of counts of two anchors of a pair interacting with with all anchors in the data set. Mdist tells the minimum distance of each pair to their closes pair of TFBS and TSS. Gdist is the genomic distance between the midpoints of two interacting anchors.

#BEDPE format
output.BEDPE<- chiaSim(n.cells = 500, N.E=100, N.P=100, N.nE=100, N.nP=100, mc.cores.user = 1, outputformat="BEDPE")
head(output.BEDPE[[1]])
head(output.BEDPE[[2]])

The first element in the output list, $output.BEDPE[[1]]$, show the BEDPE data format, which provide the location information for two anchors of an interacting pair plus the interaction count. The second element in the output list, $output.BEDPE[[2]]$, are the information provide from simulation procedure, i.e., LT, MC, Mdist and Gdist, containing the information as described above.

Conversion between the two data formats

$\textbf{convertfmt}$: This function allows user to convert the output data from one format to the onther format without having to re-run the whole simulation program.

$convertfmt(input, inputformat)$

$\textbf{input}$ is the dataset which needs to be converted to a different format.

$\textbf{inputformat}$ is the format of the data set to be converted. It can be either "base" or "BEDPE". If it is "base", it will be conveted to BEDPE, vice versa.

#Converting base to BEDPE
output<- convertfmt(input= output.base, inputformat="base")
head(output[[1]])
head(output[[2]])
#Converting BEDPE to base
output<- convertfmt(input= output.BEDPE, inputformat="BEDPE")
head(output)

References

Mumbach, M., Rubin, A., Flynn, R. et al. HiChIP: efficient and sensitive analysis of protein-directed genome architecture. Nat Methods 13, 919-922 (2016).

Fudenberg, Geoffrey, and Leonid A Mirny. "Higher-order chromatin structure: bridging physics and biology." Current opinion in genetics & development vol. 22,2 (2012).

Pruitt, K. D., G. R. Brown, S. M. Hiatt, F. Thibaud-Nissen, A. Astashyn, O. Ermolaeva, C. M. Farrell, J. Hart, M. J. Landrum, K. M. McGarvey, M. R. Murphy, N. A. O'Leary, S. Pujar, B. Rajput, S. H. Rangwala, L. D. Riddick, A. Shkeda, H. Sun, P. Tamez, R. E. Tully, C. Wallin, D. Webb, J. Weber, W. Wu, M. DiCuccio, P. Kitts, D. R. Maglott, T. D. Murphy and J. M. Ostell (2014): "Refseq: an update on mammalian reference sequences," Nucleic Acids Res., 42, D756-D763.

Fullwood MJ, Wei CL, Liu ET, Ruan Y. Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res. 2009;19(4):521-532. doi:10.1101/gr.074906.107



sy-lou/ChIASim documentation built on April 15, 2023, 11 p.m.