chiaSim: chiaSim

View source: R/ChIASim.R

chiaSimR Documentation

chiaSim

Description

This function simulates a set of data of chromosomal 3D interactions mediated by a specific protein which meant to mimicing the output data of ChIA-PET or Hi-ChIP.

Usage

chiaSim(TFBSfile="ERa", TSSfile=NULL, mean.frag.length = 250, 
	n.cells = 1E3, lp = 0.8, seqlength = 50, N.E=100, N.P=100, N.nE=100, N.nP=100,
	REp = "AAGCTT", mc.cores.user = 1, beta = 1, gmclass = 1, 
	outputformat= "base", seed = NULL)

Arguments

TFBSfile

TFBSfile is a dataframe containing TFBS information for a specific protein of interest. The first three columns provide the chromosome, start site, and end site of each TFBS for this protein, respectively. Package has two buil-in two TFBS datasets, "ERa" and "POL2". If the user would like to use different protein, then a user-supplied TFBS file with the required three-column format and infomation are needed.

TSSfile

TSSfile is a dataframe containing the transcription start site (TSS) information. Users do not need to provide TSS file as the package have the TSS information of hg19 that obtained from Refseq (Pruitt et al, 2014). However if users do provide a TSS file, it should follow the format as show in the following example.

mean.frag.length

mean.frag.lengthis the mean chromosome fragment length for sonication simulation. To simulate the sonication step in a real ChIA-PET experiment that cuts choromosomes into pieces, we need to specify the average length of fragments. The default is set to be 250bp.

n.cells

n.cells is the simulated number of cells which may be interpreted as the sequencing depth. By adjusting n.cells, different numbers of pairs can result. The default is set to be 10^5.

lp

The ligation probability. The probability of two segments to ligate. It is set to 0.8 by default.

seqlength

seqlength is the length of the fragments to be seqenced. ChIP-PET seqences the two end segments of an interacting pair. The ideal length of PET seqencing is 50-100bp (Fullwood et al, 2009), hence it is set to be 50bp for the default seqencing length.

N.E

is the numbers of TFBS sites randomly selected ($N_E$) from the provided TFBS file for forming true pairs for simulation. The selected TFBS sites are allocated to each choromosome proportional to their lengths. It is set to 400 by default.

N.P

is the numbers of TSS sites randomly selected ($N_P$) from the provided TSS file for forming true pairs for simulation. The selected TSS sites are allocated to each choromosome proportional to its lengths. It is set to 400 by default.

N.nE

is the numbers of TFBS sites randomly selected ($N_\overlineE$) across the whole genome for forming the false pairs for simulation. The selected TFBS sites are allocated to each choromosome proportional to their lengths. It is set to 400 by default.

N.nP

is the numbers of TSS sites randomly selected ($N_\overlineP$) across the whole genome for forming the false pairs for simulation. The selected TSS sites are allocated to each choromosome proportional to their lengths. It is set to 400 by default.

REp

Restriction Enzyme Sequence pattern. Default is set as "AAGCTT", the pattern of hindIII.

mc.cores.user

The number of cores used to do parallel computing when multi cores are available at the user's end.

beta

beta is the power parameter. Power parameter is the power based on the widely believed power law between interaction frequency and 1/log(genomic distance). It is usually set to 1 for human (Fudenberg and Mirny, 2012) and set as the default value.

gmclass

gmclass is a flagging variable on whether to identify self-ligation pairs. When it is set to 2, all resulting interaction pairs will be clusted into 2 groups according to their genomic distances; the cluster with smaller genomic distance will be treated as self-ligation and removed. When self-ligation is not a major issue, it is set to 1 which is also the default.

outputformat

outputformat is the format of output simulated data. Users can choose either "base" or "BEDPE".

seed

The seed number. It may be set for reproducibility purpose. Otherwise, it should be set as null.

Value

A simulated ChIA-PET data set.

References

Mumbach, M., Rubin, A., Flynn, R. et al. HiChIP: efficient and sensitive analysis of protein-directed genome architecture. Nat Methods 13, 919-922 (2016).

Fudenberg, Geoffrey, and Leonid A Mirny. "Higher-order chromatin structure: bridging physics and biology." Current opinion in genetics & development vol. 22,2 (2012). Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, DiCuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014 Jan;42(Database issue):D756-63. doi: 10.1093/nar/gkt1114. Epub 2013 Nov 19. PMID: 24259432; PMCID: PMC3965018.

Pruitt, K. D., G. R. Brown, S. M. Hiatt, F. Thibaud-Nissen, A. Astashyn, O. Ermolaeva, C. M. Farrell, J. Hart, M. J. Landrum, K. M. McGarvey, M. R. Murphy, N. A. O'Leary, S. Pujar, B. Rajput, S. H. Rangwala, L. D. Riddick, A. Shkeda, H. Sun, P. Tamez, R. E. Tully, C. Wallin, D. Webb, J. Weber, W. Wu, M. DiCuccio, P. Kitts, D. R. Maglott, T. D. Murphy and J. M. Ostell (2014): "Refseq: an update on mammalian reference sequences," Nucleic Acids Res., 42, D756-D763.

Fullwood MJ, Wei CL, Liu ET, Ruan Y. Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res. 2009;19(4):521-532. doi:10.1101/gr.074906.107

Examples

data("POL2")
data("TSS")
output.base<- chiaSim(n.cells = 1E3, N.E=100, N.P=100, N.nE=100, N.nP=100, mc.cores.user = 48)
head(output.base)

sy-lou/ChIASim documentation built on April 15, 2023, 11 p.m.