MACPETUlt: Paired-end Tag (PET) Analysis Function.
In MACPET: Model based analysis for paired-end data

Description Usage Arguments Details Value Stages description Parallel Author(s) References See Also Examples

MACPETUlt is used for running analysis based on paired-end DNA data, including stages for linker removal, mapping to the reference genome, PET classification, binding site identification and interaction analysis.

MACPETUlt(SA_AnalysisDir = "", SA_stages = c(0:3),
  SA_prefix = "MACPET", S0_fastq1 = "", S0_fastq2 = "",
  S0_LinkerA = "GTTGGATAAG", S0_LinkerB = "GTTGGAATGT",
  S0_MinReadLength = 18, S0_MaxReadLength = 50,
  S0_LinkerOccurence = 0, S0_image = TRUE, S0_fastqStream = 2e+06,
  S1_fastq1_usable_dir = "", S1_fastq2_usable_dir = "",
  S1_image = TRUE, S1_BAMStream = 2e+06, S1_makeSam = TRUE,
  S1_genome = "hg19", S1_RbowtieIndexBuild = FALSE,
  S1_RbowtieIndexDir = "", S1_RbowtieIndexPrefix = "",
  S1_RbowtieRefDir = "", S2_PairedEndBAMpath = "", S2_image = TRUE,
  S2_BlackList = TRUE, S3_fileSelfDir = "", S3_image = TRUE,
  S3_method = "BH", S4_filePSFitDir = "", S4_filePIntraDir = "",
  S4_filePInterDir = NULL, S4_FDR_peak = 0.1, S4_method = "BH",
  S4_image = TRUE, S4_minPETs = 2, S4_PeakExt = 500)

`SA_AnalysisDir`	A directory were all the ouput is to be saved. This parameter is mandatory for every stage.
`SA_stages`	Numeric vector or integer (if stages are run separately). This parameter is mandatory for every stage (see details).
`SA_prefix`	A string which is going to be used as prefix for the outputs (default: 'MACPET'). This parameter is mandatory for every stage.
`S0_fastq1`	A string with the directory of the 5-end fastq (or fastq.gz) file. This parameter is mandatory if Stage 0 is run (see details).
`S0_fastq2`	A string with the directory of the 3-end fastq (or fastq.gz) file. This parameter is mandatory if Stage 0 is run (see details).
`S0_LinkerA`	A string with the first linker sequence (default 'GTTGGATAAG'). This parameter is mandatory if Stage 0 is run (see details).
`S0_LinkerB`	A string with the second linker sequence (default 'GTTGGAATGT'). This parameter is mandatory if Stage 0 is run (see details).
`S0_MinReadLength`	A positive integer with the minimum read length after linker trimming (default: 18). This parameter is mandatory if Stage 0 is run (see details).
`S0_MaxReadLength`	A positive integer with the maximum read length after linker trimming (default: 50). This parameter is mandatory if Stage 0 is run (see details).
`S0_LinkerOccurence`	One of the following: 0, 1, 2, 3, 4. This parameter defines the linker-occurence mode (see details). Default 0.
`S0_image`	Logical, indicating if a pie-chart image for the fastq files classification will be produced (default=TRUE). This parameter is mandatory if Stage 0 is run.
`S0_fastqStream`	Positive integer for total lines of fastq files to be loaded in R (best to leave it at default because it might cause memory crash). This parameter is mandatory if Stage 0 is run.
`S1_fastq1_usable_dir`	String with the directory of the 5-end usable fastq (or fastq.gz) files. This parameter might not be mandatory (see details).
`S1_fastq2_usable_dir`	String with the directory of the 3-end usable fastq (or fastq.gz) files. This parameter might not be mandatory (see details).
`S1_image`	Logical indicating if images for the mapping percentage and the pairing percentage will be produced (default=TRUE). This parameter is mandatory if Stage 1 is run.
`S1_BAMStream`	Positive integer for the total number of bam file lines to be loaded in R in a loop for pairing (best to leave it at default because it might cause memory crash). This parameter is mandatory if Stage 1 is run.
`S1_makeSam`	Logical indicating whether the resulted paired-end BAM file will be splitted to two SAM files (one for each read). The output SAM files can be used as input in the MANGO algorithm (default=TRUE). Note, that the user has to remove the SAM header before running MANGO. This parameter is mandatory if Stage 1 is run.
`S1_genome`	String with the genome to be used in the bam file header (default='hg19'). This parameter is mandatory if Stage 1 is run (see details).
`S1_RbowtieIndexBuild`	Logical indicating whether you want to build the bowtie index or not (default=FALSE). This parameter is mandatory if Stage 1 is run (see details).
`S1_RbowtieIndexDir`	String with the directory of the bowtie index (if `S1_RbowtieIndexBuild==FALSE`) or with the directory where the bowtie index will be saved (if `S1_RbowtieIndexBuild==TRUE`). This parameter is mandatory if Stage 1 is run (see details).
`S1_RbowtieIndexPrefix`	String with the prefix for the bowtie indeces in `S1_RbowtieIndexDir` (see details). This parameter is mandatory if Stage 1 is run (see details).
`S1_RbowtieRefDir`	A vector with the directories of the .fa files, used if `S1_RbowtieIndexBuild==TRUE`. This parameter is mandatory if Stage 1 is run and `S1_RbowtieIndexBuild==TRUE` (see details).
`S2_PairedEndBAMpath`	A string with the directory of the paired-end bam file (or paired-end sam file). This parameter might not be mandatory (see details).
`S2_image`	Logical indicating whether images for the s elf-ligated/intra-chromosomal cut-off as well as pie-charts for the PET classification will be produced (default=TRUE). This parameter is mandatory if Stage 1 is run.
`S2_BlackList`	Logical indicating whether black-listed regions will be removed from the data based on the `S1_genome` parameter (see details). Alternatively a `GRanges` object with the user specified regions. This parameter is mandatory if Stage 2 is run.
`S3_fileSelfDir`	A string with the directory of the of the object of class `PSelf`. This parameter might not be mandatory (see details).
`S3_image`	Logical indicating whether images for the binding site's FDR, sizes of the binding sites, sizes of binding site's upstream/downstream peaks will be created. This parameter is mandatory if Stage 3 is run.
`S3_method`	String with the FDR method used for finding p-values of significant peaks in the data. See `p.adjust.methods` (default= 'BH'). This parameter is mandatory if Stage 3 is run.
`S4_filePSFitDir`	A string with the directory of the object of class `PSFit`. This parameter is not mandatory if Stage 3 is run right before Stage 4.
`S4_filePIntraDir`	A string with the directory of the object of class `PIntra`. This parameter is not mandatory if Stages 2:3 are run right before Stage 4.
`S4_filePInterDir`	A string with the directory of the object of class `PInter`. This parameter is not mandatory if Stages 2:3 are run right before Stage 4. NOTE: Currently, inter-chromosomal interactions are not supported, so this parameter is ignored.
`S4_FDR_peak`	A numeric for the FDR cut-off value of significant peaks to be used in interaction analysis (default 0.1 because MACPET is more strict in peak-calling than MACS, higher threshold than 0.1 might also be appropriate). This parameter is mandatory if Stage 4 is run.
`S4_method`	String with the FDR method used for finding p-values of significant interactions in the data. See `p.adjust.methods` (default= 'BH'). This parameter is mandatory if Stage 4 is run.
`S4_image`	Logical indicating whether images for the total peaks and total intra- and inter-chromosomal PETs used in the interaction analysis will be created. This parameter is mandatory if Stage 4 is run.
`S4_minPETs`	The minimum total PETs allowed between every two peaks (default=2). This parameter is mandatory if Stage 4 is run.
`S4_PeakExt`	Integer for the number of bp to extend each peak on either side (default 500). This parameter will both affect the interaction PETs falling into the peak and the merging of overlapping peaks. This parameter is mandatory if Stage 4 is run.

Every stage has parameters associated with it. Parameters with prefix SA correspond to all stages, S0 to Stage 0, S1 to Stage 1 etc. Parameters with SA prefix are mandatory for every stage.

If SA_stages parameter is given as vector, then the vector has to be continuous, that is for example c(0:3) or c(2:3), not c(0,2,3). In general the best practice is to run all the stages at once.

The fastq files in S0_fastq1 and S0_fastq2 have to be of same length and be sorted by their ID. Furthermore, the IDs in S0_fastq1 have to end with /1 and the ones in S0_fastq2 with /2, representing the 5- and 3-end tags respectively. In other words, for the same line in S0_fastq1 and S0_fastq2, their IDs have to be identical, except form their suffixes /1 and /2 respectively. Moreover, the '/' symbol can be replaced with any other symbol, this will not cause any problems.

S0_LinkerOccurence parameter defines the linker-occurence mode and separates the usable from the ambiguous PETs. PETs with both reads including linkers are not affeted by S0_LinkerOccurence. Also, reads which do not meet the S0_MaxReadLength/S0_MinReadLength lengths, are moved to ambiguous anyway. The four values of S0_LinkerOccurence are:

Mode 0:: Both reads have to include a linker in order to be checked as usable or chimeric, if they dont, they are moved to ambiguous.
Mode 1:: If read 1 is not matching any linker, but read 2 does, then the PET will be moved to usable.
Mode 2:: If read 2 is not matching any linker, but read 1 does, then the PET will be moved to usable.
Mode 3:: If any of the reads does not match any linker then the PET they will be moved to usable.
Mode 4:: If both reads do not match any of the linkers, then the PET will be moved to usable.

S0_MaxReadLength has to be greater than S0_MinReadLength. The user should leave those two at default unless the PET data is produced by tagmentation.

S1_fastq1_usable_dir and S1_fastq2_usable_dir are not mandatory if Stage 0 is run right before Stage 1 (SA_stages=c(0,1)). Those two are only mandatory if Stage 1 is run separately. Then those parameters assume to have the usable reads only. The same fastq specifications apply as those for S0_fastq1 and S0_fastq2.

The parameter S1_genome is very important. First the genome name given in S1_genome should be the same as the one used for building the bowtie index for mapping. This parameter will add an 'AS' column to the paired-end bam file with the genome information. In Stage 2, this header will be used for identifying which kind on black-listed regions to use if S2_BlackList==TRUE.

If S1_RbowtieIndexBuild==FALSE then the bowtie index is assumed to be already built and saved in S1_RbowtieIndexDir. Then the S1_RbowtieIndexDir folder should include the following files: S1_RbowtieIndexPrefix.1.ebwt, S1_RbowtieIndexPrefix.2.ebwt, S1_RbowtieIndexPrefix.3.ebwt, S1_RbowtieIndexPrefix.4.ebwt, S1_RbowtieIndexPrefix.rev.1.ebwt and S1_RbowtieIndexPrefix.rev.2.ebwt, or with .ebwtl. Where S1_RbowtieIndexPrefix is also given as input.

If S1_RbowtieIndexBuild==TRUE then the bowtie index will be build using the bowtie_build function. This function will need the .fa files which should be given as input in the S1_RbowtieRefDir vector. This is a character vector with the directories of the .fa files to use. The output index will be saved in S1_RbowtieIndexDir. if S1_RbowtieIndexBuild==FALSE then S1_RbowtieRefDir can be an empty string.

The parameter S2_PairedEndBAMpath has to be specified only if Stage 2 is run without running Stage 1 right before (SA_Stages=c(2) or c(2,3), not c(1,2) or c(0,1,2) for example). If this is the case, the S2_PairedEndBAMpath has to be the path to the BAM/SAM paired-end file. The file has to include the header with the 'SN', 'LN' and 'AS' columns. Moreover the mate flags of the file have to be correct and also the duplicated PETs must be flagged too. Stage 2 will upload the whole data in R using readGAlignmentPairs function with flags isDuplicate=FALSE and isPaired=TRUE. So if duplicated PETs are not flagged, they will be used in the analysis. If the previous stages are run in sequence, then S2_PairedEndBAMpath will be overwritten with the newly created BAM file, which will have the correct flags.

If S2_BlackList==TRUE then which genome black-list is going to be used is decided by the 'AS' column in the S2_PairedEndBAMpath file, which is specified by the S1_genome if Stage 1 is also run. The black-listed regions cover the following genomes: 'hg19', 'ce10', 'dm3', 'hg38', 'mm9', 'mm10'. If the 'AS' header column is missing from the S2_PairedEndBAMpath file, or if the S1_genome is not matching any of the above named genomes, then a warning will be produced saying that no black-listed regions will be removed. Alternatily, the user can provide its own black-listed regions as a GRanges object.

The parameter S3_fileSelfDir is not mandatory if the stages are run in sequence, if Stage 2 is run right before stage 3. If this is the case then S3_fileSelfDir will be overwritten with the data produced in Stage 2. If Stage 3 is run separately, then S3_fileSelfDir has to be provided. It should be a PSelf object and both the name of the object in the directory and the one uploaded in R should be SA_prefix_pselfData.

All outputs are saved at the SA_AnalysisDir. The output depents of the stages run:

Stage 0: (outputs saved in a folder named S0_results in SA_AnalysisDir)

SA_prefix_usable_1.fastq.gz:: fastq.gz files with the usable 5-end tags. To be used in Stage 1.
SA_prefix_usable_2.fastq.gz:: fastq.gz files with the usable 3-end tags. To be used in Stage 1.
SA_prefix_chimeric_1.fastq.gz:: fastq.gz files with the chimeric 5-end tags.
SA_prefix_chimeric_2.fastq.gz:: fastq.gz files with the chimeric 3-end tags.
SA_prefix_ambiguous_1.fastq.gz:: fastq.gz files with the ambiguous 5-end tags.
SA_prefix_ambiguous_2.fastq.gz:: fastq.gz files with the ambiguous 3-end tags.
SA_prefix_stage_0_image.jpg:: Pie chart image with the split of two fastq files used as input (if S0_image==TRUE).

Stage 1: (outputs saved in a folder named S1_results in SA_AnalysisDir)

SA_prefix_usable_1.sam:: sam file with the mapped 5-end reads (if S1_makeSam==FALSE).
SA_prefix_usable_2.sam:: sam file with the mapped 3-end reads (if S1_makeSam==FALSE).
SA_prefix_Paired_end.bam:: paired-end bam file with the mapped PETs. To be used in Stage 2
SA_prefix_Paired_end.bam.bai:: .bai file for SA_prefix_Paired_end.bam. To be used in Stage 2.
SA_prefix_stage_1_p1_image.jpg:: Pie-chart for the mapped/unmapped reads from SA_prefix_usable_1.sam, SA_prefix_usable_2.sam (if S1_image==TRUE).
SA_prefix_stage_1_p2_image.jpg:: Pie-chart for the paired/unpaired reads of SA_prefix_Paired_end.bam (if S1_image==TRUE).

Stage 2: (outputs saved in a folder named S2_results in SA_AnalysisDir)

SA_prefix_pselfData:: An object of PSelf class with the Self-ligated PETs. To be used in Stage 3.
SA_prefix_pintraData:: An object of PIntra class with the Intra-chromosomal PETs.
SA_prefix_pinterData:: An object of PInter class with the Inter-chromosomal PETs.
SA_prefix_stage_2_p1_image.jpg:: Pie-chart reliable/dublicated/black-listed PETs of SA_prefix_Paired_end.bam (if S2_image==TRUE).
SA_prefix_stage_2_p2_image.jpg:: Histogram with the self-ligated/intra-chromosomal cut-off for SA_prefix_Paired_end.bam (if S2_image==TRUE).
SA_prefix_stage_2_p3_image.jpg:: Pie-chart for the self-ligated/intra-chromosomal/inter-chromosomal PETs of SA_prefix_Paired_end.bam (if S2_image==TRUE).

Stage 3: (outputs saved in a folder named S3_results in SA_AnalysisDir)

SA_prefix_psfitData:: An object of PSFit class with the peak information.
SA_prefix_stage_3_p1_image.jpg:: Sizes of the upstream vs downstream peaks of each binding site given the binding site's FDR (if S3_image==TRUE).
SA_prefix_stage_3_p2_image.jpg:: FDR of the binding sites. The horizontal red line is at FDR=0.05 (if S3_image==TRUE).
SA_prefix_stage_3_p3_image.jpg:: Comparison of binding site sizes given their FDR (if S3_image==TRUE).
SA_prefix_stage_3_p3_image.jpg:: FDR for the upstream/donwstream peaks of the binding sites given the binding sites FDR (if S3_image==TRUE).

Stage 4: (outputs saved in a folder named S4_results in SA_AnalysisDir)

SA_prefix_GenomeMapData:: An object of GenomeMap class with the interactions information.
SA_prefix_stage_4_p1_image.jpg:: Pie charts with the total peaks involved in the interactions and the total intra/inter-chromosomal PETs involved in the interactions (if S4_image==TRUE).

Stage 0:4 :

All the above outputs. Furthermore, a log file named SA_prefix_analysis.log is always created in SA_AnalysisDir with information about the process.

MACPETUlt runs a complete or partial analysis for PET data, depending on the stages of the analysis the user wants to run. The stages of the analysis are the following:

Stage 0:: Linker identification stage: This stage uses the two fastq files for the 5- and 3-end tags and identifies which tags contain any of the linkers. Based on the linker combinations it classifies the PETs as usable (linkers A/A or B/B), chimeric (linkers A/B or B/A) and ambiguous (linkers non/A, non/B, A/non, B/non unless chosen otherwise by S0_LinkerOccurence, or be smaller/bigger than the S0_MinReadLength/S0_MaxReadLength after the linker removal, respectively). Only usable PETs are considered in the subsequent steps.
Stage 1:: PET mapping stage: This stage uses the usable PETs identified by stage 0. It maps them separately to the reference genome using the bowtie function with no mismatch per read, and keeps the uniquely mapped reads only. It then maps the unmapped reads to the reference genome with at most one mismatch and keeps the uniquely mapped reads. Uniquely mapped reads with zero or one mismatch are then merged and paired, their duplicates are marked and a paired-end bam file is created which is used in State 2.
Stage 2:: PET classification stage: This stage takes the BAM paired-end file from stage 1 and classifies the PETs as: Inter-chromosomal PETs (which connect two different chromosomes), Intra-chromosomal PETs (which connect regions of the same chromosome) and Self-ligated PETs (which are used for binding site analysis). Self-ligated PETs are used for finding the protein binding sites (peaks), while Intra- and Inter-chromosomal are used for interactions between the peaks. The algorithm uses the elbow-method to seperate the Self-ligated from the Intra-chromosomal population. Note that loading the data into R might take a while depending on the size of the data.
Stage 3:: Peak calling stage: This stage uses the Self-ligated PETs and it runs the EM algorithm to find clusters which represent candidate peaks/binding sites in 2 dimentional space using skewed generalized students-t distributions (SGT). After the peak-calling analysis is done, the algorithm assesses the significance of the candidate peaks using a local Poisson model.
Stage 4:: Interaction analysis stage: This stage uses the peaks found in stage 3 and the intra-chromosomal PETs found in stage 2 to discover interactions between those peaks. Inter-chromosomal PETs are not supported yet. It first estimates the expected number of PETs in each interaction under H0 (everything is random), based on the depth of the peaks which interact and their distance. It then uses the Poisson distribution to call for significant interactions. The resulting object will include all potential interactions found in the data. Then the function GetSignInteractions can be used for subsetting significant interactions. Therefore, this stage needs only to be run once, then the FDR cut-off of the most significant interactions can be given multiple times in the GetSignInteractions function. This stages will estimate the interactions in circles. In each circle the most significant interactions will be added to the model (those with the lowest FDR). Then the distances between the peaks of the significant interactions will become 0 and the rest of the distances (for the interactions not added yet) will be adjusted based on that. Then the second circle will run and so on.

All stages can be run in parallel using the register function. The user has to register a parallel backhead before starting the function.

Ioannis Vardaxis, ioannis.vardaxis@ntnu.no

Vardaxis I, Drabl<c3><b8>s F, Rye M and Lindqvist BH (2018). MACPET: Model-based Analysis for ChIA-PET. To be published.

Vardaxis I, Drabl<c3><b8>s F, Rye M and Lindqvist BH (2018). MACPET: Complete pipeline for ChIA-PET. To be published.

Consortium EP (2012) An integrated encyclopedia of DNA elements in the human genome.. Nature, 489(7414), pp. 57<e2><80><93>74. http://dx.doi.org/10.1038/nature11247.

PSelf, PIntra, PInter, PSFit, GenomeMap, summary, AnalysisStatistics, plot BiocParallel, ConvertToPSelf, exportPeaks, TagsToGInteractions, PeaksToGRanges, PeaksToNarrowPeak, ConvertToPE_BAM, GetSignInteractions, GetShortestPath

#Create a temporary forder, or anywhere you want:
SA_AnalysisDir=file.path(tempdir(),'MACPETtest')
dir.create(SA_AnalysisDir)#where you will save the results
#give directory of the BAM file:
S2_PairedEndBAMpath=system.file('extdata', 'SampleChIAPETData.bam', package = 'MACPET')

#give prefix name:
SA_prefix='MACPET'

#parallel backhead can be created using the BiocParallel package
#parallel backhead can be created using the BiocParallel package
#requireNamespace('BiocParallel')
#snow <- BiocParallel::SnowParam(workers = 4, type = 'SOCK', progressbar=FALSE)
#BiocParallel::register(snow, default=TRUE)

#-run for the whole binding site analysis:
MACPETUlt(SA_AnalysisDir=SA_AnalysisDir,
          SA_stages=c(2:3),
          SA_prefix=SA_prefix,
          S2_PairedEndBAMpath=S2_PairedEndBAMpath,
          S2_image=TRUE,
          S2_BlackList=TRUE,
          S3_image=TRUE)



#load results:
SelfObject=paste(SA_prefix,'_pselfData',sep='')
load(file.path(SA_AnalysisDir,'S2_results',SelfObject))
SelfObject=get(SelfObject)
class(SelfObject) # see methods for this class

IntraObject=paste(SA_prefix,'_pintraData',sep='')
load(file.path(SA_AnalysisDir,'S2_results',IntraObject))
IntraObject=get(IntraObject)
class(IntraObject) # see methods for this class

InterObject=paste(SA_prefix,'_pinterData',sep='')
load(file.path(SA_AnalysisDir,'S2_results',InterObject))
InterObject=get(InterObject)
class(InterObject) # see methods for this class

SelfFitObject=paste(SA_prefix,'_psfitData',sep='')
load(file.path(SA_AnalysisDir,'S3_results',SelfFitObject))
SelfFitObject=get(SelfFitObject)
class(SelfFitObject) # see methods for this class

#-----delete test directory:
unlink(SA_AnalysisDir,recursive=TRUE)