Description Usage Arguments Details Value Stages description Parallel Author(s) References See Also Examples
MACPETUlt
is used for running analysis based on paired-end DNA data, including stages
for linker removal, mapping to the reference genome, PET classification, binding site identification and
interaction analysis.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | MACPETUlt(SA_AnalysisDir = "", SA_stages = c(0:3),
SA_prefix = "MACPET", S0_fastq1 = "", S0_fastq2 = "",
S0_LinkerA = "GTTGGATAAG", S0_LinkerB = "GTTGGAATGT",
S0_MinReadLength = 18, S0_MaxReadLength = 50,
S0_LinkerOccurence = 0, S0_image = TRUE, S0_fastqStream = 2e+06,
S1_fastq1_usable_dir = "", S1_fastq2_usable_dir = "",
S1_image = TRUE, S1_BAMStream = 2e+06, S1_makeSam = TRUE,
S1_genome = "hg19", S1_RbowtieIndexBuild = FALSE,
S1_RbowtieIndexDir = "", S1_RbowtieIndexPrefix = "",
S1_RbowtieRefDir = "", S2_PairedEndBAMpath = "", S2_image = TRUE,
S2_BlackList = TRUE, S3_fileSelfDir = "", S3_image = TRUE,
S3_method = "BH", S4_filePSFitDir = "", S4_filePIntraDir = "",
S4_filePInterDir = NULL, S4_FDR_peak = 0.1, S4_method = "BH",
S4_image = TRUE, S4_minPETs = 2, S4_PeakExt = 500)
|
SA_AnalysisDir |
A directory were all the ouput is to be saved. This parameter is mandatory for every stage. |
SA_stages |
Numeric vector or integer (if stages are run separately). This parameter is mandatory for every stage (see details). |
SA_prefix |
A string which is going to be used as prefix for the outputs (default: 'MACPET'). This parameter is mandatory for every stage. |
S0_fastq1 |
A string with the directory of the 5-end fastq (or fastq.gz) file. This parameter is mandatory if Stage 0 is run (see details). |
S0_fastq2 |
A string with the directory of the 3-end fastq (or fastq.gz) file. This parameter is mandatory if Stage 0 is run (see details). |
S0_LinkerA |
A string with the first linker sequence (default 'GTTGGATAAG'). This parameter is mandatory if Stage 0 is run (see details). |
S0_LinkerB |
A string with the second linker sequence (default 'GTTGGAATGT'). This parameter is mandatory if Stage 0 is run (see details). |
S0_MinReadLength |
A positive integer with the minimum read length after linker trimming (default: 18). This parameter is mandatory if Stage 0 is run (see details). |
S0_MaxReadLength |
A positive integer with the maximum read length after linker trimming (default: 50). This parameter is mandatory if Stage 0 is run (see details). |
S0_LinkerOccurence |
One of the following: 0, 1, 2, 3, 4. This parameter defines the linker-occurence mode (see details). Default 0. |
S0_image |
Logical, indicating if a pie-chart image for the fastq files classification will be produced (default=TRUE). This parameter is mandatory if Stage 0 is run. |
S0_fastqStream |
Positive integer for total lines of fastq files to be loaded in R (best to leave it at default because it might cause memory crash). This parameter is mandatory if Stage 0 is run. |
S1_fastq1_usable_dir |
String with the directory of the 5-end usable fastq (or fastq.gz) files. This parameter might not be mandatory (see details). |
S1_fastq2_usable_dir |
String with the directory of the 3-end usable fastq (or fastq.gz) files. This parameter might not be mandatory (see details). |
S1_image |
Logical indicating if images for the mapping percentage and the pairing percentage will be produced (default=TRUE). This parameter is mandatory if Stage 1 is run. |
S1_BAMStream |
Positive integer for the total number of bam file lines to be loaded in R in a loop for pairing (best to leave it at default because it might cause memory crash). This parameter is mandatory if Stage 1 is run. |
S1_makeSam |
Logical indicating whether the resulted paired-end BAM file will be splitted to two SAM files (one for each read). The output SAM files can be used as input in the MANGO algorithm (default=TRUE). Note, that the user has to remove the SAM header before running MANGO. This parameter is mandatory if Stage 1 is run. |
S1_genome |
String with the genome to be used in the bam file header (default='hg19'). This parameter is mandatory if Stage 1 is run (see details). |
S1_RbowtieIndexBuild |
Logical indicating whether you want to build the bowtie index or not (default=FALSE). This parameter is mandatory if Stage 1 is run (see details). |
S1_RbowtieIndexDir |
String with the directory of the bowtie
index (if |
S1_RbowtieIndexPrefix |
String with the prefix for the bowtie
indeces in |
S1_RbowtieRefDir |
A vector with the directories of the .fa files,
used if |
S2_PairedEndBAMpath |
A string with the directory of the paired-end bam file (or paired-end sam file). This parameter might not be mandatory (see details). |
S2_image |
Logical indicating whether images for the s elf-ligated/intra-chromosomal cut-off as well as pie-charts for the PET classification will be produced (default=TRUE). This parameter is mandatory if Stage 1 is run. |
S2_BlackList |
Logical indicating whether black-listed
regions will be removed from the data
based on the |
S3_fileSelfDir |
A string with the directory of the of the object of
class |
S3_image |
Logical indicating whether images for the binding site's FDR, sizes of the binding sites, sizes of binding site's upstream/downstream peaks will be created. This parameter is mandatory if Stage 3 is run. |
S3_method |
String with the FDR method used for finding
p-values of significant peaks in the data.
See |
S4_filePSFitDir |
A string with the directory of the object of
class |
S4_filePIntraDir |
A string with the directory of the object of
class |
S4_filePInterDir |
A string with the directory of the object of
class |
S4_FDR_peak |
A numeric for the FDR cut-off value of significant peaks to be used in interaction analysis (default 0.1 because MACPET is more strict in peak-calling than MACS, higher threshold than 0.1 might also be appropriate). This parameter is mandatory if Stage 4 is run. |
S4_method |
String with the FDR method used for finding
p-values of significant interactions in the data.
See |
S4_image |
Logical indicating whether images for the total peaks and total intra- and inter-chromosomal PETs used in the interaction analysis will be created. This parameter is mandatory if Stage 4 is run. |
S4_minPETs |
The minimum total PETs allowed between every two peaks (default=2). This parameter is mandatory if Stage 4 is run. |
S4_PeakExt |
Integer for the number of bp to extend each peak on either side (default 500). This parameter will both affect the interaction PETs falling into the peak and the merging of overlapping peaks. This parameter is mandatory if Stage 4 is run. |
Every stage has parameters associated with it. Parameters with prefix SA
correspond to all
stages, S0
to Stage 0, S1
to Stage 1 etc. Parameters with SA
prefix are mandatory
for every stage.
If SA_stages
parameter is given as vector, then the vector has to be continuous,
that is for example c(0:3) or c(2:3), not c(0,2,3). In general the best practice is to run
all the stages at once.
The fastq files in S0_fastq1
and S0_fastq2
have to be of same length and be sorted
by their ID. Furthermore, the IDs in S0_fastq1
have to end with /1 and the ones in
S0_fastq2
with /2, representing the 5- and 3-end tags respectively. In other words,
for the same line in S0_fastq1
and S0_fastq2
, their IDs have to be
identical, except form their suffixes /1 and /2 respectively. Moreover, the '/' symbol
can be replaced with any other symbol, this will not cause any problems.
S0_LinkerOccurence
parameter defines the linker-occurence mode and separates the
usable from the ambiguous PETs. PETs with both reads including linkers are
not affeted by S0_LinkerOccurence
. Also, reads which do not meet the
S0_MaxReadLength
/S0_MinReadLength
lengths, are moved to
ambiguous anyway. The four values of S0_LinkerOccurence
are:
Mode 0:
Both reads have to include a linker in order to be checked as usable or chimeric, if they dont, they are moved to ambiguous.
Mode 1:
If read 1 is not matching any linker, but read 2 does, then the PET will be moved to usable.
Mode 2:
If read 2 is not matching any linker, but read 1 does, then the PET will be moved to usable.
Mode 3:
If any of the reads does not match any linker then the PET they will be moved to usable.
Mode 4:
If both reads do not match any of the linkers, then the PET will be moved to usable.
S0_MaxReadLength
has to be greater than S0_MinReadLength
. The user should leave
those two at default unless the PET data is produced by tagmentation.
S1_fastq1_usable_dir
and S1_fastq2_usable_dir
are not mandatory if Stage 0 is
run right before Stage 1 (SA_stages
=c(0,1)). Those two are only mandatory if
Stage 1 is run separately. Then those parameters assume to have the usable reads only. The same
fastq specifications apply as those for S0_fastq1
and S0_fastq2
.
The parameter S1_genome
is very important. First the genome name given in S1_genome
should be the same as the one used for building the bowtie index for mapping.
This parameter will add an 'AS' column to the paired-end bam file with the genome information.
In Stage 2, this header will be used for identifying which kind on black-listed regions
to use if S2_BlackList==TRUE
.
If S1_RbowtieIndexBuild==FALSE
then the bowtie index is assumed to be already built and saved in
S1_RbowtieIndexDir
. Then the S1_RbowtieIndexDir
folder should include the following
files: S1_RbowtieIndexPrefix.1.ebwt
, S1_RbowtieIndexPrefix.2.ebwt
,
S1_RbowtieIndexPrefix.3.ebwt
, S1_RbowtieIndexPrefix.4.ebwt
,
S1_RbowtieIndexPrefix.rev.1.ebwt
and S1_RbowtieIndexPrefix.rev.2.ebwt
,
or with .ebwtl. Where S1_RbowtieIndexPrefix
is also given as input.
If S1_RbowtieIndexBuild==TRUE
then the bowtie index will be build using the
bowtie_build
function. This function will need the .fa files
which should be given as input in the S1_RbowtieRefDir
vector. This is a character vector
with the directories of the .fa files to use. The output index will be saved in
S1_RbowtieIndexDir
. if S1_RbowtieIndexBuild==FALSE
then S1_RbowtieRefDir
can be an empty string.
The parameter S2_PairedEndBAMpath
has to be specified only if Stage 2 is run without running
Stage 1 right before (SA_Stages=c(2) or c(2,3), not c(1,2) or c(0,1,2) for example).
If this is the case, the S2_PairedEndBAMpath
has to be the path to the BAM/SAM paired-end file.
The file has to include the header with the 'SN', 'LN' and 'AS' columns. Moreover the mate flags
of the file have to be correct and also the duplicated PETs must be flagged too. Stage 2 will
upload the whole data in R using readGAlignmentPairs
function
with flags isDuplicate=FALSE
and isPaired=TRUE
. So if duplicated PETs are not
flagged, they will be used in the analysis. If the previous
stages are run in sequence, then S2_PairedEndBAMpath
will be overwritten with
the newly created BAM file, which will have the correct flags.
If S2_BlackList==TRUE
then which genome black-list is going to be used
is decided by the 'AS' column in the S2_PairedEndBAMpath
file, which is
specified by the S1_genome
if Stage 1 is also run. The black-listed regions cover
the following genomes: 'hg19', 'ce10', 'dm3', 'hg38', 'mm9', 'mm10'.
If the 'AS' header column is missing from the S2_PairedEndBAMpath
file, or
if the S1_genome
is not matching any of the above named genomes, then
a warning will be produced saying that no black-listed regions will be removed.
Alternatily, the user can provide its own black-listed regions as a GRanges
object.
The parameter S3_fileSelfDir
is not mandatory if the stages are run in sequence,
if Stage 2 is run right before stage 3. If this is the case then S3_fileSelfDir
will be overwritten with the data produced in Stage 2. If Stage 3 is run
separately, then S3_fileSelfDir
has to be provided. It should be
a PSelf
object and both the name of the object in the directory
and the one uploaded in R should be SA_prefix_pselfData
.
All outputs are saved at the SA_AnalysisDir
. The output depents of the stages run:
S0_results
in SA_AnalysisDir
)SA_prefix_usable_1.fastq.gz:
fastq.gz files with the usable 5-end tags. To be used in Stage 1.
SA_prefix_usable_2.fastq.gz:
fastq.gz files with the usable 3-end tags. To be used in Stage 1.
SA_prefix_chimeric_1.fastq.gz:
fastq.gz files with the chimeric 5-end tags.
SA_prefix_chimeric_2.fastq.gz:
fastq.gz files with the chimeric 3-end tags.
SA_prefix_ambiguous_1.fastq.gz:
fastq.gz files with the ambiguous 5-end tags.
SA_prefix_ambiguous_2.fastq.gz:
fastq.gz files with the ambiguous 3-end tags.
SA_prefix_stage_0_image.jpg:
Pie chart image with the split of two fastq files used as input (if S0_image==TRUE
).
S1_results
in SA_AnalysisDir
)SA_prefix_usable_1.sam:
sam file with the mapped 5-end reads (if S1_makeSam==FALSE
).
SA_prefix_usable_2.sam:
sam file with the mapped 3-end reads (if S1_makeSam==FALSE
).
SA_prefix_Paired_end.bam:
paired-end bam file with the mapped PETs. To be used in Stage 2
SA_prefix_Paired_end.bam.bai:
.bai file for SA_prefix_Paired_end.bam
. To be used in Stage 2.
SA_prefix_stage_1_p1_image.jpg:
Pie-chart for the mapped/unmapped reads from SA_prefix_usable_1.sam
, SA_prefix_usable_2.sam
(if S1_image==TRUE
).
SA_prefix_stage_1_p2_image.jpg:
Pie-chart for the paired/unpaired reads of SA_prefix_Paired_end.bam
(if S1_image==TRUE
).
S2_results
in SA_AnalysisDir
)SA_prefix_pselfData:
An object of PSelf
class with
the Self-ligated PETs. To be used in Stage 3.
SA_prefix_pintraData:
An object of PIntra
class with
the Intra-chromosomal PETs.
SA_prefix_pinterData:
An object of PInter
class with
the Inter-chromosomal PETs.
SA_prefix_stage_2_p1_image.jpg:
Pie-chart reliable/dublicated/black-listed PETs of SA_prefix_Paired_end.bam
(if S2_image==TRUE
).
SA_prefix_stage_2_p2_image.jpg:
Histogram with the self-ligated/intra-chromosomal cut-off for SA_prefix_Paired_end.bam
(if S2_image==TRUE
).
SA_prefix_stage_2_p3_image.jpg:
Pie-chart for the self-ligated/intra-chromosomal/inter-chromosomal PETs of SA_prefix_Paired_end.bam
(if S2_image==TRUE
).
S3_results
in SA_AnalysisDir
)SA_prefix_psfitData:
An object of PSFit
class with the peak information.
SA_prefix_stage_3_p1_image.jpg:
Sizes of the upstream vs downstream peaks of each binding site given the binding site's FDR (if S3_image==TRUE
).
SA_prefix_stage_3_p2_image.jpg:
FDR of the binding sites. The horizontal red line is at FDR=0.05 (if S3_image==TRUE
).
SA_prefix_stage_3_p3_image.jpg:
Comparison of binding site sizes given their FDR (if S3_image==TRUE
).
SA_prefix_stage_3_p3_image.jpg:
FDR for the upstream/donwstream peaks of the binding sites given the binding sites FDR (if S3_image==TRUE
).
S4_results
in SA_AnalysisDir
)SA_prefix_GenomeMapData:
An object of GenomeMap
class with the interactions information.
SA_prefix_stage_4_p1_image.jpg:
Pie charts with the total peaks involved in the interactions and the total intra/inter-chromosomal PETs involved in the interactions (if S4_image==TRUE
).
All the above outputs. Furthermore, a log file named SA_prefix_analysis.log
is always
created in SA_AnalysisDir
with information about the process.
MACPETUlt
runs a complete or partial analysis
for PET data, depending on the stages of the analysis the user wants to run.
The stages of the analysis are the following:
Linker identification stage: This stage uses the two fastq files for the 5- and 3-end tags and identifies
which tags contain any of the linkers. Based on the
linker combinations it classifies the PETs as usable (linkers A/A or B/B),
chimeric (linkers A/B or B/A)
and ambiguous (linkers non/A, non/B, A/non, B/non unless chosen otherwise by S0_LinkerOccurence
,
or be smaller/bigger than the S0_MinReadLength
/S0_MaxReadLength
after the linker removal, respectively).
Only usable PETs are considered in the subsequent steps.
PET mapping stage: This stage uses the usable PETs identified by stage 0. It maps them separately to the
reference genome using the bowtie
function with no mismatch per read,
and keeps the uniquely mapped reads only. It then maps the unmapped reads to the reference genome
with at most one mismatch and keeps the uniquely mapped reads. Uniquely mapped reads with
zero or one mismatch are then merged and paired, their duplicates are marked and a paired-end bam file
is created which is used in State 2.
PET classification stage: This stage takes the BAM paired-end file from stage 1 and classifies the PETs as: Inter-chromosomal PETs (which connect two different chromosomes), Intra-chromosomal PETs (which connect regions of the same chromosome) and Self-ligated PETs (which are used for binding site analysis). Self-ligated PETs are used for finding the protein binding sites (peaks), while Intra- and Inter-chromosomal are used for interactions between the peaks. The algorithm uses the elbow-method to seperate the Self-ligated from the Intra-chromosomal population. Note that loading the data into R might take a while depending on the size of the data.
Peak calling stage: This stage uses the Self-ligated PETs and it runs the EM algorithm to find clusters which represent candidate peaks/binding sites in 2 dimentional space using skewed generalized students-t distributions (SGT). After the peak-calling analysis is done, the algorithm assesses the significance of the candidate peaks using a local Poisson model.
Interaction analysis stage: This stage uses the peaks found in stage 3 and the intra-chromosomal
PETs found in stage 2 to discover interactions between those peaks. Inter-chromosomal PETs are not supported yet.
It first estimates the expected number of PETs in each
interaction under H0 (everything is random), based on the depth of the peaks which interact and their distance.
It then uses the Poisson distribution to call for significant interactions.
The resulting object will include all potential interactions found in the data. Then the function
GetSignInteractions
can be used for subsetting significant interactions. Therefore,
this stage needs only to be run once, then the FDR cut-off of the most significant interactions
can be given multiple times in the GetSignInteractions
function. This stages will estimate
the interactions in circles. In each circle the most significant interactions
will be added to the model (those with the lowest FDR). Then the distances between the peaks of the significant
interactions will become 0 and the rest of the distances (for the interactions not added yet) will
be adjusted based on that. Then the second circle will run and so on.
All stages can be run in parallel using the
register
function. The user has to register a parallel backhead before starting the
function.
Ioannis Vardaxis, ioannis.vardaxis@ntnu.no
Vardaxis I, Drabl<c3><b8>s F, Rye M and Lindqvist BH (2018). MACPET: Model-based Analysis for ChIA-PET. To be published.
Vardaxis I, Drabl<c3><b8>s F, Rye M and Lindqvist BH (2018). MACPET: Complete pipeline for ChIA-PET. To be published.
Consortium EP (2012) An integrated encyclopedia of DNA elements in the human genome.. Nature, 489(7414), pp. 57<e2><80><93>74. http://dx.doi.org/10.1038/nature11247.
PSelf, PIntra,
PInter, PSFit, GenomeMap,
summary
,
AnalysisStatistics
, plot
BiocParallel
, ConvertToPSelf
,
exportPeaks
, TagsToGInteractions
,
PeaksToGRanges
, PeaksToNarrowPeak
,
ConvertToPE_BAM
, GetSignInteractions
,
GetShortestPath
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 | #Create a temporary forder, or anywhere you want:
SA_AnalysisDir=file.path(tempdir(),'MACPETtest')
dir.create(SA_AnalysisDir)#where you will save the results
#give directory of the BAM file:
S2_PairedEndBAMpath=system.file('extdata', 'SampleChIAPETData.bam', package = 'MACPET')
#give prefix name:
SA_prefix='MACPET'
#parallel backhead can be created using the BiocParallel package
#parallel backhead can be created using the BiocParallel package
#requireNamespace('BiocParallel')
#snow <- BiocParallel::SnowParam(workers = 4, type = 'SOCK', progressbar=FALSE)
#BiocParallel::register(snow, default=TRUE)
#-run for the whole binding site analysis:
MACPETUlt(SA_AnalysisDir=SA_AnalysisDir,
SA_stages=c(2:3),
SA_prefix=SA_prefix,
S2_PairedEndBAMpath=S2_PairedEndBAMpath,
S2_image=TRUE,
S2_BlackList=TRUE,
S3_image=TRUE)
#load results:
SelfObject=paste(SA_prefix,'_pselfData',sep='')
load(file.path(SA_AnalysisDir,'S2_results',SelfObject))
SelfObject=get(SelfObject)
class(SelfObject) # see methods for this class
IntraObject=paste(SA_prefix,'_pintraData',sep='')
load(file.path(SA_AnalysisDir,'S2_results',IntraObject))
IntraObject=get(IntraObject)
class(IntraObject) # see methods for this class
InterObject=paste(SA_prefix,'_pinterData',sep='')
load(file.path(SA_AnalysisDir,'S2_results',InterObject))
InterObject=get(InterObject)
class(InterObject) # see methods for this class
SelfFitObject=paste(SA_prefix,'_psfitData',sep='')
load(file.path(SA_AnalysisDir,'S3_results',SelfFitObject))
SelfFitObject=get(SelfFitObject)
class(SelfFitObject) # see methods for this class
#-----delete test directory:
unlink(SA_AnalysisDir,recursive=TRUE)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.