MACPET

BiocStyle::latex

\section{Introduction}

The \Biocpkg{MACPET} package can be used for general analysis of paired-end (PET) data like ChIA-PET. \Biocpkg{MACPET} currently implements the following five stages:

\Biocpkg{MACPET} identifies binding site locations more accurately than other algorithms which use only one end (like MACS) [@macpet]. The output from Stage 3 in \Biocpkg{MACPET} can be used for interaction analysis using either MANGO or MICC, or the user can run State 4 in \Biocpkg{MACPET} for interaction analysis. Note that in the case of using the output from \Biocpkg{MACPET} in MANGO or MICC for interaction analysis, the user should use the self-ligated cut-off found by \Biocpkg{MACPET}, and not the one found in MANGO or MICC. Both of those algorithms allow the user to specify the self-ligated cut-off. MACPET is mainly written in C++, and it supports the \Biocpkg{BiocParallel} package.

Before starting with examples of how to use \Biocpkg{MACPET}, create a test folder to save all the output files of the examples presented in this vignette:

#Create a temporary test folder, or anywhere you want:
SA_AnalysisDir=file.path(tempdir(),"MACPETtest")
dir.create(SA_AnalysisDir)#where you will save the results.

Load the package:

library(MACPET)

\section{\Biocpkg{MACPET} Classes} \label{sec:classes}

\Biocpkg{MACPET} provides five different classes which all inherit from the \Rclass{GInteractions} class in the \Biocpkg{InteractionSet} package. Therefore, every method associated with the \Rclass{GInteractions} class is also applicable to the \Biocpkg{MACPET} classes. Every \Biocpkg{MACPET} class contains information of the PETs associated with the corresponding class, their start/end coordinates on the genome as well as which chromosome they belong to. This section provides an overview of the \Biocpkg{MACPET} classes, while methods associated with each class are presented in latter sections. The classes provided by \Biocpkg{MACPET} are the following:

\subsection{\textcolor{LimeGreen}{\Rclass{PSelf}} Class}

The \textcolor{LimeGreen}{\Rclass{PSelf}} class contains pair-end tag information of self-ligated PETs which is used for binding site analysis.

load(system.file("extdata", "MACPET_pselfData.rda", package = "MACPET"))
class(MACPET_pselfData) #example name
MACPET_pselfData #print method

Extra information of this class is stored as list in the metadata entries with the following elements:

metadata(MACPET_pselfData)

One can also access information about chromosome lengths etc.

seqinfo(MACPET_pselfData)

\subsection{\textcolor{LimeGreen}{\Rclass{PSFit}} Class}

The \textcolor{LimeGreen}{\Rclass{PSFit}} class adds information to the \textcolor{LimeGreen}{\Rclass{PSelf}} class about the peak each PET belongs to, as well as the total number of peaks in each chromosome in the data, p-values and FDR for each peak.

load(system.file("extdata", "MACPET_psfitData.rda", package = "MACPET"))
class(MACPET_psfitData) #example name
MACPET_psfitData #print method

This class updates the Self_info data frame of the \textcolor{LimeGreen}{\Rclass{PSelf}} class with two extra columns: the total regions each chromosome is segmented into (Region.counts) and the total candidate peaks of each chromosome (Peak.counts). Moreover, this class contains a metadata entry which is a matrix containing region and peak IDs for each PET in the data (Classification.Info). Finally, it also contains a metadata entry with information about each peak found (Peaks.Info). Peaks.Info is a data.frame with the following entries:

head(metadata(MACPET_psfitData)$Peaks.Info)

One can also access information about chromosome lengths etc, using
\Rfunction{seqinfo(MACPET_psfitData)}.

\subsection{\textcolor{LimeGreen}{\Rclass{PInter}} Class}

The \textcolor{LimeGreen}{\Rclass{PInter}} class contains pair-end tag information of Inter-chromosomal PETs:

load(system.file("extdata", "MACPET_pinterData.rda", package = "MACPET"))
class(MACPET_pinterData) #example name
MACPET_pinterData #print method

One can also access information about chromosome lengths etc, using
\Rfunction{seqinfo(MACPET_pinterData)}.

It also contains a two-element metadata list with the following elements:

metadata(MACPET_pinterData)

\subsection{\textcolor{LimeGreen}{\Rclass{PIntra}} Class}

The \textcolor{LimeGreen}{\Rclass{PIntra}} class contains pair-end tag information of Intra-chromosomal PETs.

load(system.file("extdata", "MACPET_pintraData.rda", package = "MACPET"))
class(MACPET_pintraData)#example name
MACPET_pintraData#print method

One can also access information about chromosome lengths etc, using
\Rfunction{seqinfo(MACPET_pintraData)}.

It also contains a two-element metadata list with the following elements:

metadata(MACPET_pintraData)

\subsection{\textcolor{LimeGreen}{\Rclass{GenomeMap}} Class}

The \textcolor{LimeGreen}{\Rclass{GenomeMap}} class contains all potential interactions between pairs of peaks, as well as the peaks' anchors.

load(system.file("extdata", "MACPET_GenomeMapData.rda", package = "MACPET"))
class(MACPET_GenomeMapData) #example name
MACPET_GenomeMapData #print method

Extra information of this class is stored as list in the metadata entries with the following elements:

Each row in the metadata entry corresponds to the same row in the main object.

metadata(MACPET_GenomeMapData)

\section{\Biocpkg{MACPET} Methods}

This section describes methods associated with the classes in the \Biocpkg{MACPET} package.

\subsection{summary-method} All \Biocpkg{MACPET} classes are associated with a summary method which sums up the information stored in each class:

\subsubsection{\textcolor{LimeGreen}{\Rclass{PSelf}} Class}

\Rfunction{summary} for \textcolor{LimeGreen}{\Rclass{PSelf}} class prints information about the total number of self-ligated PETs for each chromosome, as well as the total number of self-ligated PETs in the data, their min/max length and genome information of the data:

class(MACPET_pselfData)
summary(MACPET_pselfData)

\subsubsection{\textcolor{LimeGreen}{\Rclass{PSFit}} Class} \Rfunction{summary} for \textcolor{LimeGreen}{\Rclass{PSFit}} class adds information to the \Rfunction{summary} of \textcolor{LimeGreen}{\Rclass{PSelf}} class. The new information is the total regions found and analysed for each chromosome and the total number of candidate binding sites found on each chromosome:

class(MACPET_psfitData)
summary(MACPET_psfitData)

\subsubsection{\textcolor{LimeGreen}{\Rclass{PIntra}} Class} \Rfunction{summary} for \textcolor{LimeGreen}{\Rclass{PIntra}} class prints information about the total number of intra-ligated PETs for each chromosome, as well as information about the genome. The user can choose to plot a heat-map for the total number of intra-ligated PETs on each chromosome:

class(MACPET_pintraData)
requireNamespace("ggplot2")
requireNamespace("reshape2")
summary(MACPET_pintraData,heatmap=TRUE)

\subsubsection{\textcolor{LimeGreen}{\Rclass{PInter}} Class} \Rfunction{summary} for \textcolor{LimeGreen}{\Rclass{PInter}} class prints information about the total number of inter-ligated PETs for each chromosome, as well as information about the genome. The user can choose to plot a heat-map for the total number of inter-ligated PETs connecting the chromosomes:

class(MACPET_pinterData)
requireNamespace("ggplot2")
requireNamespace("reshape2")
summary(MACPET_pinterData,heatmap=TRUE)

\subsubsection{\textcolor{LimeGreen}{\Rclass{GenomeMap}} Class}

\Rfunction{summary} for \textcolor{LimeGreen}{\Rclass{GenomeMap}} class prints information about the total number of interactions in the data. The user can provide a threshold for the FDR cut-off of the significant interactions to make the summary from. Alternatively if threshold=NULL all the interactions will be used for the summary.

class(MACPET_GenomeMapData)
summary(MACPET_GenomeMapData)

\subsection{plot-method} All \Biocpkg{MACPET} classes are associated with a plot method which can be used to visualize counts, PETs in a region, as well as binding sites. Here we give some examples for the usage of the plot methods, however more arguments can be provided to the plot methods, see \Rpackage{MACPET::plot}.

\subsubsection{\textcolor{LimeGreen}{\Rclass{PSelf}} Class} \Rfunction{plot} for \textcolor{LimeGreen}{\Rclass{PSelf}} Class will create a bar-plot showing the total number of self-ligated PETs on each chromosome. The x-axis are the chromosomes and the y-axis are the corresponding frequencies.

requireNamespace("ggplot2")
class(MACPET_pselfData)
# PET counts plot
plot(MACPET_pselfData)

\subsubsection{\textcolor{LimeGreen}{\Rclass{PSFit}} Class} \Rfunction{plot} for \textcolor{LimeGreen}{\Rclass{PSFit}} Class will create a bar-plot (if kind="PeakCounts") showing the total number of candidate binding sites found on each chromosome. The x-axis are the chromosomes and the y-axis are the corresponding frequencies.

class(MACPET_psfitData)
#binding site couts:
plot(MACPET_psfitData,kind="PeakCounts")

Other kind of plots are also supported for this class. For example if kind="PeakPETs", then a visual representation of a region will be plotted (RegIndex chooses which region to plot with 1 meaning the one with the highest total of PETs in it). The x-axis are the genomic coordinates of the region and the y-axis if the sizes of the PETs. Each segment represents a PET from its start to its end coordinate. Different colors of colors represent which binding site each PET belongs to, with red (PeakID=0) representing the noise cluster. Vertical lines represent the exact binding location of the binding site.

# region example with binding sites:
plot(MACPET_psfitData,kind="PeakPETs",RegIndex=1)

\subsubsection{\textcolor{LimeGreen}{\Rclass{PIntra}} Class} \Rfunction{plot} for \textcolor{LimeGreen}{\Rclass{PIntra}} Class will create a bar-plot showing the total number of intra-ligated PETs on each chromosome. The x-axis are the chromosomes and the y-axis are the corresponding frequencies.

class(MACPET_pintraData)
#plot counts:
plot(MACPET_pintraData)

\subsubsection{\textcolor{LimeGreen}{\Rclass{PInter}} Class} \Rfunction{plot} for \textcolor{LimeGreen}{\Rclass{PInter}} Class. Each node represents a chromosome where the size of the node is proportional to the total number of Inter-chromosomal PETs leaving from this chromosome. Edges connect interacting chromosomes where the thickness of each edge is proportional to the total number of Inter-chromosomal PETs connecting the two chromosomes.

class(MACPET_pinterData)
requireNamespace("igraph")
#network plot:
plot(MACPET_pinterData)

\subsubsection{\textcolor{LimeGreen}{\Rclass{GenomeMap}} Class} \Rfunction{plot} for \textcolor{LimeGreen}{\Rclass{GenomeMap}} Class. Different kind of plot can be created using the Type parameter. The user can also specify a threshold for the significant interactions to make the plots from. In the following example, each node represents a chromosome and the edges show which chromosomes have significant interactions between them.

class(MACPET_GenomeMapData)
requireNamespace("igraph")
#network plot:
plot(MACPET_GenomeMapData,Type='network-circle')

\subsection{exportPeaks methods} \textcolor{LimeGreen}{\Rclass{PSFit}} class has a method which exports the binding site information stored in \Rfunction{metadata(object)[['Peaks.Info']]} into csv files in a given directory if one wishes to have the binding sites in an excel file. The user can also specify a threshold for the FDR. If no threshold is specified all the binding sites found by the algorithm are exported.

class(MACPET_psfitData)#PSFit class
exportPeaks(object=MACPET_psfitData,file.out="Peaks",threshold=1e-5,savedir=SA_AnalysisDir)

\subsection{PeaksToGRanges methods} \textcolor{LimeGreen}{\Rclass{PSFit}} class has also a method which converts the binding sites found by the peak-calling algorithm into a \Rclass{GRanges} object with start and end coordinates the binding site's confidence interval (CIQ.Up.start,CIQ.Down.end). It furthermore contains information about the total number of PETs in the peak (TotPETs), the p-value of the peak (p.value) and its FDR (FDR). The user can also specify an FDR threshold for returning significant peaks. If threshold=NULL, all the found peaks are returned.

class(MACPET_psfitData)#PSFit class
object=PeaksToGRanges(object=MACPET_psfitData,threshold=1e-5)
object

\subsection{TagsToGInteractions methods}

\textcolor{LimeGreen}{\Rclass{PSFit}} class has also a method which returns only PETs belonging to peaks (removing noisy or insignificant PETs) as a \Rclass{GInteractions} object. This might be useful if one wishes to visualize the tags belonging to PETs of binding sites on the genome-browser. The user can also specify an FDR threshold for returning significant peaks. If threshold=NULL, all the found peaks are returned.

class(MACPET_psfitData)#PSFit class
TagsToGInteractions(object=MACPET_psfitData,threshold=1e-5)

\subsection{PeaksToNarrowPeak methods} \textcolor{LimeGreen}{\Rclass{PSFit}} class has a method which converts peaks of an object of \textcolor{LimeGreen}{\Rclass{PSFit}} class to narrowPeak object. The object is saved in a user specified directory and can be used in the MANGO or MICC algorithms for interaction analysis. Alternatively, the user can use stage 4 in \Rfunction{MACPETUlt} for running the interaction analysis stage.

class(MACPET_psfitData)#PSFit class
PeaksToNarrowPeak(object=MACPET_psfitData,threshold=1e-5,
                  file.out="MACPET_peaks.narrowPeak",savedir=SA_AnalysisDir)

\subsection{ConvertToPSelf methods}

This method if for the \textcolor{LimeGreen}{\Rclass{GInteractions}} class. It converts a \Rclass{GInteractions} object to \textcolor{LimeGreen}{\Rclass{PSelf}} object. This method could be used in case the user already has the self-ligated PETs separated from the rest of the data and wishes to run a binding site analysis on those only using stage 3 in \Rfunction{MACPETUlt}. The output object will be saved in the user-specified directory.

 #--remove information and convert to GInteractions:
object=MACPET_pselfData
#--remove information and convert to GInteractions:
S4Vectors::metadata(object)=list(NULL)
class(object)='GInteractions'
#----input parameters
S2_BlackList=TRUE
SA_prefix="MACPET"
S2_AnalysisDir=SA_AnalysisDir

ConvertToPSelf(object=object,
               S2_BlackList=S2_BlackList,
               SA_prefix=SA_prefix,
               S2_AnalysisDir=S2_AnalysisDir)
#load object:
rm(MACPET_pselfData)#old object
load(file.path(S2_AnalysisDir,"MACPET_pselfData"))
class(MACPET_pselfData)

\subsection{GetSignInteractions methods} \textcolor{LimeGreen}{\Rclass{GenomeMap}} class has a method which subsets the significant interactions given a user-specified FDR threshold and returns either a \textcolor{LimeGreen}{\Rclass{GInteractions}} class object for the interactions (each row corresponds to one interaction), or it saves the significant interactions into an excel file in a user specified directory. Metadata columns are also provided which given further information about each interaction.

class(MACPET_GenomeMapData)#GenomeMap class
GetSignInteractions(object=MACPET_GenomeMapData,
                     threshold = NULL,
                     ReturnedAs='GInteractions')

\subsection{GetShortestPath methods} \textcolor{LimeGreen}{\Rclass{GenomeMap}} class has a method which finds the length of the shortest path between two user-specified peaks. Currently it only finds the shortest paths between intra-chromosomal peaks. Therefore, the peaks have to be on the same chromosome. The resulting value is a two-element list with the first element named LinearPathLength for the linear length of the path between summits of the two peaks, and the second element named ThreeDPathLength for the 3D length of the shortest path between summits of the two peaks.

class(MACPET_GenomeMapData)#GenomeMap class
GetShortestPath(object=MACPET_GenomeMapData,
                     threshold = NULL,
                     ChrFrom="chr1",
                     ChrTo="chr1",
                     SummitFrom=10000,
                     SummitTo=1000000)

\section{\Biocpkg{MACPET} Supplementary functions}

\subsection{AnalysisStatistics function}

\Rfunction{AnalysisStatistics} function can be used for all the classes of the \Biocpkg{MACPET} package for printing and/or saving statistics of the classes in csv file in a given working directory. Input for Self-ligated PETs of one of the classes (\textcolor{LimeGreen}{\Rclass{PSelf}}, \textcolor{LimeGreen}{\Rclass{PSFit}}) is mandatory, while input for the Intra- and Inter-chromosomal PETs is not.

If the input for the Self-ligated PETs is of \textcolor{LimeGreen}{\Rclass{PSFit}} class, a threshold can be given for the FDR cut-off.

Here is an example:

AnalysisStatistics(x.self=MACPET_psfitData,
                   x.intra=MACPET_pintraData,
                   x.inter=MACPET_pinterData,
                   file.out='AnalysisStats',
                   savedir=SA_AnalysisDir,
                   threshold=1e-5)

\subsection{ConvertToPE_BAM function}

\Rfunction{ConvertToPE_BAM} in case the user has two separate BAM files from read 1 and 2 of the paired data, and needs to pair them in one paired-end BAM file for further analysis in stage 2-3 on the \Rfunction{MACPETUlt} function. The output paired-end BAM file will be saved in the user-specified directory.

Here is an example:

requireNamespace('ggplot2')

#Create a temporary forder, or anywhere you want:
S1_AnalysisDir=SA_AnalysisDir

#directories of the BAM files:
BAM_file_1=system.file('extdata', 'SampleChIAPETDataRead_1.bam', package = 'MACPET')
BAM_file_2=system.file('extdata', 'SampleChIAPETDataRead_2.bam', package = 'MACPET')
SA_prefix="MACPET"

#convert to paired-end BAM:
ConvertToPE_BAM(S1_AnalysisDir=S1_AnalysisDir,
             SA_prefix=SA_prefix,
             S1_BAMStream=2000000,S1_image=TRUE,
             S1_genome="hg19",BAM_file_1=BAM_file_1,
             BAM_file_2=BAM_file_2)

#test if the resulted BAM is paired-end:
PairedBAM=file.path(S1_AnalysisDir,paste(SA_prefix,"_Paired_end.bam",sep=""))
Rsamtools::testPairedEndBam(file = PairedBAM, index = PairedBAM)

bamfile = Rsamtools::BamFile(file = PairedBAM,asMates = TRUE)
GenomicAlignments::readGAlignmentPairs(file = bamfile,use.names = FALSE,
                                with.which_label = FALSE,
                                strandMode = 1)

\section{Peak Calling Workflow}

The main function which the user can use for running a paired-end data analysis is called \Rfunction{MACPETUlt}. It consists of the five stages described in the introduction section. The user may run the whole pipeline/analysis at once using Stages=c(0:4) or step by step using a single stage at a time. The function supports the \Biocpkg{BiocParallel} package.

For the following example we run stages 2 and 4 of the \Rfunction{MACPETUlt} only. The reason is that for running state 1, the bowtie index is needed which is too big for downloading it here.

#give directory of the BAM file:
S2_PairedEndBAMpath=system.file('extdata', 'SampleChIAPETData.bam', package = 'MACPET')

#give prefix name:
SA_prefix="MACPET"

#parallel backhead can be created using the BiocParallel package
#parallel backhead can be created using the BiocParallel package
#requireNamespace('BiocParallel')
#snow <- BiocParallel::SnowParam(workers = 4, type = 'SOCK', progressbar=FALSE)
#BiocParallel::register(snow, default=TRUE)

# packages for plotting:
requireNamespace('ggplot2')

#-run for the whole binding site analysis:
MACPETUlt(SA_AnalysisDir=SA_AnalysisDir,
       SA_stages=c(2:4),
       SA_prefix=SA_prefix,
       S2_PairedEndBAMpath=S2_PairedEndBAMpath,
       S2_image=TRUE,
       S2_BlackList=TRUE,
       S3_image=TRUE,
       S4_image=TRUE,
       S4_FDR_peak=1)# the data is small so use all the peaks found.

#load results:
SelfObject=paste(SA_prefix,"_pselfData",sep="")
load(file.path(SA_AnalysisDir,"S2_results",SelfObject))
SelfObject=get(SelfObject)
class(SelfObject) # see methods for this class

IntraObject=paste(SA_prefix,"_pintraData",sep="")
load(file.path(SA_AnalysisDir,"S2_results",IntraObject))
IntraObject=get(IntraObject)
class(IntraObject) # see methods for this class

InterObject=paste(SA_prefix,"_pinterData",sep="")
load(file.path(SA_AnalysisDir,"S2_results",InterObject))
InterObject=get(InterObject)
class(InterObject) # see methods for this class

SelfFitObject=paste(SA_prefix,"_psfitData",sep="")
load(file.path(SA_AnalysisDir,"S3_results",SelfFitObject))
SelfFitObject=get(SelfFitObject)
class(SelfFitObject) # see methods for this class

GenomeMapObject=paste(SA_prefix,"_GenomeMapData",sep="")
load(file.path(SA_AnalysisDir,"S4_results",GenomeMapObject))
GenomeMapObject=get(GenomeMapObject)
class(GenomeMapObject) # see methods for this class

#-----delete test directory:
unlink(SA_AnalysisDir,recursive=TRUE)

\Rfunction{MACPETUlt} saves its outputs in SA_AnalysisDir in the folders S0_results, S1_results, S2_results, S3_results and S4_results based on the stages run. The output of \Rfunction{MACPETUlt} in those folders is the following:

Stage 0: (output saved in a folder named S0_results in SA_AnalysisDir)

Stage 1: (output saved in a folder named S1_results in SA_AnalysisDir)

Stage 2: (output saved in a folder named S2_results in SA_AnalysisDir)

Stage 3: (output saved in a folder named S3_results in SA_AnalysisDir)

Stage 4: (output saved in a folder named S4_results in SA_AnalysisDir)

Stages 0:4:

Furthermore a log-file named SA_prefix_analysis.log with the progress of the analysis is also saved in the SA_AnalysisDir.



Try the MACPET package in your browser

Any scripts or data that you put into this service are public.

MACPET documentation built on Nov. 8, 2020, 5:47 p.m.