R/callMethylation.R

#' callMethylation
#'
#' @description
#' This function calls cytosine methylation and stores calls in BAM files.
#'
#' @details
#' The function makes cytosine methylation calls for short-read methylation
#' (bisulfite/enzymatic) sequencing alignments from input BAM file and writes
#' them in the XM tag of the output BAM file. Calls are made on the basis of
#' reference (e.g., genomic) sequence and observed sequence and cytosine context
#' of reads. Data reading/processing is done by means of HTSlib,
#' therefore it is possible to significantly (>5x) speed up the calling
#' using several (4-8) HTSlib decompression threads.
#' 
#' Methylation calling with this function is only possible for sequencing data
#' obtained using either bisulfite or other similar sequencing method
#' (enzymatic methylation sequencing). Cytosine methylation in long-read,
#' native DNA sequencing alignments should be called using other, appropriate
#' tools.
#' 
#' It is a requirement that the genomic strand the read was aligned
#' to is known. This information is typically stored in XG tag of
#' Bismark/Illumina BAM files, or in YD tag of BWA-meth alignment files, or in
#' ZS tag of BSMAP alignment files.
#' `epialleleR` is aware of that and will use the whichever tag is available.
#' 
#' The sequence context of cytosines (h/H for "CHH", x/X for "CHG",
#' z/Z for "CG") is determined based on the
#' actual (observed) sequence of the read. E.g., if read "ACGT" was aligned
#' to the forward strand of reference sequence "ACaaGT" with the CIGAR string
#' "2M2D2M" (2 bases match,
#' 2 reference bases are deleted, 2 bases match), then methylation call string
#' will be ".Z.." (in contrast to the reference's one of ".H...."). This
#' makes cytosine calls nearly identical to ones produced by Bismark Bisulfite
#' Read Mapper and Methylation Caller or Illumina DRAGEN Bio IT Platform,
#' however with one important distinction: `epialleleR` reports sequence context
#' of cytosines followed by unknown bases ("CNN") as "H.." instead of "U.."
#' (unknown; as for example Illumina DRAGEN Bio IT Platform does). Similarly,
#' forward strand context of "CNG" is reported as "X..", forward strand context
#' of "CGN" -> "Z..", reverse strand context of "NNG" -> "..H", reverse strand
#' context of "CNG" -> "..X", reverse strand context of "NCG" -> "..Z".
#' Both lowercase and uppercase ACGTN symbols in reference sequence are allowed
#' and correctly recognised, however all the other symbols (e.g., extended IUPAC
#' symbols, MRSVWYHKDB) within sequences are converted to N.
#' 
#' As a reference sequence, the function expects either location of
#' (preferably `bgzip`ped) FASTA file or an object obtained by
#' \code{\link{preprocessGenome}}. The latter is recommended if methylation
#' calling is to be performed on multiple BAM files.
#' 
#' The alignment records of the output BAM file will contain additional XM tag
#' with the methylation
#' call string for every mapped read which did not have XM tag available.
#' Besides that, XG tag with reference sequence strand ("CT" or "GA") is added
#' to such reads in case it wasn't present.
#' 
#' Please note that for the purpose of methylation calling, the very same
#' reference genome must be used for both alignment (when BAM is produced) and
#' calling cytosine methylation by \code{\link{callMethylation}} method.
#' Exception is thrown if reference sequence header of BAM file doesn't match
#' reference sequence data provided (this matching is performed on the basis
#' of names and lengths of reference sequences).
#'
#' @param input.bam.file input BAM file location string.
#' @param output.bam.file output BAM file location string.
#' @param genome reference (genomic) sequences file location string or an
#' output of \code{\link{preprocessGenome}}.
#' @param nthreads non-negative integer for the number of additional HTSlib
#' threads to be used during file decompression (default: 1).
#' @param verbose boolean to report progress and timings (default: TRUE).
#' @return list object with simple statistics of processed ("nrecs") records
#' and calls made ("ncalled"). Even though "ncalled" can be less than "nrecs"
#' (e.g., because not all reads are mapped), all records from the input BAM are
#' written to the output BAM.
#' @seealso \code{\link{preprocessGenome}} for preloading reference sequences
#' and `epialleleR` vignettes for the description of usage and sample data.
#' 
#' \href{https://www.bioinformatics.babraham.ac.uk/projects/bismark/}{Bismark} Bisulfite Read Mapper and Methylation Caller,
#' \href{https://arxiv.org/abs/1401.1129}{bwa-meth} for fast and accurate alignment of long bisulfite-seq reads,
#' \href{https://doi.org/10.1186/1471-2105-10-232}{BSMAP}: whole genome bisulfite sequence MAPping program,
#' or info on \href{https://support-docs.illumina.com/SW/dragen_v42/Content/SW/DRAGEN/MPipelineMeth_fDG.htm}{Illumina DRAGEN Bio IT Platform}.
#' @examples
#'   callMethylation(
#'     input.bam.file=system.file("extdata", "test", "dragen-se-unsort-xg.bam", package="epialleleR"),
#'     output.bam.file=tempfile(pattern="output-", fileext=".bam"),
#'     genome=system.file("extdata", "test", "reference.fasta.gz", package="epialleleR")
#'   )
#' @export
callMethylation <- function (input.bam.file, output.bam.file, genome,
                             nthreads=1,
                             verbose=TRUE)
{
  genome <- preprocessGenome(genome.file=genome, nthreads=nthreads,
                             verbose=verbose)
  
  result <- .callMethylation(input.bam.file=input.bam.file,
                             output.bam.file=output.bam.file,
                             genome=genome, nthreads=nthreads,
                             verbose=verbose)
  return(result)
}
BBCG/epialleleR documentation built on March 24, 2024, 11:32 p.m.