processForSegmentation: Process read counts from BAM files to prepare input for...

View source: R/processForSegmentation.R

processForSegmentation R Documentation

Process read counts from BAM files to prepare input for segmentation algorithms

Description

processForSegmentation is a wrapper function that reads in BAM files and carries out binning, filtering, bias correcting, smoothing and normalizing of the read counts using functions of the QDNAseq package.

Usage

processForSegmentation(bamfiles = NULL, bamnames = NULL,
  refSamples = NULL, pathToBams = NULL, ext = "bam", binSize = NULL,
  genome = "hg19", outputType = "CNAclinicData",
  typeOfPreMadeBins = "SR50", userMadeBins = NULL,
  cache = getOption("QDNAseq::cache", FALSE), minMapq = 20,
  pairedEnds = NULL, isPaired = NA, isProperPair = NA,
  isUnmappedQuery = FALSE, hasUnmappedMate = NA, isMinusStrand = NA,
  isMateMinusStrand = NA, isFirstMateRead = NA, isSecondMateRead = NA,
  isSecondaryAlignment = NA, isDuplicate = FALSE, residualFilter = TRUE,
  blacklistFilter = TRUE, mappabilityFilter = 15,
  chromosomesFilter = c("X", "Y", "M", "MT"), spanForLoess = 0.65,
  familyForLoess = "symmetric", maxIterForCorrection = 1,
  cutoffForCorrection = 4, variablesForCorrection = c("gc", "mappability"),
  methodOfCorrection = "ratio", methodOfNormalization = "median",
  logTransformForSmoothing = TRUE, skipMedianNormalization = FALSE,
  skipOutlierSmoothing = FALSE, saveCountData = FALSE,
  filename = "corrected_QDNAseqCopyNumbers")

Arguments

bamfiles

A character vector of BAM file names, with or without the full path. If NULL (default), all files with extension .bam are read from the directory given by pathToBams.

bamnames

An optional character vector of sample names. Defaults to file names with extension .bam removed. bamnames must be provided if refSamples is not NULL.

refSamples

An optional character vector of the reference sample names to be used in normalizing each sample in bamnames. Defaults to NULL. If not NULL, refSamples must be the same length as bamnames and should only include sample names contained in bamnames. See the vignette for further details.
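The two constraints stated for refSamples can be checked before calling the function. The following is a minimal sketch only: the sample names and the helper checkRefSamples are hypothetical, the positional pairing of refSamples with bamnames is an assumption, and the vignette may allow further placeholder values that this check does not cover.

```r
# Hypothetical sample names; in practice these label the BAM files.
bamnames   <- c("tumourA", "tumourB", "normalA")
# Assumed pairing: each entry names the reference for the sample at the
# same position in bamnames.
refSamples <- c("normalA", "normalA", "normalA")

# Hypothetical helper: checks only the two documented constraints
# (same length as bamnames, and every name contained in bamnames).
checkRefSamples <- function(bamnames, refSamples) {
  if (is.null(refSamples)) return(TRUE)
  length(refSamples) == length(bamnames) && all(refSamples %in% bamnames)
}

checkRefSamples(bamnames, refSamples)
```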

pathToBams

If bamfiles is NULL, all files ending with the ".bam" extension will be read from this path. If NULL, defaults to the current working directory.

ext

Input files extension. Defaults to "bam".

binSize

A numeric scalar specifying the width of the bins in units of kbp (1000 base pairs), e.g. binSize=50 corresponds to 50 kbp bins.
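As a quick illustration of the unit conversion, the sketch below turns a binSize given in kbp into base pairs. The 1 Mbp region is a made-up example, not something the package defines.

```r
binSize    <- 50               # kbp, as passed to processForSegmentation
binWidthBp <- binSize * 1000   # width of one bin in base pairs

# Illustrative only: number of bins needed to cover a hypothetical
# 1 Mbp region at this bin width.
regionLengthBp <- 1e6
nBins <- ceiling(regionLengthBp / binWidthBp)
nBins
```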

genome

Genome build used to align sequencing reads. Currently, CNAclinic only allows "hg19" (default). Also see: userMadeBins

outputType

Return an object of class "QDNAseqCopyNumbers" or "CNAclinicData" (default).

typeOfPreMadeBins

A character string specifying the read type (single/paired) and length used to generate the pre-made annotation, e.g. "SR50" (default) or "PE100".

userMadeBins

An optional data.frame or an AnnotatedDataFrame object containing bin annotations created using the createBins function. Consult the QDNAseq vignette for further information.

cache

Whether to read and write intermediate cache files, which speeds up subsequent analyses of the same files. Requires packages R.cache and digest (both available on CRAN) to be installed. Defaults to getOption("QDNAseq::cache", FALSE).

minMapq

If quality scores exist, the minimum quality score required in order to keep a read (20, default).

pairedEnds

A boolean value or vector specifying whether the BAM files contain paired-end data or not.

isPaired

A logical(1) indicating whether unpaired (FALSE), paired (TRUE), or any (NA, default) read should be returned.

isProperPair

A logical(1) indicating whether improperly paired (FALSE), properly paired (TRUE), or any (NA, default) read should be returned. A properly paired read is defined by the alignment algorithm and might, e.g., represent reads aligning to identical reference sequences and with a specified distance.

isUnmappedQuery

A logical(1) indicating whether unmapped (TRUE), mapped (FALSE, default), or any (NA) read should be returned.

hasUnmappedMate

A logical(1) indicating whether reads with mapped (FALSE), unmapped (TRUE), or any (NA, default) mate should be returned.

isMinusStrand

A logical(1) indicating whether reads aligned to the plus (FALSE), minus (TRUE), or any (NA, default) strand should be returned.

isMateMinusStrand

A logical(1) indicating whether mate reads aligned to the plus (FALSE), minus (TRUE), or any (NA, default) strand should be returned.

isFirstMateRead

A logical(1) indicating whether the first mate read should be returned (TRUE) or not (FALSE), or whether mate read number should be ignored (NA, default).

isSecondMateRead

A logical(1) indicating whether the second mate read should be returned (TRUE) or not (FALSE), or whether mate read number should be ignored (NA, default).

isSecondaryAlignment

A logical(1) indicating whether alignments that are primary (FALSE), are not primary (TRUE) or whose primary status does not matter (NA, default) should be returned.

isDuplicate

A logical(1) indicating that un-duplicated (FALSE, default), duplicated (TRUE), or any (NA) reads should be returned.
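The flag arguments from isPaired to isDuplicate share the same three-valued logic: TRUE or FALSE selects reads with that flag value, and NA applies no restriction. The sketch below illustrates this on a toy table; the helper matchesFlag and the table are hypothetical and do not reflect how the package reads BAM files.

```r
# Toy read table; the columns mirror two of the flag arguments above.
reads <- data.frame(
  isUnmappedQuery = c(FALSE, TRUE, FALSE, FALSE),
  isDuplicate     = c(FALSE, FALSE, TRUE, FALSE)
)

# Hypothetical helper: NA means "no restriction"; otherwise the read's
# flag must equal the requested value.
matchesFlag <- function(flagValues, wanted) {
  if (is.na(wanted)) rep(TRUE, length(flagValues)) else flagValues == wanted
}

# Defaults from the Usage section: mapped, non-duplicate reads.
keep <- matchesFlag(reads$isUnmappedQuery, FALSE) &
  matchesFlag(reads$isDuplicate, FALSE)
which(keep)
```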

residualFilter

Either a logical specifying whether to filter based on loess residuals of the calibration set or if a numeric, the number of standard deviations to use as the cutoff. Default is TRUE, which corresponds to 4.0 standard deviations.

blacklistFilter

Either a logical specifying whether to filter based on overlap with ENCODE blacklisted regions, or if numeric, the maximum percentage of overlap allowed. Default is TRUE, which corresponds to no overlap allowed (i.e. value of 0).

mappabilityFilter

A numeric in [0,100] to specify filtering out bins with mappabilities lower than the number specified (15, default). FALSE will not filter based on mappability.

chromosomesFilter

A character vector specifying which chromosomes to filter out. Defaults to the sex chromosomes and mitochondrial reads, i.e. c("X", "Y", "M", "MT"). Use NA to use all chromosomes.
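The blacklist, mappability and chromosome filters described above can be pictured as a logical mask over the bins. This is a rough sketch on a toy data frame, not the package's implementation; the column names are assumptions made for the example.

```r
# Toy bin annotation; column names are assumptions for this sketch.
bins <- data.frame(
  chromosome  = c("1", "1", "2", "X", "MT"),
  mappability = c(90, 10, 55, 80, 70),   # percent
  blacklist   = c(0, 0, 5, 0, 0)         # percent overlap with blacklist
)

blacklistFilter   <- TRUE
mappabilityFilter <- 15
chromosomesFilter <- c("X", "Y", "M", "MT")

# TRUE for blacklistFilter corresponds to a maximum allowed overlap of 0.
maxBlacklist <- if (isTRUE(blacklistFilter)) 0 else blacklistFilter

keep <- rep(TRUE, nrow(bins))
if (!isFALSE(maxBlacklist)) {
  keep <- keep & bins$blacklist <= maxBlacklist
}
if (!isFALSE(mappabilityFilter)) {
  keep <- keep & bins$mappability >= mappabilityFilter
}
if (!all(is.na(chromosomesFilter))) {
  keep <- keep & !(bins$chromosome %in% chromosomesFilter)
}

bins[keep, ]
```

With these toy values only the first bin survives: the others fail the mappability, blacklist or chromosome filter respectively.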

spanForLoess

The span parameter of stats::loess (alpha), which controls the degree of smoothing. Default is 0.65.

familyForLoess

The family parameter of stats::loess: if "gaussian", fitting is by least squares; if "symmetric" (default), a re-descending M estimator is used with Tukey's biweight function.
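To make the effect of the two loess arguments concrete, the sketch below fits stats::loess to synthetic data with both families. It uses a single predictor and invented data, so it only illustrates the span and family settings and is not the model the package fits.

```r
set.seed(1)
# Synthetic data: a smooth GC-like trend with one gross outlier.
n      <- 200
gc     <- seq(30, 60, length.out = n)
counts <- 100 - 0.05 * (gc - 45)^2 + rnorm(n, sd = 2)
counts[100] <- 400

# span and family take the documented default values.
fitRobust <- loess(counts ~ gc, span = 0.65, family = "symmetric")
# Least-squares alternative for comparison.
fitLS <- loess(counts ~ gc, span = 0.65, family = "gaussian")

# Fitted values at the outlying point under each family.
c(symmetric = unname(fitted(fitRobust)[100]),
  gaussian  = unname(fitted(fitLS)[100]))
```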

maxIterForCorrection

An integer(1) specifying the maximum number of iterations to perform. Default is 1. If larger, after the first loess fit, bins with median residuals larger than cutoffForCorrection are removed, and the fitting is repeated until the list of bins to use stabilizes or maxIterForCorrection iterations have been performed.

cutoffForCorrection

A numeric(1) specifying the cutoff, in standard deviations (as estimated with matrixStats::madDiff), above which bins are removed based on their median residuals. Default is 4. Not used if maxIterForCorrection = 1 (default).

variablesForCorrection

A character vector specifying which variables to include in the correction. Can be c("gc", "mappability") (the default) or "gc", or "mappability".

methodOfCorrection

A character string specifying the correction method. "ratio" (default) divides counts by the fit. "median" calculates the median fit, and defines the correction for bins with GC content gc and mappability map as median(fit) - fit(gc,map), which is added to counts. "none" leaves counts untouched.
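The three correction methods can be worked through on a handful of numbers. The values below are invented for illustration; fit stands in for the loess fit evaluated at each bin's GC content and mappability.

```r
# Toy values for four bins.
counts <- c(120, 80, 100, 150)
fit    <- c(110, 90, 100, 125)

# "ratio": divide counts by the fit.
correctedRatio <- counts / fit

# "median": add median(fit) - fit to the counts.
correctedMedian <- counts + (median(fit) - fit)

# "none": leave counts untouched.
correctedNone <- counts

correctedMedian
```

Here median(fit) is 105, so the "median" method shifts the four bins by -5, 15, 5 and -20.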

methodOfNormalization

A character string specifying the normalization method. Choices are "mean", "median" (default), or "mode".

logTransformForSmoothing

If TRUE (default), data will be log2-transformed for smoothing.

skipMedianNormalization

Skip the median normalization step if TRUE. Recommended when normalizing by refSamples. Default is FALSE.

skipOutlierSmoothing

Skip the outlier smoothing step if TRUE. Default is FALSE.

saveCountData

Save an object of class QDNAseqCopyNumbers after the GC/mappability correction step. Default is FALSE.

filename

File name under which to save the QDNAseqCopyNumbers object when saveCountData is TRUE. Default is "corrected_QDNAseqCopyNumbers".

Value

Returns an object of class CNAclinicData (default) or QDNAseqCopyNumbers, as selected by outputType.

Author(s)

Dineika Chandrananda

See Also

Internally, the following functions of the QDNAseq package are used: getBinAnnotations, binReadCounts, applyFilters, estimateCorrection, correctBins, normalizeBins, smoothOutlierBins and compareToReference.

Examples

     ## Not run: 
      vignette("CNAclinic")
     
## End(Not run)
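A call might look like the following sketch, which uses only arguments documented above. The directory "path/to/bams" is a placeholder and the chosen bin size is an arbitrary example; the call is guarded so that it runs only when the package is installed and the directory exists.

```r
# Hypothetical inputs: adjust the directory and bin size to your data.
callArgs <- list(
  pathToBams = "path/to/bams",   # placeholder directory containing .bam files
  binSize    = 50,               # 50 kbp bins
  genome     = "hg19",
  outputType = "CNAclinicData"
)

# Run only when CNAclinic is installed and the placeholder directory exists.
if (requireNamespace("CNAclinic", quietly = TRUE) &&
    dir.exists(callArgs$pathToBams)) {
  processedData <- do.call(CNAclinic::processForSegmentation, callArgs)
}
```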


sdchandra/CNAclinic documentation built on Aug. 8, 2024, 4:08 p.m.