anomDetectBAF: BAF Method for Chromosome Anomaly Detection
In GWASTools: Tools for Genome Wide Association Studies

For each sample and chromosome, anomSegmentBAF breaks the chromosome into segments marked by change points of a metric based on B Allele Frequency (BAF) values.

anomFilterBAF selects segments which are likely to be anomalous.

anomDetectBAF is a wrapper to run anomSegmentBAF and anomFilterBAF in one step.

anomSegmentBAF(intenData, genoData, scan.ids, chrom.ids, snp.ids,
  smooth = 50, min.width = 5, nperm = 10000, alpha = 0.001,
  verbose = TRUE)

anomFilterBAF(intenData, genoData, segments, snp.ids, centromere,
  low.qual.ids = NULL, num.mark.thresh = 15, long.num.mark.thresh = 200,
  sd.reg = 2, sd.long = 1, low.frac.used = 0.1, run.size = 10,
  inter.size = 2, low.frac.used.num.mark = 30, very.low.frac.used = 0.01, 
  low.qual.frac.num.mark = 150, lrr.cut = -2, ct.thresh = 10,
  frac.thresh = 0.1, verbose=TRUE,
  small.thresh=2.5, dev.sim.thresh=0.1, centSpan.fac=1.25, centSpan.nmark=50)

anomDetectBAF(intenData, genoData, scan.ids, chrom.ids, snp.ids,
  centromere, low.qual.ids = NULL, ...)

`intenData`	An `IntensityData` object containing the B Allele Frequency. The order of the rows of intenData and the snp annotation are expected to be by chromosome and then by position within chromosome. The scan annotation should contain sex, coded as "M" for male and "F" for female.
`genoData`	A `GenotypeData` object. The order of the rows of genoData and the snp annotation are expected to be by chromosome and then by position within chromosome.
`scan.ids`	vector of scan ids (sample numbers) to process
`chrom.ids`	vector of (unique) chromosomes to process. Should correspond to integer chromosome codes in `intenData`. Recommended to include all autosomes, and optionally X (males will be ignored) and the pseudoautosomal (XY) region.
`snp.ids`	vector of eligible snp ids. Usually exclude failed and intensity-only SNPs. Also recommended to exclude an HLA region on chromosome 6 and XTR region on X chromosome. See `HLA` and `pseudoautosomal`. If there are SNPs annotated in the centromere gap, exclude these as well (see `centromeres`).
`smooth`	number of markers for smoothing region. See `smooth.CNA` in the DNAcopy package.
`min.width`	minimum number of markers for a segment. See `segment` in the DNAcopy package.
`nperm`	number of permutations for deciding significance in segmentation. See `segment` in the DNAcopy package.
`alpha`	significance level. See `segment` in the DNAcopy package.
`verbose`	logical indicator of whether to print information about the scan id currently being processed. anomSegmentBAF prints each scan id; anomFilterBAF prints a message after every 10 samples: "processing ith scan id out of n", where "ith" will be 10, 20, etc. and "n" is the total number of samples.
`segments`	data.frame of segments from `anomSegmentBAF`. Names must include "scanID", "chromosome", "num.mark", "left.index", "right.index", "seg.mean". Here "left.index" and "right.index" are row indices of intenData. Left and right refer to the start and end of the anomaly, respectively, in position order.
`centromere`	data.frame with centromere position information. Names must include "chrom", "left.base", "right.base". Valid values for "chrom" are 1:22, "X", "Y", "XY". Here "left.base" and "right.base" are base positions of start and end of centromere location in position order. Centromere data tables are provided in `centromeres`.
`low.qual.ids`	scan ids determined to be low quality, for which some segments are filtered based on more stringent criteria. Default is NULL. The usual choice is the scan ids for which the median BAF standard deviation across autosomes is > 0.05. See `sdByScanChromWindow` and `medianSdOverAutosomes`.
`num.mark.thresh`	minimum number of SNP markers in a segment to be considered for anomaly
`long.num.mark.thresh`	min number of markers for "long" segment to be considered for anomaly for which significance threshold criterion is allowed to be less stringent
`sd.reg`	number of baseline standard deviations of segment mean from a baseline mean for "normal" needed to declare segment anomalous. This number is given by abs(mean of segment - baseline mean)/(baseline standard deviation)
`sd.long`	same meaning as `sd.reg` but applied to "long" segments
`low.frac.used`	if the fraction of heterozygous or missing SNP markers compared with the number of eligible SNP markers in a segment is below this, more stringent criteria are applied to declare the segment anomalous.
`run.size`	min length of run of missing or heterozygous SNP markers for possible determination of homozygous deletions
`inter.size`	number of homozygotes allowed to "interrupt" run for possible determination of homozygous deletions
`low.frac.used.num.mark`	number of markers threshold for `low.frac.used` segments (which are not declared homozygous deletions)
`very.low.frac.used`	any segments with (num.mark)/(number of markers in interval) less than this are filtered out since they tend to be false positives
`low.qual.frac.num.mark`	minimum num.mark threshold for low quality scans (`low.qual.ids`) for segments that are also below low.frac.used threshold
`lrr.cut`	look for runs of LRR values below `lrr.cut` to adjust homozygous deletion endpoints
`ct.thresh`	minimum number of LRR values below `lrr.cut` needed in order to adjust
`frac.thresh`	investigate interval for homozygous deletion only if `lrr.cut` and `ct.thresh` thresholds met and (# LRR values below `lrr.cut`)/(# eligible SNPs in segment) > `frac.thresh`
`small.thresh`	sd.fac threshold used in making merge decisions involving small num.mark segments
`dev.sim.thresh`	relative error threshold for determining similarity in BAF deviations; used in merge decisions
`centSpan.fac`	thresholds increased by this factor when considering filtering/keeping together left and right halves of centromere spanning segments
`centSpan.nmark`	minimum number of markers under which centromere spanning segments are automatically filtered out
`...`	arguments to pass to `anomFilterBAF`
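
The `segments` and `centromere` arguments must carry specific column names, so it can help to check them before calling `anomFilterBAF`. The following is a minimal sketch: `missingColumns` is a hypothetical helper that is not part of GWASTools, and the toy centromere table uses placeholder positions rather than real genome coordinates.

```r
# Hypothetical helper (not part of GWASTools): return the required
# column names that are absent from a data.frame.
missingColumns <- function(df, required) {
  setdiff(required, names(df))
}

# Column names required by the documentation above
segment.cols <- c("scanID", "chromosome", "num.mark",
                  "left.index", "right.index", "seg.mean")
centromere.cols <- c("chrom", "left.base", "right.base")

# Made-up table for illustration only; positions are placeholders
toy.centromere <- data.frame(chrom = c("1", "X"),
                             left.base = c(100L, 200L),
                             right.base = c(150L, 260L),
                             stringsAsFactors = FALSE)

missingColumns(toy.centromere, centromere.cols)  # character(0): nothing missing
missingColumns(toy.centromere, segment.cols)     # all six segment columns
```

In practice the centromere tables shipped in `centromeres` already have the required names, so a check like this is mainly useful for hand-built tables or edited segment data.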

anomSegmentBAF uses the function segment from the DNAcopy package to perform circular binary segmentation on a metric based on BAF values. The metric for a given sample/chromosome is sqrt(min(BAF, 1-BAF, abs(BAF-median(BAF)))) where the median is across BAF values on the chromosome. Only BAF values for heterozygous or missing SNPs are used.
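
To make the metric concrete, here is a minimal sketch that evaluates it on a small vector of BAF values. `bafMetric` is a hypothetical helper, not the GWASTools implementation. It assumes the minimum is taken per SNP (hence `pmin`) and that the median is computed over the values supplied, which stand in for one sample's BAF values on one chromosome.

```r
# Hypothetical helper (not the GWASTools code): per-SNP BAF metric
# sqrt(min(BAF, 1 - BAF, abs(BAF - median(BAF))))
bafMetric <- function(baf) {
  med <- median(baf, na.rm = TRUE)          # median over the supplied values
  sqrt(pmin(baf, 1 - baf, abs(baf - med)))  # element-wise minimum, then sqrt
}

# Toy BAF values for heterozygous SNPs: three near 0.5, two shifted
baf <- c(0.48, 0.50, 0.52, 0.35, 0.65)
metric <- bafMetric(baf)
round(metric, 3)
```

Values close to the median give a metric near zero, while values that split away from 0.5 toward 1/3 and 2/3 give a larger metric, which is what the segmentation picks up as a change point.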

anomFilterBAF determines anomalous segments based on a combination of thresholds for number of SNP markers in the segment and on deviation from a "normal" baseline. (See num.mark.thresh, long.num.mark.thresh, sd.reg, and sd.long.) The "normal" baseline metric mean and standard deviation are found across all autosomes not segmented by anomSegmentBAF. This is why it is recommended to include all autosomes for the argument chrom.ids to ensure a more accurate baseline.
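
The interplay of the marker-count and deviation thresholds can be sketched as follows. This is a simplified illustration only: `flagSegments` is a hypothetical function, the use of `>=` for every comparison is an assumption, and the real anomFilterBAF applies further criteria (merging, low-quality handling, homozygous deletion adjustment) not shown here.

```r
# Hypothetical, simplified sketch of the threshold logic; not the
# actual anomFilterBAF implementation.
flagSegments <- function(seg.mean, num.mark, base.mean, base.sd,
                         num.mark.thresh = 15, long.num.mark.thresh = 200,
                         sd.reg = 2, sd.long = 1) {
  # deviation from baseline in units of baseline standard deviations
  sd.fac <- abs(seg.mean - base.mean) / base.sd
  # regular segments need enough markers and a deviation of sd.reg
  regular <- num.mark >= num.mark.thresh & sd.fac >= sd.reg
  # long segments are allowed the less stringent sd.long
  long <- num.mark >= long.num.mark.thresh & sd.fac >= sd.long
  data.frame(sd.fac = sd.fac, flagged = regular | long)
}

# Toy segments: baseline-like, clearly shifted, long but mildly shifted,
# and strongly shifted but too short
res <- flagSegments(seg.mean = c(0.50, 0.62, 0.56, 0.70),
                    num.mark = c(100, 50, 300, 10),
                    base.mean = 0.50, base.sd = 0.04)
res
```

With these toy inputs the second and third segments are flagged: the second passes the regular criterion, and the third passes only because its length qualifies it for the `sd.long` threshold.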

Some initial filtering is done, including possible merging of consecutive segments meeting the sd.reg threshold along with other criteria (such as not spanning the centromere) and adjustment for accurate break points for possible homozygous deletions (see lrr.cut, ct.thresh, frac.thresh, run.size, and inter.size). The X chromosome is not processed for male samples.
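
As an illustration of how lrr.cut, ct.thresh, and frac.thresh combine, the sketch below checks whether a segment would be investigated for a homozygous deletion. `isHomoDelCandidate` is a hypothetical function, not GWASTools code. It assumes the input holds one LRR value per eligible SNP in the segment and that the count comparison is `>=`.

```r
# Hypothetical, simplified sketch; not the GWASTools implementation.
# lrr: LRR values, one per eligible SNP in the segment (assumption)
isHomoDelCandidate <- function(lrr, lrr.cut = -2, ct.thresh = 10,
                               frac.thresh = 0.1) {
  n.low <- sum(lrr < lrr.cut, na.rm = TRUE)  # count of very low LRR values
  n.low >= ct.thresh && n.low / length(lrr) > frac.thresh
}

dense  <- c(rep(-3, 12), rep(0, 38))   # 12 low values out of 50 SNPs
sparse <- c(rep(-3, 12), rep(0, 188))  # 12 low values out of 200 SNPs
few    <- c(rep(-3, 5), rep(0, 15))    # only 5 low values

isHomoDelCandidate(dense)   # TRUE: count and fraction both pass
isHomoDelCandidate(sparse)  # FALSE: fraction 0.06 is not above 0.1
isHomoDelCandidate(few)     # FALSE: fewer than ct.thresh low values
```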

More stringent criteria are applied to some segments (see low.frac.used, low.frac.used.num.mark, very.low.frac.used, low.qual.ids, and low.qual.frac.num.mark).

anomDetectBAF runs anomSegmentBAF with default values and then runs anomFilterBAF. Additional parameters for anomFilterBAF may be passed as arguments.

anomSegmentBAF returns a data.frame with the following elements. Left and right refer to the start and end of the anomaly, respectively, in position order.

`scanID`	integer id of scan
`chromosome`	chromosome as integer code
`left.index`	row index of intenData indicating left endpoint of segment
`right.index`	row index of intenData indicating right endpoint of segment
`num.mark`	number of heterozygous or missing SNPs in the segment
`seg.mean`	mean of the BAF metric over the segment

anomFilterBAF and anomDetectBAF return a list with the following elements:

`raw`	data.frame of raw segmentation data, with the same columns as the output of `anomSegmentBAF` as well as:
  `left.base`: base position of left endpoint of segment
  `right.base`: base position of right endpoint of segment
  `sex`: sex of scan.id coded as "M" or "F"
  `sd.fac`: measure of deviation from baseline equal to abs(mean of segment - baseline mean)/(baseline standard deviation); used in determining anomalous segments
`filtered`	data.frame of the segments identified as anomalies, with the same columns as `raw` as well as:
  `merge`: TRUE if segment was a result of merging. Consecutive segments from output of `anomSegmentBAF` that meet certain criteria are merged.
  `homodel.adjust`: TRUE if original segment was adjusted to narrow in on a homozygous deletion
  `frac.used`: fraction of (eligible) heterozygous or missing SNP markers compared with total number of eligible SNP markers in segment
`base.info`	data frame with columns:
  `scanID`: integer id of scan
  `base.mean`: mean of non-anomalous baseline. This is the mean of the BAF metric for heterozygous and missing SNPs over all unsegmented autosomes that were considered.
  `base.sd`: standard deviation of non-anomalous baseline
  `chr.ct`: number of unsegmented chromosomes used in determining the non-anomalous baseline
`seg.info`	data frame with columns:
  `scanID`: integer id of scan
  `chromosome`: chromosome as integer
  `num.segs`: number of segments produced by `anomSegmentBAF`

It is recommended to include all autosomes as input. This ensures a more accurate determination of baseline information.

Cecelia Laurie

See references in segment in the package DNAcopy. The BAF metric used is modified from Itsara, A., et al. (2009) Population Analysis of Large Copy Number Variants and Hotspots of Human Genetic Disease. American Journal of Human Genetics, 84, 148–161.

segment and smooth.CNA in the package DNAcopy; also findBAFvariance and anomDetectLOH

library(GWASdata)
data(illuminaScanADF, illuminaSnpADF)

# intensity data containing BAF, with scan and SNP annotation attached
blfile <- system.file("extdata", "illumina_bl.gds", package="GWASdata")
bl <- GdsIntensityReader(blfile)
blData <- IntensityData(bl, scanAnnot=illuminaScanADF, snpAnnot=illuminaSnpADF)

# genotype data with the same annotation
genofile <- system.file("extdata", "illumina_geno.gds", package="GWASdata")
geno <- GdsGenotypeReader(genofile)
genoData <- GenotypeData(geno, scanAnnot=illuminaScanADF, snpAnnot=illuminaSnpADF)

# segment BAF
scan.ids <- illuminaScanADF$scanID[1:2]
chrom.ids <- unique(illuminaSnpADF$chromosome)
snp.ids <- illuminaSnpADF$snpID[illuminaSnpADF$missing.n1 < 1]
seg <- anomSegmentBAF(blData, genoData, scan.ids=scan.ids,
                      chrom.ids=chrom.ids, snp.ids=snp.ids)

# filter segments to detect anomalies
data(centromeres.hg18)
filt <- anomFilterBAF(blData, genoData, segments=seg, snp.ids=snp.ids,
                      centromere=centromeres.hg18)

# alternatively, run both steps at once
anom <- anomDetectBAF(blData, genoData, scan.ids=scan.ids, chrom.ids=chrom.ids,
                      snp.ids=snp.ids, centromere=centromeres.hg18)

close(blData)
close(genoData)