anomIdentifyLowQuality: Identify low quality samples

View source: R/anomIdentifyLowQuality.R

anomIdentifyLowQualityR Documentation

Identify low quality samples

Description

Identify low quality samples for which false positive rate for anomaly detection is likely to be high. Measures of noise (high variance) and high segmentation are used.

Usage

anomIdentifyLowQuality(snp.annot, med.sd, seg.info,
  sd.thresh, sng.seg.thresh, auto.seg.thresh)

Arguments

snp.annot

SnpAnnotationDataFrame with column "eligible", where "eligible" is a logical vector indicating whether a SNP is eligible for consideration in anomaly detection (usually FALSE for HLA and XTR regions, failed SNPs, and intensity-only SNPs). See HLA and pseudoautosomal.

med.sd

data.frame of median standard deviation of BAlleleFrequency (BAF) or LogRRatio (LRR) values across autosomes for each scan, with columns "scanID" and "med.sd". Usually the result of medianSdOverAutosomes. Usually only eligible SNPs are used in these computations. In addition, for BAF, homozygous SNPS are excluded.

seg.info

data.frame with segmentation information from anomDetectBAF or anomDetectLOH. Columns must include "scanID", "chromosome", and "num.segs". (For anomDetectBAF, segmentation information is found in $seg.info from output. For anomDetectLOH, segmentation information is found in $base.info from output.)

sd.thresh

Threshold for med.sd above which scan is identified as low quality. Suggested values are 0.1 for BAF and 0.25 for LOH.

sng.seg.thresh

Threshold for segmentation factor for a given chromosome, above which the chromosome is said to be highly segmented. See Details. Suggested values are 0.0008 for BAF and 0.0048 for LOH.

auto.seg.thresh

Threshold for segmentation factor across autosome, above which the scan is said to be highly segmented. See Details. Suggested values are 0.0001 for BAF and 0.0006 for LOH.

Details

Low quality samples are determined separately with regard to each of the two methods of segmentation, anomDetectBAF and anomDetectLOH. BAF anomalies (respectively LOH anomalies) found for samples identified as low quality for BAF (respectively LOH) tend to have a high false positive rate.

A scan is identified as low quality due to high variance (noise), i.e. if med.sd is above a certain threshold sd.thresh.

High segmentation is often an indication of artifactual patterns in the B Allele Frequency (BAF) or Log R Ratio values (LRR) that are not always captured by high variance. Here segmentation information is determined by anomDetectBAF or anomDetectLOH which use circular binary segmentation implemented by the R-package DNAcopy. The measure for high segmentation is a "segmentation factor" = (number of segments)/(number of eligible SNPS). A single chromosome segmentation factor uses information for one chromosome. A segmentation factor across autosomes uses the total number of segments and eligible SNPs across all autosomes. See med.sd, sd.thresh, sng.seg.thresh, and auto.seg.thresh.

Value

A data.frame with the following columns:

scanID

integer id for the scan

chrX.num.segs

number of segments for chromosome X

chrX.fac

segmentation factor for chromosome X

max.autosome

autosome with highest single segmentation factor

max.auto.fac

segmentation factor for chromosome = max.autosome

max.auto.num.segs

number of segments for chromosome = max.autosome

num.ch.segd

number of chromosomes segmented, i.e. for which change points were found

fac.all.auto

segmentation factor across all autosomes

med.sd

median standard deviation of BAF (or LRR values) across autosomes. See med.sd in Arguments section.

type

one of the following, indicating reason for identification as low quality:

  • auto.seg: segmentation factor fac.all.auto above auto.seg.thresh but med.sd acceptable

  • sd: standard deviation factor med.sd above sd.thresh but fac.all.auto acceptable

  • both.sd.seg: both high variance and high segmentation factors, fac.all.auto and med.sd, are above respective thresholds

  • sng.seg: segmentation factor max.auto.fac is above sng.seg.thresh but other measures acceptable

  • sng.seg.X: segmentation factor chrX.fac is above sng.seg.thresh but other measures acceptable

Author(s)

Cecelia Laurie

See Also

findBAFvariance, anomDetectBAF, anomDetectLOH

Examples

library(GWASdata)
data(illuminaScanADF, illuminaSnpADF)

blfile <- system.file("extdata", "illumina_bl.gds", package="GWASdata")
bl <- GdsIntensityReader(blfile)
blData <-  IntensityData(bl, scanAnnot=illuminaScanADF, snpAnnot=illuminaSnpADF)

genofile <- system.file("extdata", "illumina_geno.gds", package="GWASdata")
geno <- GdsGenotypeReader(genofile)
genoData <-  GenotypeData(geno, scanAnnot=illuminaScanADF, snpAnnot=illuminaSnpADF)

# initial scan for low quality with median SD
baf.sd <- sdByScanChromWindow(blData, genoData)
med.baf.sd <- medianSdOverAutosomes(baf.sd)
low.qual.ids <- med.baf.sd$scanID[med.baf.sd$med.sd > 0.05]

# segment and filter BAF
scan.ids <- illuminaScanADF$scanID[1:2]
chrom.ids <- unique(illuminaSnpADF$chromosome)
snp.ids <- illuminaSnpADF$snpID[illuminaSnpADF$missing.n1 < 1]
data(centromeres.hg18)
anom <- anomDetectBAF(blData, genoData, scan.ids=scan.ids, chrom.ids=chrom.ids,
  snp.ids=snp.ids, centromere=centromeres.hg18, low.qual.ids=low.qual.ids)

# further screen for low quality scans
snp.annot <- illuminaSnpADF
snp.annot$eligible <- snp.annot$missing.n1 < 1
low.qual <- anomIdentifyLowQuality(snp.annot, med.baf.sd, anom$seg.info,
  sd.thresh=0.1, sng.seg.thresh=0.0008, auto.seg.thresh=0.0001)

close(blData)
close(genoData)

smgogarten/GWASTools documentation built on Nov. 10, 2024, 9:54 p.m.