duplicateDiscordance: Duplicate discordance
In smgogarten/SeqVarTools: Tools for variant data

duplicateDiscordance

R Documentation

Duplicate discordance

Description

Find discordance rate for duplicate sample pairs

Usage


## S4 method for signature 'SeqVarData,missing'
duplicateDiscordance(gdsobj, match.samples.on="subject.id", by.variant=FALSE,
    all.pairs=TRUE, verbose=TRUE)
## S4 method for signature 'SeqVarIterator,missing'
duplicateDiscordance(gdsobj, match.samples.on="subject.id", by.variant=FALSE,
    all.pairs=TRUE, verbose=TRUE)
## S4 method for signature 'SeqVarData,SeqVarData'
duplicateDiscordance(gdsobj, obj2, match.samples.on=c("subject.id", "subject.id"),
    match.variants.on=c("alleles", "position"),
    discordance.type=c("genotype", "hethom"),
    by.variant=FALSE, verbose=TRUE)

Arguments

`gdsobj`	A `SeqVarData` or `SeqVarIterator` object with VCF data.
`obj2`	A `SeqVarData` object with VCF data.
`match.samples.on`	Character string or vector of strings indicating which column should be used for matching samples. See details.
`match.variants.on`	Character string of length one indicating how to match variants. See details.
`discordance.type`	Character string describing how discordances should be calculated. See details.
`by.variant`	Logical for whether to calculate discordance by variant (`by.variant=TRUE`) or by sample (`by.variant=FALSE`).
`all.pairs`	Logical for whether to include all possible pairs of samples (`all.pairs=TRUE`) or only the first pair per subject (`all.pairs=FALSE`).
`verbose`	A logical indicating whether to print progress messages.

Details

For calls that involve only one gds file, duplicate discordance is calculated by matching samples on common values of a column in sampleData. If all.pairs=TRUE, every possible pair of samples is included, so there may be multiple pairs per subject. If all.pairs=FALSE, only the first pair for each subject is used.
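The effect of all.pairs can be sketched in base R. This is only an illustration of the pairing logic described above, not the package's implementation: the sample table and the helper enumerate_pairs are hypothetical, and "first pair" is assumed to mean the first pair in table order.

```r
# Hypothetical sample table: "subj1" has three samples, "subj2" has two
sample_data <- data.frame(
  sample.id  = c("s1", "s2", "s3", "s4", "s5"),
  subject.id = c("subj1", "subj1", "subj1", "subj2", "subj2"),
  stringsAsFactors = FALSE
)

# Enumerate sample pairs within each subject (illustrative helper only)
enumerate_pairs <- function(dat, all.pairs = TRUE) {
  by_subject <- split(dat$sample.id, dat$subject.id)
  # only subjects with at least two samples can form a pair
  by_subject <- by_subject[lengths(by_subject) > 1]
  pairs <- lapply(names(by_subject), function(subj) {
    p <- t(combn(by_subject[[subj]], 2))        # one row per pair
    if (!all.pairs) p <- p[1, , drop = FALSE]   # keep only the first pair
    data.frame(subject.id = subj,
               sample.id.1 = p[, 1],
               sample.id.2 = p[, 2],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pairs)
}

pairs_all   <- enumerate_pairs(sample_data, all.pairs = TRUE)
pairs_first <- enumerate_pairs(sample_data, all.pairs = FALSE)
```

With three samples for one subject, all.pairs=TRUE yields three pairs for that subject, while all.pairs=FALSE keeps one pair per subject.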

For calls that involve two gds files, duplicate discordance is calculated by matching sample pairs and variants between the two data sets. Only biallelic SNVs are considered in the comparison. Variants can be matched using chromosome and position only (match.variants.on="position") or by using chromosome, position, and alleles (match.variants.on="alleles"). If matching on alleles and the reference allele in the first dataset is the alternate allele in the second dataset, the genotype dosage will be recoded so the same allele is counted before making the comparison. If a variant in one dataset maps to multiple variants in the other dataset, only the first pair is considered for the comparison. Discordances can be calculated using either genotypes (discordance.type = "genotype") or heterozygote/homozygote status (discordance.type = "hethom"). The latter is a method to calculate discordance that does not require alleles to be measured on the same strand in both datasets, so it is probably best to also set match.variants.on = "position" if using the "hethom" option.
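A small base R sketch may help show why dosage recoding matters for discordance.type="genotype" and why "hethom" is insensitive to it. The dosage vectors and the swapped flag below are invented for illustration, and the comparison logic is one plausible reading of the description above rather than the package's actual code.

```r
# Toy alternate-allele dosages (0, 1, 2; NA = missing) for one sample pair
# across five matched variants. Values are invented for illustration.
dos1 <- c(0, 1, 2, 1, NA)
dos2 <- c(2, 1, 0, 0, 1)

# Hypothetical flag: TRUE where the reference allele in dataset 1
# is the alternate allele in dataset 2
swapped <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

# Recode dataset 2 so both datasets count the same allele
dos2_recoded <- ifelse(swapped, 2 - dos2, dos2)

# Compare only variants that are non-missing in both samples
ok <- !is.na(dos1) & !is.na(dos2_recoded)
n_compared <- sum(ok)

# "genotype": dosages must agree after recoding
n_disc_genotype <- sum(dos1[ok] != dos2_recoded[ok])

# Without recoding, swapped alleles would be counted as mismatches
n_disc_unrecoded <- sum(dos1[ok] != dos2[ok])

# "hethom": only heterozygote vs. homozygote status must agree,
# which does not depend on which allele is counted
n_disc_hethom <- sum((dos1 == 1)[ok] != (dos2 == 1)[ok])
```

In this toy example the unrecoded comparison reports three mismatches, while both the recoded genotype comparison and the het/hom comparison report one.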

The argument match.samples.on can be used to select which column in the sampleData of the input SeqVarData object should be used for matching samples. For one gds file, match.samples.on should be a single string. For two gds files, match.samples.on should be a length-2 vector of character strings, where the first element is the column to use for the first gds object and the second element is the column to use for the second gds file.
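The length-2 form of match.samples.on can be pictured as a join between the two sample tables. The sketch below uses base R merge on two hypothetical sample tables; the column name subjectID in the second table is an assumption chosen to show that the two columns need not share a name, and the package may match samples differently internally.

```r
# Hypothetical sample annotation for two datasets
samples1 <- data.frame(sample.id  = c("a1", "a2", "a3"),
                       subject.id = c("subj1", "subj2", "subj3"),
                       stringsAsFactors = FALSE)
samples2 <- data.frame(sample.id = c("b1", "b2"),
                       subjectID = c("subj2", "subj3"),
                       stringsAsFactors = FALSE)

# First element: column in the first dataset; second element: column in the second
match.samples.on <- c("subject.id", "subjectID")

# Keep only subjects present in both datasets
matched <- merge(samples1, samples2,
                 by.x = match.samples.on[1],
                 by.y = match.samples.on[2],
                 suffixes = c(".1", ".2"))
```

Here only subj2 and subj3 appear in both tables, so two sample pairs would be compared.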

To exclude certain variants or samples from the calculation, use seqSetFilter to set appropriate filters on each gds object.

Value

A data frame with the following columns, depending on whether by.variant=TRUE or FALSE:

`subject.id`	currently, this is the sample ID (`by.variant=FALSE` only)
`sample.id.1/variant.id.1`	sample id or variant id in the first gds file
`sample.id.2/variant.id.2`	sample id or variant id in the second gds file
`n.variants/n.samples`	the number of non-missing variants or samples that were compared
`n.concordant`	the number of concordant variants
`n.alt`	the number of variants involving the alternate allele in either sample
`n.alt.conc`	the number of concordant variants involving the alternate allele in either sample
`n.het.ref`	the number of mismatches where one call is a heterozygote and the other is a reference homozygote
`n.het.alt`	the number of mismatches where one call is a heterozygote and the other is an alternate homozygote
`n.ref.alt`	the number of mismatches where the calls are opposite homozygotes
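
The returned columns are counts, so a rate has to be derived from them. A minimal sketch, assuming the discordance rate is one minus the fraction of concordant calls; the data frame below is a made-up stand-in for by.variant=FALSE output, and the derived column names discordance and alt.discordance are hypothetical.

```r
# Made-up counts standing in for duplicateDiscordance output (by.variant=FALSE)
disc <- data.frame(subject.id   = c("subj1", "subj2"),
                   n.variants   = c(1000, 950),
                   n.concordant = c(990, 940),
                   n.alt        = c(300, 280),
                   n.alt.conc   = c(294, 273),
                   stringsAsFactors = FALSE)

# Overall discordance: fraction of compared variants that are not concordant
disc$discordance <- 1 - disc$n.concordant / disc$n.variants

# Discordance restricted to variants involving the alternate allele
disc$alt.discordance <- 1 - disc$n.alt.conc / disc$n.alt
```

Restricting to variants involving the alternate allele avoids a rate dominated by sites where both samples are homozygous reference.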

Author(s)

Stephanie Gogarten, Adrienne Stilp

Examples

library(SeqVarTools)
library(Biobase)  # provides AnnotatedDataFrame

gds <- seqOpen(seqExampleFileName("gds"))

## the example file has one sample per subject, but we
## will match the first four samples into pairs as an example
sample.id <- seqGetData(gds, "sample.id")
samples <- AnnotatedDataFrame(data.frame(subject.id=rep(c("subj1", "subj2"), times=45),
                      sample.id=sample.id,
                      stringsAsFactors=FALSE))
seqData <- SeqVarData(gds, sampleData=samples)

## set a filter on the first four samples
seqSetFilter(seqData, sample.id=sample.id[1:4])

disc <- duplicateDiscordance(seqData, by.variant=FALSE)
disc
disc <- duplicateDiscordance(seqData, by.variant=TRUE)
head(disc)

## recommended to use an iterator object for large datasets
iterator <- SeqVarBlockIterator(seqData)
disc <- duplicateDiscordance(iterator, by.variant=FALSE)
disc

seqClose(gds)

smgogarten/SeqVarTools documentation built on Sept. 15, 2024, 1:08 p.m.