compareProbes: Compare and combine results from two probes for the same SNP
In fitPoly: Genotype Calling for Bi-Allelic Marker Assays

compareProbes

R Documentation

Compare and combine results from two probes for the same SNP

Description

On Affymetrix Axiom arrays it is possible to have two probes interrogating the same SNP position. This function compares the dosage scores and checkF1 results of the two probes; if they are sufficiently similar, a new marker is generated that combines the results of the two probes. Two files are written: a dosage file with the data for the separate probes as well as the combined markers, in the same format as written by writeDosagefile, and a file summarizing the comparison results.

Usage

compareProbes(chk, scores,
probe.suffix=c("P","Q","R"), fracdiff.threshold=0.04,
parent1, parent2, F1, ancestors=character(0), other=character(0),
polysomic=TRUE, disomic=FALSE, mixed=FALSE,
ploidy, ploidy2, qall_flavor="qall_mult", shiftParents,
compfile, combscorefile)

Arguments

`chk`	data frame as returned by checkF1, or a subset with at least columns markername, parent1, parent2 (the consensus parental genotypes), the columns for the samples specified by parameters parent1, parent2 and ancestors, and bestParentfit, and containing only rows with selected markers. If a column with a name as specified by qall_flavor (see below) is present, it will be written to file compfile, but it is not used: any selection of markers based on qall (or other criteria) must have been made beforehand, and the rows for the unwanted markers must have been deleted from the chk data frame. For each marker*probe combination there may be an unshifted version (shift==0), a shifted one (shift!=0), both, or neither. If a column shift is present it will be used to shift the dosages (and their P-values with them). If some markernames end in "_shf" this part will be ignored, but the P and Q suffixes (or alternatives as specified by probe.suffix) are required to distinguish the two probes.
`scores`	data frame as read from the scores file produced by function fitMarkers of package fitPoly, with at least columns MarkerName, SampleName, P0 .. P<ploidyF1> and geno (where <ploidyF1> is the ploidy of the F1, i.e. the average of the parental ploidies ploidy and ploidy2). If the F1 parents are scored separately, their rows should be added to the scores data.frame for the F1 samples. If their ploidy is different from that of the F1, the number of their P columns must be adjusted. The P data of the parents are not used; they may all be set to NA.
`probe.suffix`	a 3-item character vector specifying the suffixes of the marker names that distinguish the two probes. The first two items identify the two probes; the third item is used to indicate a new marker combining the data from both probes. The three items must be different and have the same number of characters. The default is c("P","Q","R").
`fracdiff.threshold`	if more than this fraction of F1 scores differs between the two probes, they are not combined; default 0.04
`parent1`	character vector with the sample names of parent 1
`parent2`	character vector with the sample names of parent 2
`F1`	character vector with the sample names of the F1 individuals
`ancestors`	character vector with the sample names of any other ancestors
`other`	other samples that should be treated like the F1
`polysomic`	TRUE or FALSE; should be the same as used by checkF1 to calculate the chk data frame
`disomic`	TRUE or FALSE; should be the same as used by checkF1 to calculate the chk data frame
`mixed`	TRUE or FALSE; should be the same as used by checkF1 to calculate the chk data frame
`ploidy`	the ploidy of parent 1 (must be even, 2 (diploid) or larger), and the same as used by checkF1 to calculate the chk data frame
`ploidy2`	the ploidy of parent 2. If omitted it is assumed to be equal to ploidy. Should be the same as used by checkF1 to calculate the chk data frame
`qall_flavor`	which quality parameter column must be shown in compfile, default "qall_mult". If no quality data are wanted, specify "".
`shiftParents`	if there is a column shift in chk the F1 dosages will be shifted. In that case, if shiftParents is TRUE the parents and ancestors will be shifted together with the F1; if FALSE only the F1 will be shifted. If shiftParents is missing or NA it will be set to TRUE, except if ploidy2 != ploidy: that case results in an error (because the parents may not have been genotyped or scored together with the F1, the user should specify explicitly what to do).
`compfile`	filename for tab-separated text file summarizing the comparison results; if NA no file is written. For details of the contents see the return value, component compstat
`combscorefile`	filename for tab-separated text file with the dosages; if NA no file is written. For details of the contents see the return value, component combscores
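
The conventions described for chk, scores, probe.suffix and shiftParents can be made concrete with a small base R sketch. This is not the package's own code: the helper names split_marker_name, f1_p_columns and resolve_shift_parents are hypothetical, and the marker names used are made up. It only mirrors the rules stated above (a trailing "_shf" is ignored, the suffix identifies the probe, the F1 ploidy is the average of ploidy and ploidy2, and a missing shiftParents defaults to TRUE only when both parental ploidies are equal).

```r
# Hypothetical helpers illustrating the argument conventions; not part of fitPoly.

# Split marker names into SNP name and probe suffix, ignoring a trailing "_shf".
split_marker_name <- function(markername, probe.suffix = c("P", "Q", "R")) {
  stopifnot(is.character(probe.suffix), length(probe.suffix) == 3)
  if (anyDuplicated(probe.suffix) > 0)
    stop("the three suffixes must be different")
  if (length(unique(nchar(probe.suffix))) != 1)
    stop("the three suffixes must have the same number of characters")
  base <- sub("_shf$", "", markername)       # drop a trailing "_shf"
  n <- nchar(probe.suffix[1])
  len <- nchar(base)
  suffix <- substr(base, len - n + 1, len)   # last n characters
  snp <- substr(base, 1, len - n)            # everything before the suffix
  probe <- match(suffix, probe.suffix[1:2])  # 1, 2, or NA if not a probe suffix
  data.frame(markername, snp, suffix, probe, stringsAsFactors = FALSE)
}

# Names of the P columns expected in scores: P0 .. P<ploidyF1>.
f1_p_columns <- function(ploidy, ploidy2 = ploidy) {
  ploidyF1 <- (ploidy + ploidy2) / 2
  paste0("P", 0:ploidyF1)
}

# Default handling of shiftParents as described in the argument list.
resolve_shift_parents <- function(shiftParents, ploidy, ploidy2 = ploidy) {
  if (missing(shiftParents) || is.na(shiftParents)) {
    if (ploidy2 != ploidy)
      stop("shiftParents must be specified explicitly when ploidy2 != ploidy")
    return(TRUE)
  }
  isTRUE(shiftParents)
}

split_marker_name(c("AX1P", "AX1Q_shf"))  # made-up marker names
f1_p_columns(ploidy = 4, ploidy2 = 2)
resolve_shift_parents(ploidy = 4)
```

For a tetraploid by diploid cross the sketch gives a triploid F1, so scores would need columns P0 to P3, and shiftParents would have to be given explicitly.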

Details

A combined marker is made in each case that a version of each of the two probe markers is present and they are sufficiently similar. This means that they have been assigned the same bestParentfit segregation type by checkF1, and that the frequency of conflicting scores over all samples is not more than fracdiff.threshold. The combined marker will have NA scores for individuals where both probe markers are missing, the one available score if it is scored for only one of the two probe markers or both scores are equal, and the score with the highest P-value if the scores for both probe markers are unequal.
Any single-probe markers in chk that do not have a bestParentfit segregation type are ignored and will not affect or appear in the output.
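
The combination rule above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the function's actual implementation: the names can_combine and combine_probe_scores are hypothetical, pP and pQ are assumed to hold the P-value of the assigned dosage for each sample, the segregation type strings are placeholders, and the fraction of conflicting scores is assumed to be taken over all samples passed in.

```r
# Hypothetical sketch of the rule in Details; not the package's own code.

# TRUE if the two probes have the same segregation type and the fraction of
# conflicting scores does not exceed fracdiff.threshold.
can_combine <- function(genoP, genoQ, segtypeP, segtypeQ,
                        fracdiff.threshold = 0.04) {
  conflict <- !is.na(genoP) & !is.na(genoQ) & genoP != genoQ
  fracdiff <- mean(conflict)  # assumption: denominator is all samples supplied
  !is.na(segtypeP) && !is.na(segtypeQ) && segtypeP == segtypeQ &&
    fracdiff <= fracdiff.threshold
}

# Combine the dosages of two probes, sample by sample.
combine_probe_scores <- function(genoP, genoQ, pP, pQ) {
  stopifnot(length(genoP) == length(genoQ),
            length(pP) == length(genoP), length(pQ) == length(genoQ))
  # Both missing -> NA; one missing -> the other; equal -> that score.
  combined <- ifelse(is.na(genoP), genoQ, genoP)
  # Unequal scores: take the one with the highest P-value.
  conflict <- !is.na(genoP) & !is.na(genoQ) & genoP != genoQ
  useQ <- which(conflict & pQ > pP)
  combined[useQ] <- genoQ[useQ]
  combined
}

# Made-up data for five samples
genoP <- c(0, 1, NA, 2, NA)
genoQ <- c(0, 2, 1, 2, NA)
pP <- c(0.90, 0.60, NA, 0.80, NA)
pQ <- c(0.95, 0.70, 0.90, 0.99, NA)
combine_probe_scores(genoP, genoQ, pP, pQ)
can_combine(genoP, genoQ, "typeA", "typeA")
```

In this toy data one of five samples conflicts, so with the default threshold of 0.04 the two probes would not be combined.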

Value

A list with two components, compstat and combscores.
compstat is a data frame with columns:

MarkerName: name of the SNP marker. If a column shift is present in data.frame chk, unshifted and shifted markers will get an "n" or "s" suffixed to the MarkerName
segtypeP and segtypeQ: the segtype assigned by checkF1 to the first and second probe
qallP and qallQ: the quality scores specified by parameter qall_flavor, assigned by checkF1 to the two probes
countP and countQ: the number of versions of each of the probes (0, 1, or 2, depending on whether neither, only one of the shifted and unshifted versions, or both versions were present)
countR: the number of combinations made of versions of the two probe markers (one for each combination of a version of each of the two probe markers, if they match well enough - see Details)

If the chk data frame contains a column shift, there are separate columns for the non-shifted and shifted P and Q probe markers (suffix Pn, Ps, Qn, Qs), and four columns for the R markers (suffix Rnn, Rns, Rsn, Rss, where the first n/s indicates whether the P probe was non-shifted or shifted and the second n/s does the same for the Q probe).

combscores is a data frame with columns:

MarkerName: the name of the marker. If the chk data frame contains a column shift, the P and Q marker names are suffixed with n or s, and the R marker names with nn, ns, sn, ss as described above
segtype: the segregation type
parental and ancestor samples: the dosages of those samples
parent1: the consensus dosage for parent1 as determined by checkF1
parent2: the consensus dosage for parent2 as determined by checkF1
F1 samples: the dosages for those samples
other samples: the dosages for those samples
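
The naming of the combined markers when chk contains a column shift can be illustrated with a short base R sketch. The function name combined_marker_names and the SNP name are hypothetical; the sketch only lists the candidate names that follow from the n/s convention described above, whereas the actual function creates a combined marker only for pairs of versions that match well enough.

```r
# Hypothetical sketch of the R marker naming; not part of fitPoly.
# versionsP, versionsQ: which versions of each probe are present,
# a subset of c("n", "s") (n = non-shifted, s = shifted).
combined_marker_names <- function(snp, versionsP, versionsQ,
                                  probe.suffix = c("P", "Q", "R")) {
  if (length(versionsP) == 0 || length(versionsQ) == 0)
    return(character(0))  # no combination without a version of each probe
  # The first letter refers to the P probe, the second to the Q probe.
  combos <- expand.grid(q = versionsQ, p = versionsP,
                        stringsAsFactors = FALSE)
  paste0(snp, probe.suffix[3], combos$p, combos$q)
}

combined_marker_names("AX1", c("n", "s"), c("n", "s"))
combined_marker_names("AX1", "n", c("n", "s"))
```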