callGenotypes: Clustering and calling of genotypes
In beadarrayMSV: Analysis of Illumina BeadArray SNP data including MSV markers

Description Usage Arguments Details Value Note Author(s) References See Also Examples

Clusters each marker into the most likely combination from a sequence of allowed cluster combinations. These may include multisite variants from polyploid genomes

setGenoOptions(largeSample = FALSE, snpPerArrayLim = 0.8,
    arrayPerSnpLim = 0, ploidy = "tetra",
    filterLim = 0, detectLim = 0.8,
    wSpreadLim = suggestGeno(largeSample)$wSpreadLim,
    devCentLim = 0.35,
    hwAlpha = suggestGeno(largeSample)$hwAlpha,
    probsIndSE = suggestGeno(largeSample)$probsIndSE,
    afList = seq(0, 0.5, 0.05),
    clAlpha = suggestGeno(largeSample)$clAlpha,
    rPenalty = 2,
    rotationLim = suggestGeno(largeSample)$rotationLim,
    minClLim = 5, nSdOverlap = 2,
    minBin = suggestGeno(largeSample)$minBin,
    binWidth = suggestGeno(largeSample)$binWidth) 

callGenotypes(BSRed,
    gO = setGenoOptions(largeSample = ncol(BSRed) > 250))

callGenotypes.interactive(BSRed,
    gO = setGenoOptions(largeSample = ncol(BSRed) > 250))

callGenotypes.verboseTest(BSRed, singleMarker = 1,
    gO = setGenoOptions(largeSample = ncol(BSRed) > 250))

`largeSample`	Logical indicating whether or not a large sample is genotyped, which influences the values of some default genotyping options. More than 2-300 arays may be considered large
`snpPerArrayLim`	Minimum ratio of non-NA markers per array. Failing arrays are disregarded
`arrayPerSnpLim`	Minimum ratio of non-NA arrays per marker. Failing markers are disregarded
`ploidy`	Character string indicating which genotyping classes to allow (see `generatePolyCenters`)
`filterLim`	Markers with range of `assayData` “theta” below this value are classified as “MONO-filt” and are not subjected to clustering. This option will speed up the analysis, however the calls and the statistics of classification will be missing for these markers
`detectLim`	Ratio of called arrays below which markers are classified as “FAIL”
`wSpreadLim`	The maximum allowed within cluster standard deviation along `assayData` “theta”
`devCentLim`	The maximum allowed difference in “theta” between expected and estimated cluster centres
`hwAlpha`	Significance level used in Hardy-Weinberg testing. Used to detect where the clustering fails, and a low value (e.g. 1e-10 - 1e-40) may be used to allow natural deviation from HW. This criterion should be used with caution, especially when the sample contains animals from different populations
`probsIndSE`	Numeric quantile for which markers with higher standard errors are excluded from the clustering step (however not discarded)
`afList`	Numeric vector of values within [0, 0.5], denoting which B allele frequencies to test in the case of two segregating paralogs (see below)
`clAlpha`	Numeric significance level for Hotelling's T^2 test (see below)
`rPenalty`	Scaling for the ordinate axis spanned by `assayData` “intensity” (see below). A high value means the influence of the intensity in the clustering is decreased
`rotationLim`	Controls the maximum allowed tilt of clusters, as defined by Hotelling's ellipses (see below)
`minClLim`	Clusters below this minimum size are disregarded in the Hotelling's test (all animals included in cluster)
`nSdOverlap`	Numeric multiple of `assayData` “theta” standard deviations defining within cluster spread
`minBin`	When estimating initial center points for the clustering, histogram peaks below this numeric value are set to zero
`binWidth`	Starting value for histogram bin-width used to find initial center points. Smaller clusters are detected with smaller bin-width
`BSRed`	`"AlleleSetIllumina"` object, holding a required `phenoData` column “noiseIntensity” (see `preprocessBeadSet`)
`gO`	List of options and cut-off limits used in the clustering (see `setGenoOptions`)
`singleMarker`	Index to a specific marker to test

Initially, arrays with a fraction of non-NA markers less than gO$snpPerArrayLim, and markers with a fraction of non-NA arrays less than gO$arrayPerSnpLim, are discarded from the analysis. Also, markers with range of assayData “theta” below gO$filterLim are called “MONO-filt” and are not subjected to clustering.

Genotype calling for each marker is based on k-means clustering in the two dimensions defined by assayData entries “intensity” and “theta”, where “intensity” is penalized by gO$rPenalty times its median value. The value of gO$ploidy sets the allowed cluster combinations through the function generatePolyCenters. Data points below the average value of the phenoData column “noiseIntensity” across included arrays are set to missing, and arrays with standard errors exceeding the quantile given by gO$probsIndSE are left out from the clustering step. Two or three of the most likely cluster combinations are estimated with the function getCenters. This function uses histograms to find starting values for the clustering, with input parameters gO$minBin defining a peak-height filter and gO$binWidth setting an initial bin-width. The ranked cluster combinations are tested in turn until one is found which passes the criteria below.

First, gO$devCentLim controls the maximum distance a cluster is allowed to deviate from its expected or ideal position. Second, gO$wSpreadLim limits how large the within cluster standard deviation can be. Both these criteria are tested in the “theta”-dimension, and if either fails, the algorithm will attempt to cluster the next most likely configuration and start over.

The next test is for Hardy-Weinberg (HW) equilibrium. This is a very powerful test to detect if the clustering has failed, given HW can be assumed. Otherwise, set the significance value gO$hwAlpha to zero to allow any deviation from HW. At duplicated loci, the observed B allele-frequency (BAF) is in fact the mean BAF across both paralogues. Several candidate values of BAF for one paralogue are set in gO$afList, such that the BAF for the other paralogue is given. Several values of gO$afList within [0, 0.5] are tested for HW, and the most likely BAF at both paralogs are those resulting in the highest p-value. Another, somewhat less powerful, quality control test is to look for overlapping clusters. If a cluster centre on the “theta”-axis is denoted Cent, and its spread is given by Within.SD, clusters are tested for overlap using Cent +/- gO$nSdOverlap * Within.SD

Markers passing the quality control are subjected to a Hotelling's T^2-test (Hotelling, 1931; Gidskehaug et al., 2007), which effectively superimposes an ellipse on each cluster and discards all data points falling outside its boundaries or within overlapping ellipses. Due to inaccurate ellipses for small clusters, only clusters with more than gO$minClLim members are tested. The extents of the clusters are controlled by the significance level gO$clAlpha of the T^2-tests in such a way that a small value yield large ellipses. The orientation of an ellipse is defined by the ratio of variances in the variance-/covariance matrix from the Hotelling's test. If the ratio of the “theta”-variance to the “intensity”-variance is given by Sratio, the weighted sum of Sratio across clusters is not allowed to exceed gO$rotationLim. Finally, the call rate is required to exceed gO$detectLim, or the algorithm will move on to the next most likely cluster combination. If a marker passes all the tests for one of the candidate cluster combinations, the genotype is called. Otherwise the marker is failed (i.e. called “FAIL”).

callGenotypes is the main genotype calling function. Before genotyping thousands of markers, it is important spend a little time tuning the options in gO. A few hundred markers may be called using the default settings, and plotted using plotGenotypes. Individual markers whith questionable calls can be investigated more closely using callGenotypes.verboseTest. This function performs all the tests described above, for all suggested cluster combinations, without failing any of the tests. Rather, all the test results are plotted and returned, revealing which steps can be taken to improve the clustering, if any. In these plots, green dots are the estimated centre points, and red and orange dots represent arrays outside cluster boundaries or within overlapping ellipses, respectively. Several markers representing each of the possible genotype categories should be investigated in this way before commencing with genotyping the full set of markers.

For some markers with highly overlapping or inaccurate clusters, interactive clustering might be the last option. This can be performed using the function callGenotypes.interactive (requires package rggobi to be installed). This function will loop through a smaller set of markers, allowing the user to define clusters manually. Pedigree checking is performed inside the function, and a phenoData column “PedigreeID” is required (see validateCallsPedigree). Use the function assignToAlleleSet to incorporate the manual results into a larger "AlleleSetIllumina" object.

Output from setGenoOptions is a list with all defined or suggested options needed for genotype calling

Output from callGenotypes is an "AlleleSetIllumina" object with an extra assayData entry:

call

Matrix with numeric values {0, 1/4, 1/2, 3/4, 1}, each indicating the ratio of B alleles for a marker in a sample

The "AlleleSetIllumina" output from callGenotypes.interactive in addition holds the assayData entries:

`ped.check`	Logical matrix with `FALSE` indicating pedigree violations among offspring. Parental values are all `NA` (see `validateCallsPedigree`)
`ped.check.parents`	Matrix of numeric values {0, 1, 2}, denoting no error, offspring error, and parent error, respectively (see `validateSingleCall`)

The following columns are added to featureData:

`Classification`	Genotype call from automatic clustering
`Cent.Deviation`	Largest distance from cluster-centre to its ideal position
`Within.SD`	Largest within-cluster spread
`HW.Chi2`	Chi-squared statistic from test of Hardy-Weinberg equilibrium
`HW.P`	Probability of Hardy-Weinberg equilibrium
`BAF.Locus1`	Estimated B-allele frequency
`BAF.Locus2`	Estimated BAF of second paralogue (if exists)
`Call.Rate`	Ratio of arrays assigned to clusters
`Manual.Calls.R`	Calls from interactive clustering (if applied)

Output from callGenotypes.verboseTest is a list containing

`call`	The calls from each attempted cluster-combination
`fData`	Statistics from each cluster-combination
`fMetadata`	Explanation to each statistic
`test`	Table of “PASS” or “FAIL” for each test

When pedigree data are available, pedigree checking is a very strong type of validation which is performed after the genotype calling in callGenotypes

In callGenotypes.interactive, pedigree checking is performed during genotype calling. This means the pedigree cannot subsequently be used as validation to the same degree as if it had been kept secret during genotype calling

Lars Gidskehaug

L. Gidskehaug, E. Anderssen, A. Flatberg, and B. K. Alsberg (2007) A framework for significance analysis of gene expression data using dimension reduction methods. BMC Bioinformatics 8:346

H. Hotelling (1931) The Generalization of Student's Ratio. Ann. Math. Stat. 2(3):360-378

generatePolyCenters, getCenters, findPolyploidClusters, testHardyWeinberg, ggobi, plotGenotypes, validateCallsPedigree, assignToAlleleSet, manualCall

## Not run: 
#Read 25 markers into an AlleleSetIllumina object
rPath <- system.file("extdata", package="beadarrayMSV")
normOpts <- setNormOptions()
dataFiles <- makeFilenames('testdata',normOpts,rPath)
beadFile <- paste(rPath,'beadData_testdata.txt',sep='/')
beadInfo <- read.table(beadFile,sep='\t',header=TRUE,as.is=TRUE)
BSRed <- createAlleleSetFromFiles(dataFiles[1:4],markers=1:25,beadInfo=beadInfo)

#Genotype calling and validation
gO <- setGenoOptions(largeSample=TRUE)
BSRed <- callGenotypes(BSRed, gO=gO)
BSRed <- validateCallsPedigree(BSRed)
sumClass <- tapply(rep(1,nrow(BSRed)),fData(BSRed)$Classification,sum)
print(sumClass)

#Plot default setting calls
plotGenotypes(BSRed)

#Tune settings to call an initially failed marker
dev.new()
verboseRes <- callGenotypes.verboseTest(BSRed, gO=gO, singleMarker=23)
print(verboseRes$fData)
print(verboseRes$test)
gO <- setGenoOptions(largeSample=TRUE, wSpreadLim=.1, hwAlpha=1e-50)
verboseRes <- callGenotypes.verboseTest(BSRed, gO=gO, singleMarker=23)

#New settings give (likely incorrect) SNP-call
BSRed <- callGenotypes(BSRed, gO=gO)
BSRed <- validateCallsPedigree(BSRed)
dev.new()
plotGenotypes(BSRed)

## End(Not run)