Clustering and calling of genotypes

Share:

Description

Clusters each marker into the most likely combination from a sequence of allowed cluster combinations. These may include multisite variants from polyploid genomes

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
setGenoOptions(largeSample = FALSE, snpPerArrayLim = 0.8,
    arrayPerSnpLim = 0, ploidy = "tetra",
    filterLim = 0, detectLim = 0.8,
    wSpreadLim = suggestGeno(largeSample)$wSpreadLim,
    devCentLim = 0.35,
    hwAlpha = suggestGeno(largeSample)$hwAlpha,
    probsIndSE = suggestGeno(largeSample)$probsIndSE,
    afList = seq(0, 0.5, 0.05),
    clAlpha = suggestGeno(largeSample)$clAlpha,
    rPenalty = 2,
    rotationLim = suggestGeno(largeSample)$rotationLim,
    minClLim = 5, nSdOverlap = 2,
    minBin = suggestGeno(largeSample)$minBin,
    binWidth = suggestGeno(largeSample)$binWidth) 

callGenotypes(BSRed,
    gO = setGenoOptions(largeSample = ncol(BSRed) > 250))

callGenotypes.interactive(BSRed,
    gO = setGenoOptions(largeSample = ncol(BSRed) > 250))

callGenotypes.verboseTest(BSRed, singleMarker = 1,
    gO = setGenoOptions(largeSample = ncol(BSRed) > 250))

Arguments

largeSample

Logical indicating whether or not a large sample is genotyped, which influences the values of some default genotyping options. More than 2-300 arays may be considered large

snpPerArrayLim

Minimum ratio of non-NA markers per array. Failing arrays are disregarded

arrayPerSnpLim

Minimum ratio of non-NA arrays per marker. Failing markers are disregarded

ploidy

Character string indicating which genotyping classes to allow (see generatePolyCenters)

filterLim

Markers with range of assayData “theta” below this value are classified as “MONO-filt” and are not subjected to clustering. This option will speed up the analysis, however the calls and the statistics of classification will be missing for these markers

detectLim

Ratio of called arrays below which markers are classified as “FAIL”

wSpreadLim

The maximum allowed within cluster standard deviation along assayData “theta”

devCentLim

The maximum allowed difference in “theta” between expected and estimated cluster centres

hwAlpha

Significance level used in Hardy-Weinberg testing. Used to detect where the clustering fails, and a low value (e.g. 1e-10 - 1e-40) may be used to allow natural deviation from HW. This criterion should be used with caution, especially when the sample contains animals from different populations

probsIndSE

Numeric quantile for which markers with higher standard errors are excluded from the clustering step (however not discarded)

afList

Numeric vector of values within [0, 0.5], denoting which B allele frequencies to test in the case of two segregating paralogs (see below)

clAlpha

Numeric significance level for Hotelling's T^2 test (see below)

rPenalty

Scaling for the ordinate axis spanned by assayData “intensity” (see below). A high value means the influence of the intensity in the clustering is decreased

rotationLim

Controls the maximum allowed tilt of clusters, as defined by Hotelling's ellipses (see below)

minClLim

Clusters below this minimum size are disregarded in the Hotelling's test (all animals included in cluster)

nSdOverlap

Numeric multiple of assayData “theta” standard deviations defining within cluster spread

minBin

When estimating initial center points for the clustering, histogram peaks below this numeric value are set to zero

binWidth

Starting value for histogram bin-width used to find initial center points. Smaller clusters are detected with smaller bin-width

BSRed

"AlleleSetIllumina" object, holding a required phenoData column “noiseIntensity” (see preprocessBeadSet)

gO

List of options and cut-off limits used in the clustering (see setGenoOptions)

singleMarker

Index to a specific marker to test

Details

Initially, arrays with a fraction of non-NA markers less than gO$snpPerArrayLim, and markers with a fraction of non-NA arrays less than gO$arrayPerSnpLim, are discarded from the analysis. Also, markers with range of assayData “theta” below gO$filterLim are called “MONO-filt” and are not subjected to clustering.

Genotype calling for each marker is based on k-means clustering in the two dimensions defined by assayData entries “intensity” and “theta”, where “intensity” is penalized by gO$rPenalty times its median value. The value of gO$ploidy sets the allowed cluster combinations through the function generatePolyCenters. Data points below the average value of the phenoData column “noiseIntensity” across included arrays are set to missing, and arrays with standard errors exceeding the quantile given by gO$probsIndSE are left out from the clustering step. Two or three of the most likely cluster combinations are estimated with the function getCenters. This function uses histograms to find starting values for the clustering, with input parameters gO$minBin defining a peak-height filter and gO$binWidth setting an initial bin-width. The ranked cluster combinations are tested in turn until one is found which passes the criteria below.

First, gO$devCentLim controls the maximum distance a cluster is allowed to deviate from its expected or ideal position. Second, gO$wSpreadLim limits how large the within cluster standard deviation can be. Both these criteria are tested in the “theta”-dimension, and if either fails, the algorithm will attempt to cluster the next most likely configuration and start over.

The next test is for Hardy-Weinberg (HW) equilibrium. This is a very powerful test to detect if the clustering has failed, given HW can be assumed. Otherwise, set the significance value gO$hwAlpha to zero to allow any deviation from HW. At duplicated loci, the observed B allele-frequency (BAF) is in fact the mean BAF across both paralogues. Several candidate values of BAF for one paralogue are set in gO$afList, such that the BAF for the other paralogue is given. Several values of gO$afList within [0, 0.5] are tested for HW, and the most likely BAF at both paralogs are those resulting in the highest p-value. Another, somewhat less powerful, quality control test is to look for overlapping clusters. If a cluster centre on the “theta”-axis is denoted Cent, and its spread is given by Within.SD, clusters are tested for overlap using Cent +/- gO$nSdOverlap * Within.SD

Markers passing the quality control are subjected to a Hotelling's T^2-test (Hotelling, 1931; Gidskehaug et al., 2007), which effectively superimposes an ellipse on each cluster and discards all data points falling outside its boundaries or within overlapping ellipses. Due to inaccurate ellipses for small clusters, only clusters with more than gO$minClLim members are tested. The extents of the clusters are controlled by the significance level gO$clAlpha of the T^2-tests in such a way that a small value yield large ellipses. The orientation of an ellipse is defined by the ratio of variances in the variance-/covariance matrix from the Hotelling's test. If the ratio of the “theta”-variance to the “intensity”-variance is given by Sratio, the weighted sum of Sratio across clusters is not allowed to exceed gO$rotationLim. Finally, the call rate is required to exceed gO$detectLim, or the algorithm will move on to the next most likely cluster combination. If a marker passes all the tests for one of the candidate cluster combinations, the genotype is called. Otherwise the marker is failed (i.e. called “FAIL”).

callGenotypes is the main genotype calling function. Before genotyping thousands of markers, it is important spend a little time tuning the options in gO. A few hundred markers may be called using the default settings, and plotted using plotGenotypes. Individual markers whith questionable calls can be investigated more closely using callGenotypes.verboseTest. This function performs all the tests described above, for all suggested cluster combinations, without failing any of the tests. Rather, all the test results are plotted and returned, revealing which steps can be taken to improve the clustering, if any. In these plots, green dots are the estimated centre points, and red and orange dots represent arrays outside cluster boundaries or within overlapping ellipses, respectively. Several markers representing each of the possible genotype categories should be investigated in this way before commencing with genotyping the full set of markers.

For some markers with highly overlapping or inaccurate clusters, interactive clustering might be the last option. This can be performed using the function callGenotypes.interactive (requires package rggobi to be installed). This function will loop through a smaller set of markers, allowing the user to define clusters manually. Pedigree checking is performed inside the function, and a phenoData column “PedigreeID” is required (see validateCallsPedigree). Use the function assignToAlleleSet to incorporate the manual results into a larger "AlleleSetIllumina" object.

Value

Output from setGenoOptions is a list with all defined or suggested options needed for genotype calling

Output from callGenotypes is an "AlleleSetIllumina" object with an extra assayData entry:

call

Matrix with numeric values {0, 1/4, 1/2, 3/4, 1}, each indicating the ratio of B alleles for a marker in a sample


The "AlleleSetIllumina" output from callGenotypes.interactive in addition holds the assayData entries:

ped.check

Logical matrix with FALSE indicating pedigree violations among offspring. Parental values are all NA (see validateCallsPedigree)

ped.check.parents

Matrix of numeric values {0, 1, 2}, denoting no error, offspring error, and parent error, respectively (see validateSingleCall)


The following columns are added to featureData:

Classification

Genotype call from automatic clustering

Cent.Deviation

Largest distance from cluster-centre to its ideal position

Within.SD

Largest within-cluster spread

HW.Chi2

Chi-squared statistic from test of Hardy-Weinberg equilibrium

HW.P

Probability of Hardy-Weinberg equilibrium

BAF.Locus1

Estimated B-allele frequency

BAF.Locus2

Estimated BAF of second paralogue (if exists)

Call.Rate

Ratio of arrays assigned to clusters

Manual.Calls.R

Calls from interactive clustering (if applied)


Output from callGenotypes.verboseTest is a list containing

call

The calls from each attempted cluster-combination

fData

Statistics from each cluster-combination

fMetadata

Explanation to each statistic

test

Table of “PASS” or “FAIL” for each test

Note

When pedigree data are available, pedigree checking is a very strong type of validation which is performed after the genotype calling in callGenotypes

In callGenotypes.interactive, pedigree checking is performed during genotype calling. This means the pedigree cannot subsequently be used as validation to the same degree as if it had been kept secret during genotype calling

Author(s)

Lars Gidskehaug

References

L. Gidskehaug, E. Anderssen, A. Flatberg, and B. K. Alsberg (2007) A framework for significance analysis of gene expression data using dimension reduction methods. BMC Bioinformatics 8:346

H. Hotelling (1931) The Generalization of Student's Ratio. Ann. Math. Stat. 2(3):360-378

See Also

generatePolyCenters, getCenters, findPolyploidClusters, testHardyWeinberg, ggobi, plotGenotypes, validateCallsPedigree, assignToAlleleSet, manualCall

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
## Not run: 
#Read 25 markers into an AlleleSetIllumina object
rPath <- system.file("extdata", package="beadarrayMSV")
normOpts <- setNormOptions()
dataFiles <- makeFilenames('testdata',normOpts,rPath)
beadFile <- paste(rPath,'beadData_testdata.txt',sep='/')
beadInfo <- read.table(beadFile,sep='\t',header=TRUE,as.is=TRUE)
BSRed <- createAlleleSetFromFiles(dataFiles[1:4],markers=1:25,beadInfo=beadInfo)

#Genotype calling and validation
gO <- setGenoOptions(largeSample=TRUE)
BSRed <- callGenotypes(BSRed, gO=gO)
BSRed <- validateCallsPedigree(BSRed)
sumClass <- tapply(rep(1,nrow(BSRed)),fData(BSRed)$Classification,sum)
print(sumClass)

#Plot default setting calls
plotGenotypes(BSRed)

#Tune settings to call an initially failed marker
dev.new()
verboseRes <- callGenotypes.verboseTest(BSRed, gO=gO, singleMarker=23)
print(verboseRes$fData)
print(verboseRes$test)
gO <- setGenoOptions(largeSample=TRUE, wSpreadLim=.1, hwAlpha=1e-50)
verboseRes <- callGenotypes.verboseTest(BSRed, gO=gO, singleMarker=23)

#New settings give (likely incorrect) SNP-call
BSRed <- callGenotypes(BSRed, gO=gO)
BSRed <- validateCallsPedigree(BSRed)
dev.new()
plotGenotypes(BSRed)

## End(Not run)

Want to suggest features or report bugs for rdrr.io? Use the GitHub issue tracker.