Description Usage Arguments Details Value Note Author(s) References See Also Examples
Clusters each marker into the most likely combination from a sequence of allowed cluster combinations. These may include multisite variants from polyploid genomes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | setGenoOptions(largeSample = FALSE, snpPerArrayLim = 0.8,
arrayPerSnpLim = 0, ploidy = "tetra",
filterLim = 0, detectLim = 0.8,
wSpreadLim = suggestGeno(largeSample)$wSpreadLim,
devCentLim = 0.35,
hwAlpha = suggestGeno(largeSample)$hwAlpha,
probsIndSE = suggestGeno(largeSample)$probsIndSE,
afList = seq(0, 0.5, 0.05),
clAlpha = suggestGeno(largeSample)$clAlpha,
rPenalty = 2,
rotationLim = suggestGeno(largeSample)$rotationLim,
minClLim = 5, nSdOverlap = 2,
minBin = suggestGeno(largeSample)$minBin,
binWidth = suggestGeno(largeSample)$binWidth)
callGenotypes(BSRed,
gO = setGenoOptions(largeSample = ncol(BSRed) > 250))
callGenotypes.interactive(BSRed,
gO = setGenoOptions(largeSample = ncol(BSRed) > 250))
callGenotypes.verboseTest(BSRed, singleMarker = 1,
gO = setGenoOptions(largeSample = ncol(BSRed) > 250))
|
largeSample |
Logical indicating whether or not a large sample is genotyped, which influences the values of some default genotyping options. More than 2-300 arays may be considered large |
snpPerArrayLim |
Minimum ratio of non-NA markers per array. Failing arrays are disregarded |
arrayPerSnpLim |
Minimum ratio of non-NA arrays per marker. Failing markers are disregarded |
ploidy |
Character string indicating which genotyping classes to allow (see
|
filterLim |
Markers with range of |
detectLim |
Ratio of called arrays below which markers are classified as “FAIL” |
wSpreadLim |
The maximum allowed within cluster standard deviation along
|
devCentLim |
The maximum allowed difference in “theta” between expected and estimated cluster centres |
hwAlpha |
Significance level used in Hardy-Weinberg testing. Used to detect where the clustering fails, and a low value (e.g. 1e-10 - 1e-40) may be used to allow natural deviation from HW. This criterion should be used with caution, especially when the sample contains animals from different populations |
probsIndSE |
Numeric quantile for which markers with higher standard errors are excluded from the clustering step (however not discarded) |
afList |
Numeric vector of values within [0, 0.5], denoting which B allele frequencies to test in the case of two segregating paralogs (see below) |
clAlpha |
Numeric significance level for Hotelling's T^2 test (see below) |
rPenalty |
Scaling for the ordinate axis spanned by |
rotationLim |
Controls the maximum allowed tilt of clusters, as defined by Hotelling's ellipses (see below) |
minClLim |
Clusters below this minimum size are disregarded in the Hotelling's test (all animals included in cluster) |
nSdOverlap |
Numeric multiple of |
minBin |
When estimating initial center points for the clustering, histogram peaks below this numeric value are set to zero |
binWidth |
Starting value for histogram bin-width used to find initial center points. Smaller clusters are detected with smaller bin-width |
BSRed |
|
gO |
List of options and cut-off limits used in the clustering (see
|
singleMarker |
Index to a specific marker to test |
Initially, arrays with a fraction of non-NA markers less
than gO$snpPerArrayLim
, and markers with a fraction of non-NA
arrays less than gO$arrayPerSnpLim
, are discarded from the
analysis. Also, markers with range of assayData
“theta”
below gO$filterLim
are called “MONO-filt” and are not
subjected to clustering.
Genotype calling for each marker is based on k-means
clustering in the two dimensions defined by assayData
entries
“intensity” and “theta”, where “intensity” is
penalized by gO$rPenalty
times its median value. The value of
gO$ploidy
sets the allowed cluster combinations through the function
generatePolyCenters
. Data points below the average value
of the phenoData
column “noiseIntensity” across included
arrays are set to missing, and arrays with standard errors exceeding
the quantile given by gO$probsIndSE
are left out from the clustering
step. Two or three of the most likely cluster combinations are
estimated with the function getCenters
. This function
uses histograms to find starting values for the clustering, with input
parameters gO$minBin
defining a peak-height filter and
gO$binWidth
setting an initial bin-width. The ranked
cluster combinations are tested in turn until one is found which
passes the criteria below.
First, gO$devCentLim
controls the maximum distance a cluster is allowed
to deviate from its expected or ideal position. Second, gO$wSpreadLim
limits how large the within cluster standard deviation can be. Both
these criteria are tested in the “theta”-dimension, and if either fails,
the algorithm will attempt to cluster the next most likely
configuration and start over.
The next test is for Hardy-Weinberg (HW)
equilibrium. This is a very powerful test to detect if the clustering
has failed, given HW can be assumed. Otherwise, set the significance value
gO$hwAlpha
to zero to allow any deviation from HW. At
duplicated loci, the observed B allele-frequency (BAF) is in
fact the mean BAF across both paralogues. Several candidate values
of BAF for one paralogue are set in gO$afList
, such that
the BAF for the other paralogue is given. Several values of
gO$afList
within [0, 0.5] are tested for HW, and the most
likely BAF at both paralogs are those resulting in the highest p-value.
Another, somewhat less powerful, quality control test is to look for
overlapping clusters. If a cluster centre on the “theta”-axis is
denoted Cent
, and its spread is given by Within.SD
,
clusters are tested for overlap using Cent +/- gO$nSdOverlap * Within.SD
Markers passing the quality control
are subjected to a Hotelling's T^2-test (Hotelling, 1931; Gidskehaug et
al., 2007), which effectively superimposes an ellipse on each cluster
and discards all data points falling outside its boundaries or within
overlapping ellipses. Due to inaccurate ellipses for small clusters,
only clusters with more than gO$minClLim
members are
tested. The extents of the clusters are controlled by the significance
level gO$clAlpha
of the T^2-tests in such a way that a
small value yield large ellipses. The orientation of an ellipse is
defined by the ratio of variances in the variance-/covariance matrix
from the Hotelling's test. If the ratio of the “theta”-variance
to the “intensity”-variance is given by Sratio
, the
weighted sum of Sratio
across clusters is not allowed to exceed
gO$rotationLim
. Finally, the call rate is required to exceed gO$detectLim
, or
the algorithm will move on to the next most likely cluster combination. If a
marker passes all the tests for one of the candidate cluster
combinations, the genotype is called. Otherwise the marker is failed
(i.e. called “FAIL”).
callGenotypes
is the main genotype calling function. Before
genotyping thousands of markers, it is important spend a little time
tuning the options in gO
. A few hundred markers may be called
using the default settings, and plotted using
plotGenotypes
. Individual markers whith questionable
calls can be investigated more closely using
callGenotypes.verboseTest
. This function performs all the tests
described above, for all suggested cluster combinations, without
failing any of the tests. Rather, all the test results are plotted and
returned, revealing which steps can be taken to improve the
clustering, if any. In these plots, green dots are the estimated
centre points, and red and orange dots represent arrays outside
cluster boundaries or within overlapping ellipses,
respectively. Several markers representing each of the possible
genotype categories should be investigated in this way before commencing
with genotyping the full set of markers.
For some markers with highly overlapping or inaccurate clusters,
interactive clustering might be the last option. This can be performed
using the function callGenotypes.interactive
(requires package
rggobi to be installed). This function
will loop through a smaller set of markers, allowing the user to
define clusters manually. Pedigree checking is performed inside the
function, and a phenoData
column “PedigreeID” is
required (see validateCallsPedigree
). Use the function
assignToAlleleSet
to incorporate the manual results into
a larger "AlleleSetIllumina"
object.
Output from setGenoOptions
is a list with all defined or
suggested options needed for genotype calling
Output from callGenotypes
is an
"AlleleSetIllumina"
object with an extra
assayData
entry:
call |
Matrix with numeric values {0, 1/4, 1/2, 3/4, 1}, each indicating the ratio of B alleles for a marker in a sample |
The "AlleleSetIllumina"
output from
callGenotypes.interactive
in addition holds the
assayData
entries:
ped.check |
Logical matrix with |
ped.check.parents |
Matrix of numeric values {0, 1, 2}, denoting
no error, offspring error, and parent error, respectively (see
|
The following columns are added to featureData
:
Classification |
Genotype call from automatic clustering |
Cent.Deviation |
Largest distance from cluster-centre to its ideal position |
Within.SD |
Largest within-cluster spread |
HW.Chi2 |
Chi-squared statistic from test of Hardy-Weinberg equilibrium |
HW.P |
Probability of Hardy-Weinberg equilibrium |
BAF.Locus1 |
Estimated B-allele frequency |
BAF.Locus2 |
Estimated BAF of second paralogue (if exists) |
Call.Rate |
Ratio of arrays assigned to clusters |
Manual.Calls.R |
Calls from interactive clustering (if applied) |
Output from callGenotypes.verboseTest
is a list containing
call |
The calls from each attempted cluster-combination |
fData |
Statistics from each cluster-combination |
fMetadata |
Explanation to each statistic |
test |
Table of “PASS” or “FAIL” for each test |
When pedigree data are available, pedigree checking is a very strong
type of validation which is performed after the genotype
calling in callGenotypes
In callGenotypes.interactive
, pedigree checking is performed
during genotype calling. This means the pedigree cannot
subsequently be used as validation to the same degree as if it had
been kept secret during genotype calling
Lars Gidskehaug
L. Gidskehaug, E. Anderssen, A. Flatberg, and B. K. Alsberg (2007) A
framework for significance analysis of gene expression data using
dimension reduction methods. BMC Bioinformatics 8:346
H. Hotelling (1931) The Generalization of Student's Ratio. Ann. Math. Stat. 2(3):360-378
generatePolyCenters
, getCenters
,
findPolyploidClusters
, testHardyWeinberg
,
ggobi
, plotGenotypes
,
validateCallsPedigree
, assignToAlleleSet
,
manualCall
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | ## Not run:
#Read 25 markers into an AlleleSetIllumina object
rPath <- system.file("extdata", package="beadarrayMSV")
normOpts <- setNormOptions()
dataFiles <- makeFilenames('testdata',normOpts,rPath)
beadFile <- paste(rPath,'beadData_testdata.txt',sep='/')
beadInfo <- read.table(beadFile,sep='\t',header=TRUE,as.is=TRUE)
BSRed <- createAlleleSetFromFiles(dataFiles[1:4],markers=1:25,beadInfo=beadInfo)
#Genotype calling and validation
gO <- setGenoOptions(largeSample=TRUE)
BSRed <- callGenotypes(BSRed, gO=gO)
BSRed <- validateCallsPedigree(BSRed)
sumClass <- tapply(rep(1,nrow(BSRed)),fData(BSRed)$Classification,sum)
print(sumClass)
#Plot default setting calls
plotGenotypes(BSRed)
#Tune settings to call an initially failed marker
dev.new()
verboseRes <- callGenotypes.verboseTest(BSRed, gO=gO, singleMarker=23)
print(verboseRes$fData)
print(verboseRes$test)
gO <- setGenoOptions(largeSample=TRUE, wSpreadLim=.1, hwAlpha=1e-50)
verboseRes <- callGenotypes.verboseTest(BSRed, gO=gO, singleMarker=23)
#New settings give (likely incorrect) SNP-call
BSRed <- callGenotypes(BSRed, gO=gO)
BSRed <- validateCallsPedigree(BSRed)
dev.new()
plotGenotypes(BSRed)
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.