is a function to exclude individuals and SNPs that do not satisfy one of the following criteria,
SNP genotypes are above the provided call rate threshold, default = 0.85
SNP genotypes are above the provided heterozygosity rate, default = 0.01
SNP loci are not far from Hardy-Weinberg equilibrium
SNP genotypes are above the provided minimum for minor allele frequency, default = 0.005
Matrix of SNP genotypes as prepared by
The SNP criterion to use in selecting individuals to keep, default is 'call.rate'.
Call rate threshold provided or the minimum allowed call rate.
Heterozygosity threshold provided, less polymorphic SNPs are excluded.
Heterozygosity difference used in HW equilibrium, see ‘Details’.
Minor allele frequency threshold provided, SNPs with lower frequencies for the minor allele are excluded.
This is a selection tool to extract high quality individuals and SNPs. Based on the
selection criterion provided,
snpSelect keeps individuals and SNPs that meet the criterion.
Providing multiple selection criteria to
select.method does not work. If a cleanup
requires extraction of individuals and SNPs based on multiple criteria, the function may
be used iteratively.
In the case of
select.method = ‘call.rate’, individuals not SNPs are excluded.
In all other cases, SNPs are excluded.
For selection based on HW equilibrium, the expected frequency of the heterozygous
genotype is estimated based on the two homozygous genetypes by ‘sqrt(4*p^2*q^2)’.
The absolute difference between the expected and observed frequency of the heterozygous
genotype of each SNP needs to be smaller than the minimum difference of
The logical function
is.hwEq(x, diff) with a TRUE or FALSE output can be used to
test SNP genotypes in the integer vector,
x, based of the difference,
A matrix with individuals exceeding the
call.rate.min threshold or a matrix with
hz.min. A minimum heterozygosity instead of 0 is recommended to leave
room for genotying errors. The function can also be used to exclude SNPs not in
Hardy-Weinberg equilibrium if the difference between expected and observed genotype
frequencies is greater than
snpSelect may be used to exclude SNPs
with low minor allele frequencies. If
MAF.min is not supplied, the default of 0.5% is
Falconer and Mackay (1996). Introduction to Quantitative Genetics (4th Edition). Pearson Education Limited, Edinburgh, England.
Wiggans et al. (2009). Selection of single-nucleotide polymorphisms and quality of genotypes used in genomic evaluation of dairy cattle in the United States and Canada. Journal of Dairy Science, 92, 3431-3436.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## Simulate random allele designations for 100 bi-allelic SNPs set.seed(2016) desig <- array(sample(c('A','C','G','T'), size = 200, repl = TRUE), dim=c(100, 2)) ## Simulate random SNP genotypes for 20 individuals - put them in array format ## '-' indicates an unknown base ga <- array(0, dim=c(20, 100)) for(i in 1:20) for(j in 1:100) ga[i, j] <- paste(sample(c(desig[j,],"-"), 2, prob=c(.47, .47, .06), repl=TRUE), collapse='') ## Recode the matrix, place recoded genotypes in ga.r desig <- data.frame(AlleleA_Forward = factor(desig[,1]), AlleleB_Forward = factor(desig[,2])) ga.r <- array(5, dim=c(20, 100)) for(i in 1:100) ga.r[,i] <- snpRecode(ga[,i], desig[i,]) ## Exclude individuals with call rates below 85% dim(snpSelect(ga.r, select.method = "call.rate", call.rate.min = 0.85)) # 15 100 ## Exclude SNPs with heterozygosity <= 1% dim(snpSelect(ga.r, select.method = "heterozygosity", hz.min = 0.01)) # 20 79 ## If the difference between expected and observed genotype frequencies ## is > 0.15, exclude the SNP. dim(snpSelect(ga.r, select.method = "HW.Eq", hz.diff = 0.15)) # 20 29 ## Select SNPs with minor allele frequencies greater than 0.5% dim(snpSelect(ga.r, select.method = "MAF", MAF.min = 0.005)) # 20 79