The main functionality of sspredr
entails:
Depending on the technology for calling SNP reads and depending on the data
format that you're using for storing SNP data you will have entries in your
SNP matrix that are either coded as strings, numeric values or integer values.
The function recode_snps
can translate between the string- and the
number-representation while taking into account which value encodes a major or
a minor allele at any locus.
First determine the major and minor allele at each locus of SNP-genotypes. We set the argument 'maf_threshold' to zero because at this point we do not want to apply a quality check for minor allele frequency.
library("sspredr") library("dplyr")
data("marker_character") # Check the original data format. typeof(marker_character) print(marker_character[, 1:5]) geno_lst <- compute_maf( marker_character, output = "geno_list", missing_value = "??", maf_threshold = 0 )
Next, we want to recode the major allele as 2
, the minor allele as 0
and
heterozygotes as 1
.
marker_numeric <- sspredr::recode_snps( marker_character, major = geno_lst[["major_genotype"]], minor = geno_lst[["minor_genotype"]], major_coding = 2, minor_coding = 0, het_coding = 1, missing_value = "??", na_coding = NA_real_ ) # Check the new data format. typeof(marker_numeric) print(marker_numeric[, 1:5])
In many cases your 'raw' SNP data will contain missing values. Since throwing them away would be a considerable waste of your resources you should strive to impute them. I recommend to use an imputation software such as BEAGLE for this purpose, but here I've also provided a simple function which imputes missing values with the major allele at any locus:
data("marker_numeric") # Examine the data. print(marker_numeric) marker_numeric %>% is.na() %>% sum() # First, compute the major genotype at each locus. major_genotype <- marker_numeric %>% sspredr::compute_maf( x = ., output = "geno_list", missing = NA_real_, maf_threshold = 0 ) %>% .[["major_genotype"]] # Second, replace all missing genotypes with the missing allele. imp_numeric <- marker_numeric %>% sspredr::replace_na_with_major_genotype( dat = ., missing_value = NA_real_, major_genotype = major_genotype ) print(imp_numeric) # Check whether there are still missing values in the marker matrix. imp_numeric %>% is.na() %>% sum()
In any genomic prediction it is crucial to filter SNPs based on the criteria call frequency, minor allele frequency and heteroyzgosity rate.
In the following example we have a matrix of SNPs where some loci comprise an excessively large amount of heterozygosity. We would like to keep only loci where the heterozygosity rate is less than say 10%.
data("imp_snps") # Return the names of all markers with a heterozygosity of less than 10%. het_loci_nms <- sspredr::compute_het( imp_snps, output = "marker_names", het_threshold = 0.1 ) low_het_snps <- imp_snps[, colnames(imp_snps) %in% het_loci_nms] print(paste("The original number of loci was:", ncol(imp_snps))) print(paste("The number of low-heterozygosity loci is:", ncol(low_het_snps)))
# Return the heterozygosity at each locus. loci_het_rate <- imp_snps %>% sspredr::compute_het( output = "marker_heterozygosity", het_threshold = 0 )
In order to remove SNPs with low call frequency, low minor allele frequency
or to remove duplicated loci you can use the ensure_snp_quality
function.
Additionally, if an imputation step is yet missing, this function also
substitutes missing values with the major allele at any given locus.
Here, we'll use a SNP matrix encoded using the values 0
, 1
and 2
as well
as missing values and duplicate one of the loci to demonstrate the capabilities
this function.
data("marker_numeric") # Add a duplicated marker locus to the matrix. snp <- marker_numeric snp <- cbind(snp, snp[, 1]) print(snp)
We will check the call frequency and the minor allele frequency at each locus,
specify that the matrix contains missing values by specifying any_missing=TRUE
and remove duplicated marker loci:
clean_snps <- sspredr::ensure_snp_quality( snp = marker_numeric, callfreq_check = TRUE, callfreq_threshold = 0.95, maf_check = TRUE, maf_threshold = 0.05, any_missing = TRUE, missing_value = NA_real_, remove_duplicated = TRUE ) print(paste("The original number of loci was:", ncol(marker_numeric))) print(paste("The number of loci after quality checking is:", ncol(clean_snps))) print(clean_snps)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.