The main functionality of sspredr entails:

  1. Convert between digit- and string-representations of SNPs.
  2. Check for SNP heterozygosity, minor allele frequency and call frequency.
  3. Create ETA objects for single-predictor predictions using BGLR.
  4. Create ETA objects for single-step prediction using BGLR.

Genotype encoding

Depending on the technology for calling SNP reads and depending on the data format that you're using for storing SNP data you will have entries in your SNP matrix that are either coded as strings, numeric values or integer values. The function recode_snps can translate between the string- and the number-representation while taking into account which value encodes a major or a minor allele at any locus.

First determine the major and minor allele at each locus of SNP-genotypes. We set the argument 'maf_threshold' to zero because at this point we do not want to apply a quality check for minor allele frequency.

library("sspredr")
library("dplyr")
data("marker_character")

# Check the original data format.
typeof(marker_character)
print(marker_character[, 1:5])

geno_lst <- compute_maf(
  marker_character,
  output = "geno_list",
  missing_value = "??",
  maf_threshold = 0
  )

Next, we want to recode the major allele as 2, the minor allele as 0 and heterozygotes as 1.

marker_numeric <- sspredr::recode_snps(
  marker_character,
  major = geno_lst[["major_genotype"]],
  minor = geno_lst[["minor_genotype"]],
  major_coding = 2,
  minor_coding = 0,
  het_coding = 1,
  missing_value = "??",
  na_coding = NA_real_
  )

# Check the new data format.
typeof(marker_numeric)
print(marker_numeric[, 1:5])

Missing values

In many cases your 'raw' SNP data will contain missing values. Since throwing them away would be a considerable waste of your resources you should strive to impute them. I recommend to use an imputation software such as BEAGLE for this purpose, but here I've also provided a simple function which imputes missing values with the major allele at any locus:

data("marker_numeric")

# Examine the data.
print(marker_numeric)
marker_numeric %>%
  is.na() %>%
  sum()

# First, compute the major genotype at each locus.
major_genotype <- marker_numeric %>%
  sspredr::compute_maf(
    x = .,
    output = "geno_list",
    missing = NA_real_,
    maf_threshold = 0
  ) %>%
  .[["major_genotype"]]

# Second, replace all missing genotypes with the missing allele.
imp_numeric <- marker_numeric %>%
  sspredr::replace_na_with_major_genotype(
    dat = .,
    missing_value = NA_real_,
    major_genotype = major_genotype
    )
print(imp_numeric)

# Check whether there are still missing values in the marker matrix.
imp_numeric %>%
  is.na() %>%
  sum()

SNP quality filtering

In any genomic prediction it is crucial to filter SNPs based on the criteria call frequency, minor allele frequency and heteroyzgosity rate.

SNP heterozygosity

In the following example we have a matrix of SNPs where some loci comprise an excessively large amount of heterozygosity. We would like to keep only loci where the heterozygosity rate is less than say 10%.

data("imp_snps")

# Return the names of all markers with a heterozygosity of less than 10%.
het_loci_nms <- sspredr::compute_het(
  imp_snps,
  output = "marker_names",
  het_threshold = 0.1
  )

low_het_snps <- imp_snps[, colnames(imp_snps) %in% het_loci_nms]
print(paste("The original number of loci was:", ncol(imp_snps)))
print(paste("The number of low-heterozygosity loci is:", ncol(low_het_snps)))
# Return the heterozygosity at each locus.
loci_het_rate <- imp_snps %>%
  sspredr::compute_het(
    output = "marker_heterozygosity",
    het_threshold = 0
    )

Call frequency and minor allele frequency

In order to remove SNPs with low call frequency, low minor allele frequency or to remove duplicated loci you can use the ensure_snp_quality function. Additionally, if an imputation step is yet missing, this function also substitutes missing values with the major allele at any given locus.

Here, we'll use a SNP matrix encoded using the values 0, 1 and 2 as well as missing values and duplicate one of the loci to demonstrate the capabilities this function.

data("marker_numeric")

# Add a duplicated marker locus to the matrix.
snp <- marker_numeric
snp <- cbind(snp, snp[, 1])
print(snp)

We will check the call frequency and the minor allele frequency at each locus, specify that the matrix contains missing values by specifying any_missing=TRUE and remove duplicated marker loci:

clean_snps <- sspredr::ensure_snp_quality(
  snp = marker_numeric,
  callfreq_check = TRUE,
  callfreq_threshold = 0.95,
  maf_check = TRUE,
  maf_threshold = 0.05,
  any_missing = TRUE,
  missing_value = NA_real_,
  remove_duplicated = TRUE
  )

print(paste("The original number of loci was:", ncol(marker_numeric)))
print(paste("The number of loci after quality checking is:", ncol(clean_snps)))
print(clean_snps)


mwesthues/sspredr documentation built on May 23, 2019, 10:56 a.m.