codeGeno: Recode genotypic data, imputation of missing values and...
In synbreed: Framework for the Analysis of Genomic Prediction Data using R

Description Usage Arguments Details Value Author(s) References Examples

This function combines all algorithms for processing of marker data within synbreed package. Raw marker data is a matrix with elements of arbitrary format (e.g. alleles coded as pair of observed alleles "A/T","G/C", ... , or by genotypes "AA", "BB", "AB"). The function is limited to biallelic markers with a maximum of 3 genotypes per locus. Raw data is recoded into the number of copies of a reference allele, i.e. 0, 1 and 2. Imputation of missing values can be done by random sampling from allele distribution, the Beagle software or family information (see details). Additional preselection of markers can be carried out according to the minor allele frequency and/or fraction of missing values.

codeGeno(gpData,impute=FALSE,
         impute.type=c("random","family","beagle","beagleAfterFamily","beagleNoRand",
                       "beagleAfterFamilyNoRand","fix"),
         replace.value=NULL, maf=NULL, nmiss=NULL, label.heter="alleleCoding",
         reference.allele="minor", keep.list=NULL, keep.identical=TRUE, verbose=FALSE,
         minFam=5, showBeagleOutput=FALSE, tester=NULL, print.report=FALSE, check=FALSE,
         ploidy=2, cores=1)

`gpData`	object of class `gpData` with arbitrary coding in element `geno`. Missing values have to be coded as `NA`.
`impute`	`logical`. Should missing value be replaced by imputing?
`impute.type`	`character` with one out of `"fix"`, `"random"` , `"family"`, `"beagle"`, `"beagleAfterFamily"` , `"beagleAfterFamilyNoRand"`, `"beagleAfterFamilyNoRand"` (default = `"random"`). See details.
`replace.value`	`numeric` scalar to replace missing values in case `impute.type="fix"`.
`maf`	`numeric` scalar. Threshold to discard markers due to the minor allele frequency (MAF). Markers with a MAF < `maf` are discarded, thus `maf` in [0,0.5]. If `map` in `gpData` is available, markers are also removed from `map`.
`nmiss`	`numeric` scalar. Markers with more than `nmiss` fraction of missing values are discarded, thus `nmiss` in [0,1]. If `map` in `gpData` is available, markers are also removed from `map`.
`label.heter`	This is either a scalar or vector of characters to identify heterozygous genotypes or a function returning `TRUE` if an element of the marker matrix is the heterozygous genotype. Defining a function is useful, if number of unique heterozygous genotypes is large, i.e. if genotypes are coded by alleles. If the heterozygous genotype is coded like "A/T","G/C", ..., "AG", "CG", ..., "T:C", "G:A", ... or "G\|T", "A\|C", ... then `label.heter="alleleCoding"` can be used (This is the default). Note that heterozygous values must be identified unambiguously by `label.heter`. Use `label.heter=NULL` if there are only homozygous genotypes, i.e. in DH lines, to speed up computation and restrict imputation to values 0 and 2.
`reference.allele`	Define the reference allele which is used for the coding. Default is `"minor"`, i.e. data is coded by the number of copies of the minor allele. Alternatively, `reference.allele` can specify a single character defining the reference allele for all markers, or a vector defining marker-specific reference alleles (using the same order as of the markers in `gpData`). In case you have already a gpObject with `info$codeGeno == TRUE`, and like only to use higher maf or remove duplicated markers, you can use the option `"keep"`, than the coding of the original object is kept.
`keep.list`	A vector with the names of markers, which should be kept during the process of coding and filtering.
`keep.identical`	`logical`. Should duplicated markers be kept? NOTE: From a set of identical markers (with respect to the non-missing alleles) the one with the smallest number of missing values is kept. For those with an identical number of missing values, the first one is kept and all others are removed.
`verbose`	`logical`. If `TRUE` verbose output is generated during the steps of the algorithm. This is useful to obtain numbers of discarded markers due to different criteria.
`minFam`	For `impute.type` `family` and `beagleAfterFamily`, each family should have at least `minFam` members with available information for a marker to impute missing values according to the family. The default is 5.
`showBeagleOutput`	`logical`. Would you like to see the output of the Beagle software package? The default is `FALSE`.
`tester`	This option is in testing mode at the moment.
`print.report`	`logical`. Should a file `SNPreport.txt` be generated containing further information on SNPs. This includes SNP name, original coding of major and minor allele, MAF and number of imputed values.
`check`	This option has as default `FALSE`. If something seems to be wrong with the coding, with the option `check=TRUE` the function tries to catch the error.
`ploidy`	`numeric`. Here you can specify the number homologous alleles. For this option you need a coding with A and B, e.g. for tetraploids "AAAA", "AAAB", "AABB", "ABBB" and "BBBB". In the `geno` element of your gpData-object will be than the dosage of the minor allele. The argument label.heter is ignored for `ploidy >2`
`cores`	`numeric`. Here you can specify the number of cores you like to use.

Coding of genotypic data is done in the following order (depending on choice of arguments; not all steps are performed):

1. Discarding markers with fraction > nmiss of missing values

2. Recoding alleles from character/factor/numeric into the number of copies of the minor alleles, i.e. 0, 1 and 2. In codeGeno, in the first step heterozygous genotypes are coded as 1. From the other genotypes, the less frequent genotype is coded as 2 and the remaining genotype as 0. Note that function codeGeno will terminate with an error whenever more than three genotypes are found.

2.1 Discarding duplicated markers if keep.identical=FALSE before starting of the imputing step. From identical marker based on pairwise complete oberservations one is discarded randomly. For getting identical results use the function set.seed() before code.geno().

3. Replace missing values by replace.value or impute missing values according to one of the following methods:

Imputing is done according to impute.type

"family": This option is only suitable for homozygous individuals (such as doubled-haploid lines) structured in families. Suppose an observation i is missing (NA) for a marker j in family k. If marker j is fixed in family k, the imputed value will be the fixed allele. If marker j is segregating for the population k, the value is 0 with probability of 0.5 and 2 with probability of 0.5. To use this algorithm, family information has to be stored as variable family in list element covar of an object of class gpData. This column should contain a character or numeric to identify family of all genotyped individuals.
"beagle": Use Beagle Genetic Analysis Software Package version 4.1 (21Jan17.6cc) (Browning and Browning 2007; 2013; 2016) to infer missing genotypes is used. This software is a java program, so that you have to install java (>=1.7) and make it available at your computer. If you use the beagle option, please cite the original papers in publications. Beagle uses a HMM to reconstruct missing genotypes by the flanking markers. Function codeGeno will create a directory beagle for Beagle input and output files (if it does not exist) and run Beagle with default settings. The information on marker position is taken from element map. Indeed, the postion in map$pos must be available for all markers. The program can only handle the position units "bp", "kb" and "Mb". Make sure that there are than only integer numbers for the unit "bp", because beagle can only work with integer numbers. By default, three genotypes 0, 1, 2 are imputed. To restrict the imputation only to homozygous genotypes, use label.heter=NULL.
"beagleAfterFamily": In the first step, missing genotypes are imputed according to the algorithm with impute.type="family", but only for markers that are fixed within the family. Moreover, markers with a missing position (map$pos=NA) are imputed using the algorithm of impute.type="family". In the second step, the remaining genotypes are imputed by Beagle. For details of this see the description of the beagle option.
"beagleNoRand" and "beagleAfterFamilyNoRand": The same as the option beagle, respectively beagleAfterFamily, except that markers without map information will be not imputed.
"random": The missing values for a marker j are sampled from the marginal allele distribution of marker j. With 2 possible genotypes (to force this option, use label.heter=NULL), i.e. 0 and 2, values are sampled from distribution with probabilities P(x=0)=1-p and P(x=2)=p, where p is the minor allele frequency of marker j. In the standardd case of 3 genotypes, i.e. with heterozygous genotypes, values are sampled from distribution P(x=0)=(1-p)^2, P(x=1)=p(1-p) and P(x=2)=p^2 assuming Hardy-Weinberg equilibrium for all loci.
"fix": All missing values are imputed by replace.value. Note that only 0, 1 or 2 should be chosen.

4. Recoding of alleles after imputation, if necessary due to changes in allele frequencies caused by the imputed alleles

5. Discarding markers with a minor allele frequency of <= maf

6. Discarding duplicated markers if keep.identical=FALSE. From identical marker based on pairwise complete oberservations one is discarded randomly. For getting identical results use the function set.seed() before code.geno().

7. Restoring original data format (gpData, matrix or data.frame)

Information about imputing is reported after a call of codeGeno.

Note: Beagle is included in the synbreed package. Once required, Beagle is called using path.package().

An object of class gpData containing the recoded marker matrix. If maf or nmiss were specified or keep.identical=FALSE, dimension of geno and map may be reduced due to selection of markers. The genotype which is homozygous for the minor allele is coded as 2, the other homozygous genotype is coded as 0 and heterozygous genotype is coded as 1.

Hans-Juergen Auinger and Valentin Wimmer

S R Browning and B L Browning (2007) Rapid and accurate haplotype phasing and missing data inference for whole genome association studies using localized haplotype clustering. Am J Hum Genet 81:1084-1097

B L Browning and S R Browning (2013) Improving the accuracy and efficiency of identity by descent detection in population data. Genetics 194(2):459-471

S R Browning and B L Browning (2016) Genotype imputation with millions of reference samples. Am J Hum Genet 98:116-126

# create marker data for 9 SNPs and 10 homozygous individuals
snp9 <- matrix(c(
  "AA",   "AA",   "AA",   "BB",   "AA",   "AA",   "AA",   "AA",  NA,
  "AA",   "AA",   "BB",   "BB",   "AA",   "AA",   "BB",   "AA",  NA,
  "AA",   "AA",   "AB",   "BB",   "AB",   "AA",   "AA",   "BB",  NA,
  "AA",   "AA",   "BB",   "BB",   "AA",   "AA",   "AA",   "AA",  NA,
  "AA",   "AA",   "BB",   "AB",   "AA",   "BB",   "BB",   "BB",  "AB",
  "AA",   "AA",   "BB",   "BB",   "AA",   NA,     "BB",   "AA",  NA,
  "AB",   "AA",   "BB",   "BB",   "BB",   "AA",   "BB",   "BB",  NA,
  "AA",   "AA",    NA,    "BB",    NA,    "AA",   "AA",   "AA",  "AA",
  "AA",    NA,     NA,    "BB",   "BB",   "BB",   "BB",   "BB",  "AA",
  "AA",    NA,    "AA",   "BB",   "BB",   "BB",   "AA",   "AA",  NA),
  ncol=9,byrow=TRUE)

# set names for markers and individuals
colnames(snp9) <- paste("SNP",1:9,sep="")
rownames(snp9) <- paste("ID",1:10+100,sep="")

# create object of class 'gpData'
gp <- create.gpData(geno=snp9)

# code genotypic data
gp.coded <- codeGeno(gp,impute=TRUE,impute.type="random")

# comparison
gp.coded$geno
gp$geno

# example with heterogeneous stock mice
## Not run: 
library(synbreedData)
data(mice)
summary(mice)
# heterozygous values must be labeled  (may run some seconds)
mice.coded <- codeGeno(mice,label.heter=function(x) substr(x,1,1)!=substr(x,3,3))


# example with maize data and imputing by family
data(maize)
# first only recode alleles
maize.coded <- codeGeno(maize,label.heter=1)

# set 200 random chosen values to NA
set.seed(123)
ind1 <- sample(1:nrow(maize.coded $geno),200)
ind2 <- sample(1:ncol(maize.coded $geno),200)
original <- maize.coded$geno[cbind(ind1,ind2)]

maize.coded$geno[cbind(ind1,ind2)] <- NA
# imputing of missing values by family structure
maize.imputed <- codeGeno( maize.coded,impute=TRUE,impute.type="family",label.heter=NULL)


# compare in a cross table
imputed <- maize.imputed$geno[cbind(ind1,ind2)]
(t1 <- table(original,imputed) )
# sum of correct replacements
sum(diag(t1))/sum(t1)

# compare with random imputation
maize.random <- codeGeno(maize.coded,impute=TRUE,impute.type="random",label.heter=NULL)
imputed2 <- maize.random$geno[cbind(ind1,ind2)]
(t2 <- table(original,imputed2) )
# sum of correct replacements
sum(diag(t2))/sum(t2)

## End(Not run)