Description Usage Arguments Details Value Author(s) References Examples
This function combines all algorithms for processing of marker data within synbreed
package.
Raw marker data is a matrix with elements of arbitrary format (e.g. alleles coded as pair of observed alleles "A/T","G/C", ... , or by genotypes "AA", "BB", "AB"). The function is limited to biallelic markers with a maximum of 3 genotypes per locus. Raw data is recoded into the number of copies of a reference allele, i.e. 0, 1 and 2.
Imputation of missing values can be done by random sampling from allele distribution, the Beagle
software or family information (see details). Additional preselection of markers can be carried out according to the minor allele frequency and/or fraction of missing values.
1 2 3 4 5 6 7 | codeGeno(gpData,impute=FALSE,
impute.type=c("random","family","beagle","beagleAfterFamily","beagleNoRand",
"beagleAfterFamilyNoRand","fix"),
replace.value=NULL, maf=NULL, nmiss=NULL, label.heter="alleleCoding",
reference.allele="minor", keep.list=NULL, keep.identical=TRUE, verbose=FALSE,
minFam=5, showBeagleOutput=FALSE, tester=NULL, print.report=FALSE, check=FALSE,
ploidy=2, cores=1)
|
gpData |
object of class |
impute |
|
impute.type |
|
replace.value |
|
maf |
|
nmiss |
|
label.heter |
This is either a scalar or vector of characters to identify heterozygous genotypes or a function returning |
reference.allele |
Define the reference allele which is used for the coding. Default is |
keep.list |
A vector with the names of markers, which should be kept during the process of coding and filtering. |
keep.identical |
|
verbose |
|
minFam |
For |
showBeagleOutput |
|
tester |
This option is in testing mode at the moment. |
print.report |
|
check |
This option has as default |
ploidy |
|
cores |
|
Coding of genotypic data is done in the following order (depending on choice of arguments; not all steps are performed):
1. Discarding markers with fraction > nmiss
of missing values
2. Recoding alleles from character/factor/numeric into the number of copies of the minor alleles, i.e. 0, 1 and 2. In codeGeno
, in the first step heterozygous genotypes are coded as 1. From the other genotypes, the less frequent genotype is coded as 2 and the remaining genotype as 0. Note that function codeGeno
will terminate with an error whenever more than three genotypes are found.
2.1 Discarding duplicated markers if keep.identical=FALSE
before starting of the imputing step. From identical marker based on pairwise complete oberservations one is discarded randomly. For getting identical results use the function set.seed()
before code.geno()
.
3. Replace missing values by replace.value
or impute missing values according to one of the following methods:
Imputing is done according to impute.type
This option is only suitable for homozygous individuals (such as doubled-haploid lines) structured in families.
Suppose an observation i is missing (NA) for a marker j in family k. If marker j is fixed in family k, the imputed value will be the fixed allele. If marker j is segregating for the population k,
the value is 0 with probability of 0.5 and 2 with probability of 0.5. To use this algorithm, family information has to be stored as variable family
in list element covar
of an object of class gpData
. This column should contain a character
or numeric
to identify family of all genotyped individuals.
Use Beagle Genetic Analysis Software Package version 4.1 (21Jan17.6cc) (Browning and Browning 2007; 2013; 2016) to infer missing genotypes is used. This software is a java program, so that you have to install java (>=1.7) and make it available at your computer. If you use the beagle
option, please cite the original papers in publications. Beagle uses a HMM to reconstruct missing genotypes by the flanking markers. Function codeGeno
will create a directory beagle
for Beagle input and output files (if it does not exist) and run Beagle with default settings. The information on marker position is taken from element map
. Indeed, the postion in map$pos
must be available for all markers. The program can only handle the position units "bp", "kb" and "Mb". Make sure that there are than only integer numbers for the unit "bp", because beagle can only work with integer numbers. By default, three genotypes 0, 1, 2 are imputed. To restrict the imputation only to homozygous genotypes, use label.heter=NULL
.
In the first step, missing genotypes are imputed according to the algorithm with impute.type="family"
, but only for markers that are fixed within the family. Moreover, markers with a missing position (map$pos=NA
) are imputed using the algorithm of impute.type="family"
. In the second step, the remaining genotypes are imputed by Beagle. For details of this see the description of the beagle
option.
The same as the option beagle
, respectively beagleAfterFamily
, except that markers without map information will be not imputed.
The missing values for a marker j are sampled from the marginal allele distribution of marker j. With 2 possible genotypes (to force this option, use label.heter=NULL
), i.e. 0 and 2, values are sampled from distribution with probabilities P(x=0)=1-p and P(x=2)=p, where p is the minor allele frequency of marker j. In the standardd case of 3 genotypes, i.e. with heterozygous genotypes, values are sampled from distribution P(x=0)=(1-p)^2, P(x=1)=p(1-p) and P(x=2)=p^2 assuming Hardy-Weinberg equilibrium for all loci.
All missing values are imputed by replace.value
. Note that only 0, 1 or 2 should be chosen.
4. Recoding of alleles after imputation, if necessary due to changes in allele frequencies caused by the imputed alleles
5. Discarding markers with a minor allele frequency of <= maf
6. Discarding duplicated markers if keep.identical=FALSE
. From identical marker based on pairwise complete oberservations one is discarded randomly. For getting identical results use the function set.seed()
before code.geno()
.
7. Restoring original data format (gpData
, matrix
or data.frame
)
Information about imputing is reported after a call of codeGeno
.
Note: Beagle is included in the synbreed package. Once required, Beagle is called using path.package()
.
An object of class gpData
containing the recoded marker matrix. If maf
or nmiss
were specified or keep.identical=FALSE
, dimension of geno
and map
may be reduced due to selection of markers. The genotype which is homozygous for the minor allele is coded as 2, the other homozygous genotype is coded as 0 and heterozygous genotype is coded as 1.
Hans-Juergen Auinger and Valentin Wimmer
S R Browning and B L Browning (2007) Rapid and accurate haplotype phasing and missing data inference for whole genome association studies using localized haplotype clustering. Am J Hum Genet 81:1084-1097
B L Browning and S R Browning (2013) Improving the accuracy and efficiency of identity by descent detection in population data. Genetics 194(2):459-471
S R Browning and B L Browning (2016) Genotype imputation with millions of reference samples. Am J Hum Genet 98:116-126
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 | # create marker data for 9 SNPs and 10 homozygous individuals
snp9 <- matrix(c(
"AA", "AA", "AA", "BB", "AA", "AA", "AA", "AA", NA,
"AA", "AA", "BB", "BB", "AA", "AA", "BB", "AA", NA,
"AA", "AA", "AB", "BB", "AB", "AA", "AA", "BB", NA,
"AA", "AA", "BB", "BB", "AA", "AA", "AA", "AA", NA,
"AA", "AA", "BB", "AB", "AA", "BB", "BB", "BB", "AB",
"AA", "AA", "BB", "BB", "AA", NA, "BB", "AA", NA,
"AB", "AA", "BB", "BB", "BB", "AA", "BB", "BB", NA,
"AA", "AA", NA, "BB", NA, "AA", "AA", "AA", "AA",
"AA", NA, NA, "BB", "BB", "BB", "BB", "BB", "AA",
"AA", NA, "AA", "BB", "BB", "BB", "AA", "AA", NA),
ncol=9,byrow=TRUE)
# set names for markers and individuals
colnames(snp9) <- paste("SNP",1:9,sep="")
rownames(snp9) <- paste("ID",1:10+100,sep="")
# create object of class 'gpData'
gp <- create.gpData(geno=snp9)
# code genotypic data
gp.coded <- codeGeno(gp,impute=TRUE,impute.type="random")
# comparison
gp.coded$geno
gp$geno
# example with heterogeneous stock mice
## Not run:
library(synbreedData)
data(mice)
summary(mice)
# heterozygous values must be labeled (may run some seconds)
mice.coded <- codeGeno(mice,label.heter=function(x) substr(x,1,1)!=substr(x,3,3))
# example with maize data and imputing by family
data(maize)
# first only recode alleles
maize.coded <- codeGeno(maize,label.heter=1)
# set 200 random chosen values to NA
set.seed(123)
ind1 <- sample(1:nrow(maize.coded $geno),200)
ind2 <- sample(1:ncol(maize.coded $geno),200)
original <- maize.coded$geno[cbind(ind1,ind2)]
maize.coded$geno[cbind(ind1,ind2)] <- NA
# imputing of missing values by family structure
maize.imputed <- codeGeno( maize.coded,impute=TRUE,impute.type="family",label.heter=NULL)
# compare in a cross table
imputed <- maize.imputed$geno[cbind(ind1,ind2)]
(t1 <- table(original,imputed) )
# sum of correct replacements
sum(diag(t1))/sum(t1)
# compare with random imputation
maize.random <- codeGeno(maize.coded,impute=TRUE,impute.type="random",label.heter=NULL)
imputed2 <- maize.random$geno[cbind(ind1,ind2)]
(t2 <- table(original,imputed2) )
# sum of correct replacements
sum(diag(t2))/sum(t2)
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.