Description Usage Arguments Details Value Author(s) References Examples
This function combines all algorithms for processing of marker data within
synbreed
package. Raw marker data is a matrix with elements of
arbitrary format (e.g. alleles coded as pair of observed alleles
"A/T","G/C", ... , or by genotypes "AA", "BB", "AB"). The function is
limited to biallelic markers with a maximum of 3 genotypes per locus. Raw
data is recoded into the number of copies of a reference allele, i.e. 0, 1
and 2. Imputation of missing values can be done by random sampling from
allele distribution, the Beagle
software or family information (see
details). Additional preselection of markers can be carried out according to
the minor allele frequency and/or fraction of missing values.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | codeGeno(
gpData,
impute = FALSE,
impute.type = c("random", "family", "beagle", "beagleAfterFamily", "beagleNoRand",
"beagleAfterFamilyNoRand", "fix"),
replace.value = NULL,
maf = NULL,
nmiss = NULL,
label.heter = "alleleCoding",
reference.allele = "minor",
keep.list = NULL,
keep.identical = TRUE,
verbose = FALSE,
minFam = 5,
showBeagleOutput = FALSE,
tester = NULL,
print.report = FALSE,
check = FALSE,
ploidy = 2,
cores = 1
)
|
gpData |
object of class |
impute |
|
impute.type |
|
replace.value |
|
maf |
|
nmiss |
|
label.heter |
This is either a scalar or vector of characters to
identify heterozygous genotypes or a function returning |
reference.allele |
Define the reference allele which is used for the
coding. Default is |
keep.list |
A vector with the names of markers, which should be kept during the process of coding and filtering. |
keep.identical |
|
verbose |
|
minFam |
For |
showBeagleOutput |
|
tester |
This option is in testing mode at the moment. |
print.report |
|
check |
This option has as default |
ploidy |
|
cores |
|
Coding of genotypic data is done in the following order (depending on choice of arguments; not all steps are performed):
1. Discarding markers with fraction > nmiss
of missing values
2. Recoding alleles from character/factor/numeric into the number of copies
of the minor alleles, i.e. 0, 1 and 2. In codeGeno
, in the first step
heterozygous genotypes are coded as 1. From the other genotypes, the less
frequent genotype is coded as 2 and the remaining genotype as 0. Note that
function codeGeno
will terminate with an error whenever more than
three genotypes are found.
2.1 Discarding duplicated markers if keep.identical=FALSE
before
starting of the imputing step. From identical marker based on pairwise
complete oberservations one is discarded randomly. For getting identical
results use the function set.seed()
before code.geno()
.
3. Replace missing values by replace.value
or impute missing values
according to one of the following methods:
Imputing is done according to impute.type
This option is only suitable for homozygous individuals (such as
doubled-haploid lines) structured in families. Suppose an observation
i is missing (NA) for a marker j in family k. If marker
j is fixed in family k, the imputed value will be the fixed
allele. If marker j is segregating for the population k, the
value is 0 with probability of 0.5 and 2 with probability of 0.5. To use
this algorithm, family information has to be stored as variable
family
in list element covar
of an object of class
gpData
. This column should contain a character
or
numeric
to identify family of all genotyped individuals.
Use Beagle Genetic Analysis Software Package version 4.1
(21Jan17.6cc) (Browning and Browning 2007; 2013; 2016) to infer missing
genotypes is used. This software is a java program, so that you have to
install java (>=1.7) and make it available at your computer. If you use the
beagle
option, please cite the original papers in publications.
Beagle uses a HMM to reconstruct missing genotypes by the flanking markers.
Function codeGeno
will create a directory beagle
for Beagle
input and output files (if it does not exist) and run Beagle with default
settings. The information on marker position is taken from element
map
. Indeed, the postion in map$pos
must be available for all
markers. The program can only handle the position units "bp", "kb" and "Mb".
Make sure that there are than only integer numbers for the unit "bp",
because beagle can only work with integer numbers. By default, three
genotypes 0, 1, 2 are imputed. To restrict the imputation only to homozygous
genotypes, use label.heter=NULL
.
In the
first step, missing genotypes are imputed according to the algorithm with
impute.type="family"
, but only for markers that are fixed within the
family. Moreover, markers with a missing position (map$pos=NA
) are
imputed using the algorithm of impute.type="family"
. In the second
step, the remaining genotypes are imputed by Beagle. For details of this see
the description of the beagle
option.
The same as the option beagle
,
respectively beagleAfterFamily
, except that markers without map
information will be not imputed.
The missing values for a marker j are sampled from the
marginal allele distribution of marker j. With 2 possible genotypes
(to force this option, use label.heter=NULL
), i.e. 0 and 2, values
are sampled from distribution with probabilities P(x=0)=1-p and
P(x=2)=p, where p is the minor allele frequency of marker
j. In the standardd case of 3 genotypes, i.e. with heterozygous
genotypes, values are sampled from distribution P(x=0)=(1-p)^2,
P(x=1)=p(1-p) and P(x=2)=p^2 assuming Hardy-Weinberg equilibrium
for all loci.
All missing values are imputed by
replace.value
. Note that only 0, 1 or 2 should be chosen.
4. Recoding of alleles after imputation, if necessary due to changes in allele frequencies caused by the imputed alleles
5. Discarding markers with a minor allele frequency of <= maf
6. Discarding duplicated markers if keep.identical=FALSE
. From
identical marker based on pairwise complete oberservations one is discarded
randomly. For getting identical results use the function set.seed()
before code.geno()
.
7. Restoring original data format (gpData
, matrix
or
data.frame
)
Information about imputing is reported after a call of codeGeno
.
Note: Beagle is included in the synbreed package. Once required, Beagle is
called using path.package()
.
An object of class gpData
containing the recoded marker
matrix. If maf
or nmiss
were specified or
keep.identical=FALSE
, dimension of geno
and map
may be
reduced due to selection of markers. The genotype which is homozygous for
the minor allele is coded as 2, the other homozygous genotype is coded as 0
and heterozygous genotype is coded as 1.
Hans-Juergen Auinger and Valentin Wimmer
S R Browning and B L Browning (2007) Rapid and accurate haplotype phasing and missing data inference for whole genome association studies using localized haplotype clustering. Am J Hum Genet 81:1084-1097
B L Browning and S R Browning (2013) Improving the accuracy and efficiency of identity by descent detection in population data. Genetics 194(2):459-471
S R Browning and B L Browning (2016) Genotype imputation with millions of reference samples. Am J Hum Genet 98:116-126
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 | # create marker data for 9 SNPs and 10 homozygous individuals
## Not run:
snp9 <- matrix(c(
"AA", "AA", "AA", "BB", "AA", "AA", "AA", "AA", NA,
"AA", "AA", "BB", "BB", "AA", "AA", "BB", "AA", NA,
"AA", "AA", "AB", "BB", "AB", "AA", "AA", "BB", NA,
"AA", "AA", "BB", "BB", "AA", "AA", "AA", "AA", NA,
"AA", "AA", "BB", "AB", "AA", "BB", "BB", "BB", "AB",
"AA", "AA", "BB", "BB", "AA", NA, "BB", "AA", NA,
"AB", "AA", "BB", "BB", "BB", "AA", "BB", "BB", NA,
"AA", "AA", NA, "BB", NA, "AA", "AA", "AA", "AA",
"AA", NA, NA, "BB", "BB", "BB", "BB", "BB", "AA",
"AA", NA, "AA", "BB", "BB", "BB", "AA", "AA", NA
),
ncol = 9, byrow = TRUE
)
# set names for markers and individuals
colnames(snp9) <- paste("SNP", 1:9, sep = "")
rownames(snp9) <- paste("ID", 1:10 + 100, sep = "")
# create object of class 'gpData'
gp <- create.gpData(geno = snp9)
# code genotypic data
gp.coded <- codeGeno(gp, impute = TRUE, impute.type = "random")
# comparison
gp.coded$geno
gp$geno
# example with maize data and imputing by family
data(maize)
# first only recode alleles
maize.coded <- codeGeno(maize, label.heter = 1)
# set 200 random chosen values to NA
set.seed(123)
ind1 <- sample(1:nrow(maize.coded$geno), 200)
ind2 <- sample(1:ncol(maize.coded$geno), 200)
original <- maize.coded$geno[cbind(ind1, ind2)]
maize.coded$geno[cbind(ind1, ind2)] <- NA
# imputing of missing values by family structure
maize.imputed <- codeGeno(maize.coded, impute = TRUE, impute.type = "family", label.heter = NULL)
# compare in a cross table
imputed <- maize.imputed$geno[cbind(ind1, ind2)]
(t1 <- table(original, imputed))
# sum of correct replacements
sum(diag(t1)) / sum(t1)
# compare with random imputation
maize.random <- codeGeno(maize.coded, impute = TRUE, impute.type = "random", label.heter = NULL)
imputed2 <- maize.random$geno[cbind(ind1, ind2)]
(t2 <- table(original, imputed2))
# sum of correct replacements
sum(diag(t2)) / sum(t2)
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.