inferHaplotypes: infer haplotypes for one or more haploblocks

Description Usage Arguments Details Value Examples

View source: R/PolyHaplotyper.R


infer haplotypes for one or more haploblocks, for all individuals, using FS family(s) (with parents) if present, and infer haplotypes for non-FS material as well


inferHaplotypes(mrkDosage, indiv=NULL, ploidy, haploblock,
parents=NULL, FS=NULL, minfrac=c(0.1, 0.01), errfrac=0.025, DRrate=0.025,
maxmrk=0, dropUnused=TRUE, maxparcombs=150000, minPseg=1e-8,
knownHap=integer(0), progress=TRUE, printtimes=FALSE, ahcdir)



matrix or data.frame. Markers are in rows, individuals in columns, each cell has a marker dosage. Names of individuals are the column names, marker names are the row names or (if a data.frame) in a column named MarkerNames. All marker dosages must be in 0:ploidy or NA.


NULL (default) or a character vector with names of all individuals to be considered. If NULL, all columns of mrkDosage are selected.
All indivs that are not in parents or FS vectors (see below) are considered unrelated, i.e. we have no implementation for pedigrees (yet).


all marker dosages should be in 0:ploidy or NA


a list of character vectors. The names are the names of the haploblocks, the character vectors have the names of the markers in each haploblock. Haplotype names are constructed from the haploblock names, which are used as prefixes to which the (zero-padded) haplotype numbers are are appended with separator '_'.


a matrix with one row for each FS family and two columns for the two parents, containing the names of the female and male parent of each family.


a list of character vectors. Each character vector has the names of the individuals of one FS family. The items of the list should correspond to the rows of the parents matrix, in the same order.


vector of two fractions, default 0.1 and 0.01. A haplotype is considered to be certainly present if it must occur in at least a fraction minfrac[1] of all individuals; in the final stage for the "other" individuals (those that do not belong to the FS or its parents) this fraction is lowered to minfrac[2]; see also inferHaps_noFS


default 0.025. The assumed fraction marker genotypes with an error (over all markers in the haploblock). The errors are assumed to be uniformly distributed over all except the original marker dosage combinations (mrkdids)


default 0.025. The rate of double reduction per meiosis (NOT per allele!); e.g. with a DRrate of 0.04, a tetraploid parent with genotype ABCD will produce a fraction of 0.04 of DR gametes AA, BB, CC and DD (each with a frequency of 0.01), and a fraction of 0.96 of the non-DR gametes AB, AC, AD, BC, BD, CD (each with a frequency of 0.16)


Haploblocks with more than maxmrk markers will be skipped. Default 0: no haploblocks are skipped


TRUE (default) if the returned matrix should only contain rows for haplotypes that are present; if FALSE matrix contains rows for all possible haplotypes


Parent 1 and 2 both may have multiple possible haplotype combinations. For each pair of haplotype combinations (one from P1 and one from P2) the expected FS segregation must be checked against the observed. This may take a long time if many such combinations need to be checked. This parameter sets a limit to the number of allowed combinations per haploblock; default 150000 takes about 45 min.


default 1e-8. The minimum P-value of a chisquared test for segregation in FS families. The best solution for an FS family is selected based on a combination of P-value and number of required haplotypes, among all candidate solutions with a P-value of at least minPseg. If no such solution is found the FS and its parents are treated as unrelated material


integer vector with haplotype numbers (haplotypes that must be present according to prior inference or knowledge, numbers refer to rows of matrix produced by allHaplotypes); default integer(0), i.e. no known haplotypes


if TRUE, and new haplotype combinations need to be calculated, and the number of markers and the ploidy are both >= 6, progress is indicated by printed messages


if TRUE, the time needed to process each haploblock is printed


a single directory, or not specified. inferHaplotypes uses lists that for each combination of marker dosages give all possible combinations of haplotype dosages. These lists (ahclist and ahccompletelist) are loaded and saved at the directory specified by ahcdir. If no ahcdir is specified it is set to the current working directory.
If an ahclist or ahccompletelist for the correct ploidy is already in GlobalEnv this is used and no new list is loaded, even if ahcdir is specified.


First we consider the case where one or more FS families and their parents are present in the set of samples. In that case, initially the possible haplotype configurations of the parents are determined. From that, all their possible gametes (assuming polysomic inheritance) are calculated and all possible FS haplotype configurations. Comparing this with the observed FS marker dosages the most likely parental and FS configurations are found.
It is possible that multiple parental combinations can explain the observed marker dosages in the FS. In that case, if one is clearly more likely and/or needs less haplotypes, that one is chosen. If there is no clear best solution still the parents and FS individuals that have the same haplotype configuration over all likely solutions are assigned that configuration.
For FS where no good solution is found (because of an error in the marker dosages of a parent, or because the correct solution was not considered) the parents and individuals will be considered as unrelated material.
If several FS families share common parents they are treated as a group, and only solutions are considered that are acceptable for all families in the group.
Finally (or if no FS families are present, immediately) the other samples are haplotyped, which are considered as unrelated material. If FS families have been solved the haplotypes in their parents are considered "known", and known haplotypes can also be supplied (parameter knownHap). For these samples we consecutively add haplotypes that must be present in a minimum number of individuals, always trying to minimize the number of needed haplotypes.
InferHaplotypes uses tables that, for each combination of dosages of the markers in the haploblock, list all haplotype combinations (ahc) that result in these marker dosages. In principle inferHaplotypes uses a list (ahccompletelist) that, for a given ploidy, has all the haplotype combinations for haploblocks from 1 up to some maximum number of markers. This list can be computed with function build_ahccompletelist. If this list is not available (or is some haploblocks contain more markers than the list), the ahc for the (extra) marker.
See the PolyHaplotyper vignette for an illustrated explanation.


a list with for each haploblock one item that itself is a list with items:
message; if this is "" the haploblock is processed and further elements are present; else this message says why the haploblock was skipped (currently only if it contains too many markers)
hapdos: a matrix with the dosages of each haplotype (in rows) for each individual (in columns). For each individual the haplotype dosages sum to the ploidy. If dropUnused is TRUE Only the haplotypes that occur in the population are shown, else all haplotypes
mrkdids: a vector of the mrkdid (marker dosage ID) for each individual (each combination of marker dosages has its own ID; if any of the markers has an NA dosage the corresponding mrkdid is also NA).
The mrkdids can be converted to the marker dosages with function mrkdid2mrkdos.
markers: a vector with the names of the markers in the haploblock
imputedGeno: a matrix in the same format as param mrkDosage, with one row for each marker in the haploblock and one column per imputed individual, with the dosages of the markers. These are the individuals that have incomplete data in mrkDosage but where the available marker dosages match only one of the expected marker genotypes in the FS family (only individuals in FS families are imputed). It is possible that an individual with imputed marker dosages is not haplotyped (as is the case for individuals with complete marker data) if the marker dosages match different possible haplotype combinations. The next elements are only present if one or more FS families were specified:
FSfit: a logical vector with one element per FS family; TRUE if a (or more than one) acceptable solution for the FS is found (although if multiple solution are found they might not be used if unclear which one is the best solution). (Even if no solution was found for an FS, still its individuals may have a haplotype combination assigned ignoring their pedigree)
FSmessages: a character vector with one item per FS family: any message relating to the fitting of a model for that FS, not necessarily an error
FSpval: a vector of the chi-squared P-value associated with the selected FS model for each FS family, or the maximum P value over all models in case none was selected
If for new combinations of marker dosages the possible haplotype combinations have to be calculated, an ahclist file is written to ahcdir


# this example takes about 1 minute to run:
results <- inferHaplotypes(mrkDosage=phdos, ploidy=6,
haploblock=phblocks, parents=phpar, FS=phFS)

PolyHaplotyper documentation built on June 17, 2021, 5:12 p.m.