R/PolyHaplotyper.R

Defines functions pedigreeSim2PH compareHapMrkDosages compareHapresults read.Happyinf.output run.SATlotyper read.SATlotyper.output read.ShesisPlus.output haploblock_df2list calcMrkHaptable calcStatistics calcFSstats calcPedstats overviewByFS pedigreeHapCheck pedigreeHapCheck_OneHaploblock checkFS_OneParent checkFS_TwoParents getFSfreqs_gamfrq getFSfreqs getGameteFreqs getHapcombFreq getFScombs getConsensusmrkDosage mergeReplicates usedhap allhap getHaploblockResults showOneFS inferHaplotypes addNewMrkdids getNewMrkdids getLinkedFSgroups hapOneBlock checkHaploblock checkpops solveOneFS testFSseg chisq.probtest inferHaps_noFS single_cycle_infer checkmrkDosage hapcomb_from_IHlist getHaplotypeNames padded haplofrqMinMax mrkdid2mrkdos mrkdos2mrkdid mrkdidfun findpos_wReplacement findcomb_wReplacement hapcomb2hapindex hapindex2hapcomb hapcomb2hapdos hapdos2hapcomb expandHapdos split_hapnames hapcomb2mrkdos hapdos2mrkdos build_ahccompletelist completeAllHaploComb totHapcombCount getHapcombCount_1mrk calcAllhapcomb4mrkdid getAllHapcomb loadAllHapcombLists getahcdir allHaplotypes

Documented in allhap allHaplotypes build_ahccompletelist calcMrkHaptable calcStatistics checkmrkDosage compareHapMrkDosages compareHapresults expandHapdos getFSfreqs getGameteFreqs hapcomb2hapdos hapdos2hapcomb hapdos2mrkdos haploblock_df2list inferHaplotypes mergeReplicates mrkdid2mrkdos mrkdos2mrkdid overviewByFS padded pedigreeHapCheck pedigreeSim2PH read.Happyinf.output read.SATlotyper.output read.ShesisPlus.output run.SATlotyper showOneFS totHapcombCount usedhap

# PolyHaplotyper

# A haploblock (or block) is a set of bi-allelic markers (SNPs) that are assumed
# to have no recombinations within the population studied. A good example is
# (the set of SNPs within) a contig.

# A haplotype is an allele of a haploblock: a unique combination of marker
# alleles.

# PolyHaplotyper aims to infer the genetic composition of individuals in
# terms of haplotype dosages for a haploblock, based on marker dosages, for any
# ploidy level.

# PolyHaplotyper should not be confused with PediHaplotyper, which is aimed
# at diploid individuals that are all part of a pedigree, and where the
# (bi- or multi-allelic) marker alleles have already been phased.
# In contrast, PolyHaplotyper can handle any ploidy level and uses unphased
# bi-allelic marker genotypes (i.e. marker allele dosages). It can work with
# unrelated individuals and/or independent or linked FS families but
# (so far) does not use or need an entire pedigree, except (optionally) to
# check for haplotype assignments that violate Mendelian inheritance, after the
# haplotyping has been completed.

# A note on the representation of genotypes:
# 1. genotypes in terms of biallelic marker dosages: these can be represented as
#    a. a vector with one element for each marker in the haploblock,
#       with the dosages of one of the marker alleles. Each marker dosage
#       is in 0:ploidy, or NA. This type of data often has mrkdos in the name
#    b. a single integer number: the index of the marker dosage combination
#       in an ordered list of all possible marker dosage combinations. This
#       is NA when at least one of the marker dosages in the haploblock is NA.
#       The index is 1-based and ranges from 1 to (ploidy+1)^nmrk, where nmrk
#       is the number of markers in the haploblock.
#       This index number is often called did (dosage ID) or mrkdid.
# 2. genotypes in terms of haplotype dosages: these can be represented as
#    a. a vector with one element for each of the 2^nmrk possible haplotypes
#       (where nmrk is the number of markers in the haploblock), with the
#       dosages of the haplotypes. The dosages sum to ploidy or all dosages
#       are NA. This type of data often has hapdos or had in the name.
#    b. an ordered vector of length ploidy with the haplotype numbers present
#       in the genotype (e.g. c(1, 5, 8, 8) : 1 copy of haplotypes 1 and 5,
#       and 2 copies of haplotype 8). This type of data often has hapcomb or
#       hac in the name (but comb is also used for other types of
#       combinations, e.g. different combinations of parental genotypes)
#    c. a single integer number: the index of the haplotype combination
#       in an ordered list of all possible haplotype combinations. The index
#       is 1-based and ranges from 1 to choose(nhap+ploidy-1, ploidy), where
#       nhap is the number of possible haplotypes: nhap = 2^nmrk.
#       This index number is often called hapindex or hapix.
#       This indexing fails with more than 8 markers (tetraploid) or 6 markers
#       (hexaploid) because the number of possible haplotype combinations
#       exceeds the capacity of a 32-bit integer.
#    d. an index to the hapcomb in ahc(complete)list: the column number in
#       the matrix at ahccompletelist[[nmrk]][[mrkdid]] (with mrkdid a numeric
#       or integer) or ahclist[[nmrk]][[mrkdid]] (with mrkdid a string).
#       This requires the ploidy, nmrk (the number of markers in the
#       haploblock) and mrkdid (the marker dosage ID, see above).
#       The hapcomb is obtained by
#       getAllHapcomb(did=mrkdid, allhap=allhap, ploidy=ploidy)[, colnum]

#'@importFrom utils combn flush.console read.table write.table
#'@importFrom stats chisq.test pchisq pbinom qbinom

# documentation of the objects in PolyHaplotyper_demo.RData:

#'@title dosages of SNP alleles
#'@description A data.frame with the allele dosages for 30 SNPs in 661
#'samples; used in vignette, stored in PolyHaplotyper_demo.RData
#'@docType data
#'@keywords datasets
#'@name demo_snpdos
#'@format a data.frame with 30 rows (SNPs) and 663 columns. The first
#'two columns contain the SNP name and contig; the remaining columns
#'contain the allele dosages (integers in 0..6 or NA) of the SNPs in
#'661 samples
NULL

#'@title pedigree
#'@description A data.frame with a pedigree; used in vignette, stored in
#'PolyHaplotyper_demo.RData
#'@docType data
#'@keywords datasets
#'@name demo_ped
#'@format a data.frame with 661 rows (samples) and
#'4 columns: genotype, mother, father, sample_nr
NULL

# documentation of the objects in PolyHaplotyper_small.RData:

#'@title dosages of SNP alleles
#'@description A matrix with the allele dosages for 8 SNPs in 661
#'samples; used in manuals, stored in PolyHaplotyper_small.RData
#'@docType data
#'@keywords datasets
#'@name phdos
#'@format a data.frame with 30 rows (SNPs) and 663 columns. The first
#'two columns contain the SNP name and contig; the remaining columns
#'contain the allele dosages (integers in 0..6 or NA) of the SNPs in
#'661 samples
NULL

#'@title pedigree
#'@description A data.frame with a pedigree; used in manuals, stored in
#'PolyHaplotyper_small.RData
#'@docType data
#'@keywords datasets
#'@name phped
#'@format a data.frame with 661 rows (samples) and
#'4 columns: genotype, mother, father, sample_nr
NULL

#'@title List of markers per haploblock
#'@description A list with for each haploblock the names of the markers it
#'contains; used in manuals, stored in PolyHaplotyper_small.RData
#'@docType data
#'@keywords datasets
#'@name phblocks
#'@format a list with 2 character vectors
NULL

#'@title parents of FS families
#'@description A matrix with the names of the parents of each FS family;
#'used in manuals, stored in PolyHaplotyper_small.RData
#'@docType data
#'@keywords datasets
#'@name phpar
#'@format a matrix with 4 rows, 1 per FS family, and 2 columns, 1 for each
#'parent
NULL

#'@title members of FS families
#'@description A list with for each FS family the names of the individuals it
#'contains; used in manuals, stored in PolyHaplotyper_small.RData
#'@docType data
#'@keywords datasets
#'@name phFS
#'@format a list with 4 vectors, each with the (here numeric) names of
#'the FS members
NULL

#'@title haplotyping results
#'@description A list with the results of function inferHaplotypes; used in
#'manuals, stored in PolyHaplotyper_small.RData
#'@docType data
#'@keywords datasets
#'@name phresults
#'@format a list with 2 elements, one per haploblock; each element itself
#'a list with multiple components
NULL

# Start of function definitions

#'@title get all haplotypes for the given markers
#'@description Given a set of bi-allelic (SNP) marker names, generate all
#'possible haplotypes
#'@usage allHaplotypes(mrknames)
#'@param mrknames the names of the (bi-allelic) markers in the
#'haploblock (contig)
#'@return a matrix with markers in columns and all possible (2 ^ nmrk)
#'haplotypes in rows. 0: haplotype contains the non-dosage-counted marker
#'allele (the reference allele); 1: haplotype contains the dosage-counted
#'(alternative) marker allele. The colnames are the marker names.
#'@examples
#'# show the 8 possible haplotypes with 3 bi-allelic markers:
#'allHaplotypes(mrknames=c("mrkA", "mrkB", "mrkC"))
#'@export
allHaplotypes <- function(mrknames) {
  lst <- list()
  for (s in 1:length(mrknames)) {
    lst[[s]] <- 0:1
  }
  eg <- as.matrix(expand.grid(lst, stringsAsFactors=FALSE))
  eg <- eg[, ncol(eg):1, drop=FALSE]
  colnames(eg) <- mrknames
  eg
} #allHaplotypes

getahcdir <- function(ahcdir=NULL) {
  # for internal use only; get ahcdir from (1) param ahcdir or (2) working dir
  # result ends always with "/" so a filename can be pasted onto it
  if (is.null(ahcdir)) ahcdir <- getwd() else {
    if (!is.character(ahcdir) || length(ahcdir) != 1)
      stop(paste("invalid ahcdir:", ahcdir))
    if (ahcdir=="") ahcdir <- getwd()
  }
  if (!dir.exists(ahcdir))
    stop(paste("non-existing ahcdir:", ahcdir))
  if (substr(ahcdir, nchar(ahcdir), nchar(ahcdir)) != "/")
    ahcdir <- paste0(ahcdir, "/")
  ahcdir
} #getahcdir

loadAllHapcombLists <- function(ploidy, nmrk, ahcdir) {
  # This function is called only at start of inferHaplotypes
  # ploidy: 1 integer
  # nmrk: 1 integer, the number of markers in the haploblock
  # result: a list with elements ahccompletelist, ahclist, complete_nmrk,
  #         ploidy, ahcdir

  checklist <- function(lst, ploidy, namespresent) {
    if (!is.list(lst) || length(lst) < 1) return(FALSE)
    if (!all(sapply(lst, class) == "list")) return(FALSE)
    if (length(lst[[length(lst)]]) == 0) return(FALSE) # in highest nmrk at least one mrkdid should be present
    # cl <- lapply(lst, FUN=function(x) sapply(x, class))
    # if (!all(unlist(cl) == "matrix")) return(FALSE)
    # due to change in class of matrix in R version 4:
    cl <- lapply(lst, FUN=function(x) sapply(x, inherits, "matrix"))
    if (!all(unlist(cl))) return(FALSE)
    cl <- lapply(lst, FUN=function(x) sapply(x, nrow))
    if (!all(unlist(cl) == ploidy)) return(FALSE)
    lnames <- names(lst[[length(lst)]])
    if (xor(!namespresent, (is.null(lnames) || any(lnames == "")))) return(FALSE)
    return(TRUE)
  } #checklist

  loadlist <- function(listname, ploidy, ahcdir, namespresent, message) {
    # try to load:
    if (message) cat(paste("loading ", listname, "..."))
    x <- tryCatch({
      fname <- paste0(ahcdir, listname, "_", ploidy, "x.RData")
      suppressWarnings(n <- load(fname))
      if (!(listname %in% n))
        return(paste("file", fname, "does not contain", listname))
    }, error=function(e) "error")
    if (identical(x, "error")) {
      if (message) cat (" not found!\n")
      return("") # not an error but no list file
    } else {
      #checks:
      if (!checklist(get(listname), ploidy, namespresent))
        return(paste(listname, " has incorrect format!\n"))
      if (message) cat (" done!\n")
      return(get(listname))
    }
  } #loadlist


  ahcinfo <- list(ahccompletelist=NULL, ahclist=NULL, complete_nmrk=0,
                  ploidy=ploidy, ahcdir=getahcdir(ahcdir))
  if (exists("ahccompletelist") && # in .GlobalEnv, loaded by user
      checklist(lst=ahccompletelist, ploidy=ploidy, namespresent=FALSE)) {
    ahcinfo$ahccompletelist <- ahccompletelist
  } else {
    ahccompletelist <-
      loadlist(listname="ahccompletelist", ploidy=ploidy, ahcdir=ahcinfo$ahcdir,
               namespresent=FALSE, message=TRUE)
    if (is.list(ahccompletelist)) {
      ahcinfo$ahccompletelist <- ahccompletelist
    } else if (ahccompletelist != "") stop(ahccompletelist)
  }
  if (!is.null(ahcinfo$ahccompletelist))
    ahcinfo$complete_nmrk <- length(ahcinfo$ahccompletelist)
  if (ahcinfo$complete_nmrk < nmrk) {
    # we need an additional ahclist for the mrkdids with larger nmrk
    if (exists("ahclist") && # in .GlobalEnv, loaded by user
        checklist(ahclist, ploidy, namespresent=TRUE)) {
      ahcinfo$ahclist <- ahclist
    } else {
      ahclist <- loadlist("ahclist", ploidy, ahcinfo$ahcdir,
                          namespresent=TRUE, message=FALSE)
      if (is.list(ahclist)) {
        ahcinfo$ahclist <- ahclist
      } else {
        if (ahclist != "") stop(ahclist)
        ahcinfo$ahclist <- list() # new empty ahclist
      }
    }
  }
  ahcinfo
} # loadAllHapcombLists

# not exported function (relies on ahcinfo)
#_#@title get all haplotype combinations that result in the given marker
#_#'dosages
#_#'@description get all haplotype combinations that result in the given
#_#'marker dosages
#_#'@usage getAllHapcomb(mrkDosage=NULL, mrkdid=NULL, nmrk, ahcinfo)
#_#'@param mrkDosage an integer vector with the dosages of each marker in one
#_#'individual; all dosages must be in 0:ahcinfo$ploidy or NA
#_#'Either mrkDosage or mrkdid must be specified, not both
#_#'@param mrkdid marker dosage ID: in combination with nmrk and ploidy specifying
#_#'the mrkDosage. Can be NA or an integer, numeric or character which must
#_#'specify an integer > 0
#_#'Either mrkDosage or mrkdid must be specified, not both
#_#'@param nmrk the number of markers (should match mrkDosage if that is specified)
#_#'@param ahcinfo a list as returned by loadAllHapcombLists
#_#'@details Each column of the return value represents one combination of
#_#'haplotypes that yields the observed marker dosages. If any of the marker
#_#'dosages, or mrkdid, are NA a matrix with 0 columns is returned.\cr
#_#'This function makes use of precalculated lists ahcinfo$ahccompletelist
#_#'and ahcinfo$ahclist; the mrkdid must be present in one of them.
#_#'@return an integer matrix with <ploidy> rows and one column for each haplotype
#_#'combination that matches the marker dosages; the haplotype numbers
#_#'within a column are in increasing order
getAllHapcomb <- function(mrkDosage=NULL, mrkdid=NULL, nmrk, ahcinfo) {
  # mrkDosage has already been checked at start of inferHaplotypes,
  # mrkdid are computed
  # so we don't check either, only find out of did or mrkDosage specified:
  if (is.null(mrkdid)) {
    if (anyNA(mrkDosage)) return(matrix(integer(0), nrow=ahcinfo$ploidy))
    mrkdid <- mrkdidfun(mrkDosage, ploidy=ahcinfo$ploidy)
  }
  if (is.na(mrkdid)) return(matrix(integer(0), nrow=ahcinfo$ploidy))
  if (nmrk <= ahcinfo$complete_nmrk)
    return(ahcinfo$ahccompletelist[[nmrk]][[as.integer(mrkdid)]])
  #debug check:
  # if (nmrk>length(ahcinfo$ahclist) ||
  #     !(mrkdid %in% names(ahcinfo$ahclist[[nmrk]])))
  #   stop("mrkdid not in ahc(complete)list in getAllHaplocomb")
  return(ahcinfo$ahclist[[nmrk]][[as.character(mrkdid)]])
} # getAllHapcomb

calcAllhapcomb4mrkdid <- function(mrkDosage=NULL, mrkdid=NULL, ploidy, allhap) {
  # mrkDosage: an integer vector with the dosages (0:ploidy) of nmrk markers
  # mrkdid: 1 integer (marker dosage id); if NULL, mrkDosage is used
  # ploidy: 1 integer
  # allhap: a matrix as returned by allHaplotypes, for nmrk markers
  # in the same order as mrkDosage
  # result: a matrix with <ploidy> rows and one column for each haplotype
  #         combination that matches mrkdid
  #
  # check if mrkdid or mrkDosage is valid (we assume ploidy and nmrk are valid)
  nmrk <- ncol(allhap)
  if (is.null(mrkdid)) {
    # check mrkDosage:
    if (length(mrkDosage) != nmrk || !all(mrkDosage %in% 0:ploidy))
      stop("invalid mrkDosage")
  } else {
    if (length(mrkdid)!=1 || is.na(mrkdid) || as.integer(mrkdid) != mrkdid ||
        mrkdid<1 || mrkdid > (ploidy+1)^nmrk)
      stop("invalid mrkdid")
    mrkDosage <- mrkdid2mrkdos(mrkdid, nmrk, ploidy)
  }
  #TODO: the next algorithm is less efficient if the SNP dosages are high than if they are
  #low (because with low dosages a haplotype combination will sooner exceed the dosage of
  #at least one of the SNPs). Therefore in these cases (with average SNP dosage > ploidy/2)
  #it may be more efficient to find a solution for the did with all SNP dosages set to
  #<ploidy> - actual dosage, and afterwards replacing the haplotypes by their inverses.
  allmat1 <- matrix(NA_integer_, nrow=ploidy, ncol=0) #matrix with haplotype sets in columns
  allset <- rep(nrow(allhap), ploidy) #current haplotype set: a full set has length ploidy
  while(length(allset) > 0) {
    # if the last allele is 0, change the next:
    if (length(allset) > 0 && allset[length(allset)] <= 0) {
      if (length(allset) == 1) {
        allset <- integer(0)
      } else allset <- allset[1:(length(allset) - 1)]
      if (length(allset) > 0)
        allset[length(allset)] <- allset[length(allset)] - 1
    } else if (all(colSums(allhap[allset,, drop=FALSE]) == mrkDosage)) {
      #solution found; fill up with nr 1 (dosage 0 for all markers) and add to allmat1
      allset <- c(allset, rep(1, ploidy-length(allset)))
      allmat1 <- cbind(allmat1, allset)
      allset[ploidy] <- 0 #make sure to change the next level
    } else {
      if (length(allset) < ploidy  &&
          !any(colSums(allhap[allset,, drop=FALSE]) > mrkDosage) &&
          allset[length(allset)] > 1) {
        # only add the next allele if none of the dosages too high;
        # adding allele 1 (all zeroes) won't lead to a solution
        allset <- c(allset, allset[length(allset)]) #allele nrs never increase
      } else {
        # change last allele
        allset[length(allset)] <- allset[length(allset)] - 1
      }
    }
  }
  if (ncol(allmat1) == 0) stop("getAllHapcomb: 0 solutions found")
  colnames(allmat1) <- NULL
  #reverse the row order:
  allmat1[nrow(allmat1):1,, drop=FALSE]
} # calcAllhapcomb4mrkdid

getHapcombCount_1mrk <- function(mrkdosage, ploidy, nhap) {
  # mrkdosage: the dosage of the marker (integer of length 1)
  # ploidy
  # nhap = total nr of different haplotypes for the haploblock: 2 ^ nmrk
  #returns the number of haplotype combinations for the given dosage at one
  #marker
  #currently not used
  d <- mrkdosage
  e <- ploidy - d
  h <- nhap %/% 2
  if (d == ploidy) C0 <- 1 else
    C0 <- cumprod(seq(h, h+e-1))[e] / cumprod(seq_len(e))[e]
  if (d == 0) C1 <- 1 else
    C1 <- cumprod(seq(h, h+d-1))[d] / cumprod(seq_len(d))[d]
  C0 * C1
} #getHapcombCount_1mrk

#'@title calculate the total nr of possible haplotype combinations
#'@description calculate the total nr of possible haplotype combinations
#'@usage totHapcombCount(ploidy, nmrk)
#'@param ploidy a vector of 1 or more ploidy levels
#'@param nmrk a vector of 1 or more numbers of markers per haploblock
#'@details nmrk is used to calculate the total number of possible haplotypes
#'= 2^nmrk. The shorter vector of ploidy and nmrk is recycled.
#'@return a vector with the number of possible haplotype combinations for
#'each pair of ploidy and nmrk values
#'@examples
#'totHapcombCount(ploidy=4, nmrk=c(1:8))
#'@export
totHapcombCount <- function(ploidy, nmrk) {
  nhap <- 2^nmrk
  choose(nhap + ploidy - 1, ploidy)
} #totHapcombCount

# not exported function - users should use build_ahccompletelist()
#_#'@title generate all haplotype combinations
#_#'@description generate a list which contains for each marker dosage combination
#_#'at a given ploidy all matching haplotype combinations
#_#'@usage completeAllHaploComb(ploidy, nmrk, savesec=1800, printsec=60,
#_#'fname=NULL)
#_#'@param ploidy ploidy (may be even or odd)
#_#'@param nmrk number of markers in the haploblock
#_#'@param savesec default 1800: number of seconds between successive saves of
#_#'the intermediate results
#_#'@param printsec default 60: number of seconds between printout of
#_#'the current set of haplotypes (the last two are always equal at the time of
#_#'printing). NA or 0 suppresses printing.
#_#'@param fname filename (to which the extension .RData will be added) to store
#_#'the results. Intermediate saves will go to fname + the current set of
#_#'haplotypes; these intermediate files are temporary and will be deleted.\cr
#_#'If NULL (default), fname will be set to e.g. ahc_4x_nmrk6, where 4 is the
#_#'ploidy and nmrk = 6
#_#'@details This function is used by build_ahccompletelist and would not
#_#'normally be called by the user
#_#'@examples
#_#'# Generate an ahccompletelist for a given ploidy, for 1 .. nmrk markers:
#_#'\dontrun{
#_#'ahccompletelist <- list(1:nmrk)
#_#'for (n in seq_len(nmrk)) {
#_#'  ahccompletelist[[n]] <-
#_#'    completeAllHaploComb(ploidy=ploidy, nmrk=n)
#_#'  save(ahccompletelist, file=paste0("ahccompletelist_", ploidy, "x.RData"))
#_#'}
#_#'Note that this takes lots of time and memory already for
#_#'ploidy 4, nmrk = 7 and ploidy 6, nmrk = 5
#_#'}
#_#'@return a list with for each mrkdid (each combination of marker dosages) a
#_#'matrix with one row per haplotype and one column per haplotype combination,
#_#'containing the dosages of the haplotypes in each haplotype combination.
completeAllHaploComb <- function(ploidy, nmrk, savesec=1800, printsec=60,
                                 fname=NULL) {
  if (is.null(fname)) fname <- paste0("ahc_", ploidy, "x_nmrk", nmrk)
  if (is.null(printsec) || length(printsec) !=1 || is.na(as.numeric(printsec)) ||
      printsec == 0) printsec <- NA

  ndids <- as.integer((ploidy + 1) ^ nmrk + 0.001)
  ahc <- list()
  for (did in ndids:1) ahc[[did]] <- matrix(NA_integer_, nrow=ploidy, ncol=1)
  lasthap <- rep(0, ndids) # last stored haplotype column for each did
  allhap <- allHaplotypes(paste0("m", 1:nmrk))
  lastfile <- ""
  lastsave <- proc.time()
  lastprint <- lastsave - printsec - 1
  # generate all hapcombs in turn and calculate their mrkdids:
  starttime <- proc.time()
  nhap <- nrow(allhap)
  allset <- rep(nhap, ploidy) #current haplotype set: a full set has length ploidy
  while(TRUE) {
    mrkdos <- colSums(allhap[allset,, drop=FALSE])
    did <- mrkdidfun(mrkdos, ploidy=ploidy)
    lasthap[did] <- lasthap[did] + 1
    if ((nc<-ncol(ahc[[did]])) < lasthap[did]) {
      #we don't want to reallocate for each new column but increase size
      #when needed by 50%
      extmat <- matrix(NA_integer_, nrow=ploidy, ncol= nc %/% 2 + 2)
      ahc[[did]] <- cbind(ahc[[did]], extmat)
    }
    ahc[[did]][,lasthap[did]] <- allset
    # get the next allset:
    len <- ploidy
    allset[len] <- allset[len] - 1
    while (len > 0 && allset[len] <= 0) {
      len <- len - 1
      if (len > 0) allset[len] <- allset[len] - 1
    }
    if (len <= 0) break
    if (len < ploidy) {
      allset[(len+1):ploidy] <- allset[len] #nhap
      #progress and save: only check when len < ploidy
      if (!is.na(printsec) && (proc.time()-lastprint)[3] > printsec) {
        cat(paste0("ploidy: ", ploidy, " nmrk: ", nmrk,
                   " allset: ", paste(allset, collapse=" "), " sec: ",
                   paste(round((proc.time()-starttime)[1:3], 3), collapse=" "), "\n"))
        lastprint <- proc.time()
        if ((lastsave-lastprint)[3] > savesec) {
          save(ahc, file=paste0(fname, "_", paste(allset, collapse="-"),
                                ".RData"),
               version=2)
          if (file.exists(lastfile)) file.remove(lastfile)
          lastfile <- fname
          lastsave <- proc.time()
        }
      }
    } # len < ploidy
  }
  # all elements of ahc are now matrices with one solution per column,
  # in the form of sets like 1-1-2-4 (i.e. all occurring haplotypes listed).
  # We still only need to cut off the unused but allocated columns
  # of the matrices if any; and also we reverse the order of the rows, so that
  # the haplotype numbers are increasing rather than decreasing:
  for (d in seq_along(ahc)) {
    ahc[[d]] <- ahc[[d]][ploidy:1, 1:lasthap[d], drop=FALSE]
  }
  save(ahc, file=paste0(fname, ".RData"), version=2)
  if (file.exists(lastfile)) file.remove(lastfile)
  invisible(ahc)
} #completeAllHaploComb

#'@title generate a list with all haplotype combinations
#'@description generate a list which contains for each marker dosage combination
#'at a given ploidy all matching haplotype combinations
#'@usage build_ahccompletelist(ploidy, maxmrk, savesec=1800, printsec=300,
#'overwrite, shorten=FALSE)
#'@param ploidy ploidy; may be even or odd, but inferHaplotypes only works
#'with even ploidy if FS families are present, so building an ahccompletelist
#'for odd ploidy is probably not useful
#'@param maxmrk the list will countain all haplotype combinations for
#'haploblocks of 1...maxmrk markers
#'@param savesec default 1800: number of seconds between successive saves of
#'the intermediate results
#'@param printsec default 300: number of seconds between printout of
#'the current set of haplotypes. NA or 0 suppresses printing.
#'@param overwrite the new file ahccompletelist_<ploidy>x.RData is written in
#'the working directory. If a file with that name already exists and would be
#'changed, this is aborted and the old file maintained if overwrite is FALSE
#'@param shorten if file ahccompletelist_<ploidy>x.RData already exists and
#'covers haploblocks of lengths longer than maxmrk,
#'this file is left unchanged if shorten is FALSE (default), but is shortened
#'to maxmrk if shorten is TRUE
#'@details
#'An ahccompletelist reduces the processing time of inferHaplotypes enormously
#'but takes a long time to build. This should therefore be done when
#'PolyHaplotyper will be used multiple times.\cr
#'If an ahccompletelist file already exists in the working directory, this file is
#'used as starting point; if maxmrk is larger than the length of the existing
#'list only the additional items are calculated.\cr
#'If this function crashes (due to exceeding the memory limits of the computer
#'or of R) the file ahccompletelist_<ploidy>x.RData in the working directory
#'contains an intact version of the ahccompletelist for the last finished
#'marker count. Also, the temporary file
#'"buildAHCcompletelist_<datetime>"_"<nmrk>.RData" contains a part of the
#'list item for the number of markers being calculated at that moment.\cr
#'Note that this function takes lots of time and memory already for
#'ploidy 4 / maxmrk 7 and ploidy 6 / maxmrk 5, and is not practicable above
#'ploidy 4 / maxmrk 8, ploidy 6 / maxmrk 6, ploidy 8 / maxmrk 5. Also, the
#'files are large and take up considerable memory, and most of the size is
#'used for the largest haploblock size. Therefore maxmrk should
#'not be taken larger than necessary.
#'@return NULL (invisible); the actual result is a file in the working directory
#'named ahccompletelist_<ploidy>x.RData
#'@examples
#'\donttest{
#'# this example will create a file ahccompletelist_4x.RData
#'# in the working directory, for ploidy=4x and 1 - 4 markers:
#'build_ahccompletelist(ploidy=4, maxmrk=4, overwrite=FALSE)
#'}
#'@export
build_ahccompletelist <- function(ploidy, maxmrk, savesec=1800, printsec=300,
                                 overwrite, shorten=FALSE) {
  fname <- paste0("ahccompletelist_", ploidy, "x.RData")
  tmpfname <- paste0("buildAHCcompletelist_", gsub("[^0-9]", "", Sys.time()), "_")
  ahccompletelist <- NULL
  if (missing(overwrite) || length(overwrite) != 1 || is.na(overwrite) ||
      !is.logical(overwrite))
    stop("overwrite must be FALSE or TRUE")
  if (file.exists(fname)) { # in working directory
    x <- load(fname) # contains ahccompletelist
    if (length(x) != 1 || x !="ahccompletelist")
      stop(paste("the existing file", fname, "does not contain (only) ahccompletelist"))
    if (length(ahccompletelist) == maxmrk ||
        (length(ahccompletelist) > maxmrk && !shorten)) {
      # file in working dir already ok
      return(invisible(NULL))
    }
    if (!overwrite)
      stop(paste("file", fname, "already exists in working directory; aborted"))
  } else {
    ahccompletelist <- vector(0, mode="list")
  }
  newmaxmrk <- ifelse(shorten, maxmrk, length(ahccompletelist))
  if (length(ahccompletelist) >= maxmrk) {
    ahccompletelist <- ahccompletelist[1:newmaxmrk]
    if (file.exists(fname)) file.rename(from=fname, to=paste0(fname,".backup"))
    save(ahccompletelist, file=fname, version=2)
    if (file.exists(paste0(fname,".backup"))) file.remove(paste0(fname,".backup"))
  } else {
    if (totHapcombCount(ploidy=ploidy, nmrk=maxmrk) > 2e9) {
      ans <- readline("this will result in a very large list and may crash the computer; proceed (y/n)? ")
      if (!(ans %in% c("Y", "y")))
        stop("aborted")
    }
    for (nmrk in (length(ahccompletelist)+1):maxmrk) {
      cat(paste0("starting block size ", nmrk, " markers\n"))
      if (file.exists(fname)) file.rename(from=fname, to=paste0(fname,".backup"))
      ahccompletelist[[nmrk]] <-
        completeAllHaploComb(ploidy=ploidy, nmrk=nmrk, savesec=savesec,
                             printsec=printsec,
                             fname=paste0(tmpfname, nmrk))
      save(ahccompletelist, file=fname, version=2)
      cat(paste("saved", fname, "for block sizes up to", nmrk, "markers\n"))
      fn <- paste0(tmpfname, nmrk, ".RData")
      if (file.exists(fn)) file.remove(fn) #left by completeAllHaploComb
      fn <- paste0(fname,".backup")
      if (file.exists(fn)) file.remove(fn)
    }
  }
  invisible(NULL)
} #buildAHCcompletelist

#'@title calculate the marker dosages resulting from haplotype dosage
#'combinations
#'@description calculate the marker dosages resulting from haplotype dosage
#'combinations
#'@usage hapdos2mrkdos(hapdos, allhap)
#'@param hapdos a matrix with one column per combination of haplotypes
#'and one row for each possible haplotype (corresponding to the rows of allhap)
#'with dosage of the haplotypes in each combination. A vector is interpreted
#'as a one-column matrix; all columns must sum to ploidy. The rownames of
#'the matrix (or names of the vector) must contain the haplotype numbers
#'@param allhap a matrix as returned by allHaplotypes
#'@return a matrix with columns corresponding to the columns of hapdos
#'and one row for each marker, with the dosages of each marker in each combination;
#'colnames are the mrkdids (marker dosage IDs), rownames are the marker names taken
#'from allhap
#'@details if hapdos contains NA values, all values in the corresponding
#'column of the result will also be NA
#'@examples
#'# get a matrix of all haplotypes with the 3 specified markers:
#'ah <- allHaplotypes(mrknames=c("mrkA", "mrkB", "mrkC"))
#'# specify haplotype dosages of 4 tetraploid individuals,
#'# only the 3 occurring haplotypes (1, 5 and 6) are given:
#'haplodosg <-
#'  matrix(c(1,2,1, 4,0,0, 0,4,0, 0,0,4), nrow=3,
#'         dimnames=list(paste0("demohap_", c(1,5,6)), paste0("indiv", 1:4)))
#'# calculate the corresponding marker (SNP) dosages:
#'hapdos2mrkdos(hapdos=haplodosg, allhap=ah)
#'@export
hapdos2mrkdos <- function(hapdos, allhap) {
  if (is.vector(hapdos)) hapdos <- as.matrix(hapdos)
  hapdos <- expandHapdos(hapdos, nhap=nrow(allhap))
  res <- t(t(hapdos) %*% allhap)
  colnames(res) <- mrkdos2mrkdid(res, ploidy=sum(hapdos[, 1]), check=FALSE)
  res
} #hapdos2mrkdos

hapcomb2mrkdos <- function(hapcomb, allhap) {
  # similar to hapdos2mrkdos, but with haplotype combinations specified as
  # <ploidy> haplotype id's instead of the dosages of each haplotype
  if (is.vector(hapcomb)) hapcomb <- as.matrix(hapcomb)
  if (!is.matrix(allhap)) stop("allhap must be a matrix")
  if (!all(hapcomb %in% seq_len(nrow(allhap))))
    stop("hapcomb2mrkdos: input error")
  res <- matrix(NA_integer_, nrow=ncol(allhap), ncol=ncol(hapcomb),
                dimnames=list(colnames(allhap), NULL))
  for (i in seq_len(ncol(hapcomb))) {
    tmp <- allhap[hapcomb[, i],, drop=FALSE]
    res[, i] <- colSums(tmp)
  }
  colnames(res) <- mrkdos2mrkdid(res, ploidy=nrow(hapcomb), check=FALSE)
  res
} #hapcomb2mrkdos

split_hapnames <- function(hapnames) {
  # hapnames: a character vector of haplotype names, all starting with the same
  #           contig name, with the haplotype number appended after an
  #           underscore. Or a numeric vector with just the haplotype numbers.
  # return value: a list with 2 elements:
  # $ prefix: a character of length 1 with the entire (common) haplotype name
  #           up to and including the underscore separating the haplotype
  #           number; "x_" in case the hapnames consist of only the haplotype
  #           numbers
  # $ hapnrs: an integer vector with the haplotype numbers
  if (is.null(hapnames)) hapnames <- integer(0)
  if (length(hapnames) == 0) return(list(prefix="x_", hapnrs=integer(0)))
  #are the names just integers?
  suppressWarnings( hapnrs <- as.integer(hapnames) )
  if (!anyNA(hapnrs)) {
    prefix <- ""
  } else {
    #assume that all haplotype names are composed as
    # <haploblock name>_<hapnr>
    pos <- gregexpr("_", hapnames)
    num <- sapply(pos, length)
    if (any(num==0) || max(num) != min(num)) stop("haplotype names invalid")
    # same nr of underscores (1 or more) in all haplotype names
    lastpos <- sapply(pos, FUN="[", num[1])
    if (!min(lastpos) == max(lastpos)) stop("haplotype names invalid")
    # last underscore at same position in all haplotype names
    prefix <- substring(hapnames[1], 1, lastpos[1])
    hapnrs <- as.integer(substring(hapnames, lastpos[1]+1, nchar(hapnames)))
  }
  if (anyNA(hapnrs) || !all(hapnrs > 0) || !all(diff(hapnrs) > 0))
    stop("haplotype names invalid")
  return(list(prefix=prefix, hapnrs=hapnrs))
} #split_hapnames

#'@title Add dropped rows back to haplotype dosage matrix
#'@description Haplotype dosage matrices generated with the dropUnused=TRUE
#'lack some haplotypes that are required by pedigreeHapCheck; this
#'function adds them back.
#'@usage expandHapdos(hapdos, nhap, hbname)
#'@param hapdos a haplotype dosage matrix as returned by a.o. inferHaplotypes.
#'The rownames are assumed to have all the same length:
#'they are composed of the haploblock name to which the 0-padded haplotype number
#'is appended, separated by "_"
#'@param nhap the total number of haplotypes for the haploblock:
#'nhap = 2^nmrk with nmrk the number of markers in the haploblock
#'@param hbname the haploblock name (only used when hapdos has 0 rows)
#'@return a matrix similar to hapdos with rows for the dropped haplotypes
#'re-inserted, with the correct rownames and containing only 0 and NA values
#'@examples
#'# specify haplotype dosages of 4 tetraploid individuals,
#'# only the 3 occurring haplotypes (1, 5 and 6) are given:
#'haplodosg <- matrix(c(1,2,1, 4,0,0, 0,4,0, 0,0,4), nrow=3,
#'                    dimnames=list(paste0("demohap_", c(1,5,6)), paste0("indiv", 1:4)))
#'# add the rows for the absent haplotypes, assuming the haploblock consists
#'# of 3 markers, so 8 haplotypes are possible:
#'expandHapdos(hapdos=haplodosg, nhap=8)
#'@export
expandHapdos <- function(hapdos, nhap, hbname="x") {
  if (!is.matrix(hapdos)) stop("invalid hapdos, must be matrix")
  if (nrow(hapdos) == 0)
    return(matrix(NA_integer_, nrow=nhap, ncol=ncol(hapdos),
                  dimnames=list(
                    getHaplotypeNames(haploblock=hbname, hapcount=nhap),
                    colnames(hapdos))))
  hapnrs <- split_hapnames(rownames(hapdos)) # error if no rownames and other checks
  if (nhap == nrow(hapdos)) return(hapdos)
  hbname <- hapnrs$prefix
  hapnrs <- hapnrs$hapnrs
  if (hapnrs[length(hapnrs)] > nhap) stop("invalid hapdos")
  had <- matrix(NA_integer_, nrow=nhap, ncol=ncol(hapdos),
                dimnames=list(
                  getHaplotypeNames(haploblock=hbname, hapcount=nhap),
                  colnames(hapdos)))
  NAcols <- is.na(colSums(hapdos))
  had[hapnrs,] <- hapdos
  newrows <- setdiff(1:nhap, hapnrs)
  had[newrows, !NAcols] <- 0
  had
} #expandHapdos


#'@title convert haplotype dosages to haplotype combinations
#'@description convert a vector or matrix of haplotype dosages to a vector
#'or matrix of haplotype combinations
#'@usage hapdos2hapcomb(hapdos, ploidy)
#'@param hapdos a vector with one item for some or all possible haplotypes,
#'with the dosage of the haplotypes, summing to ploidy; or a matrix where
#'each column is such a vector, each representing an individual. Missing data
#'are allowed (but any NA makes the entire vector or matrix column unknown);
#'if a matrix, 0 rows indicate missing values for all individuals.
#'When hapdos is a matrix, the haplotype numbers must be given as rownames;
#'when a vector, as names. Not all possible haplotype numbers need to be present
#'but the ones that are present must be in ascending order, no duplicates
#'or missing values allowed.
#'@param ploidy the ploidy (one number)
#'@return a sorted vector of length ploidy with all haplotype numbers present,
#'or a matrix where each column is such a vector, with the same
#'colnames as hapdos
#'@examples
#'# specify haplotype dosages of 4 tetraploid individuals,
#'# only the 3 occurring haplotypes (1, 5 and 6) are given:
#'haplodosg <- matrix(c(1,2,1, 4,0,0, 0,4,0, 0,0,4), nrow=3,
#'                    dimnames=list(paste0("demohap_", c(1,5,6)), paste0("indiv", 1:4)))
#'# usage with hapdos as matrix:
#'hapdos2hapcomb(hapdos=haplodosg, ploidy=4)
#'# usage with hapdos as vector:
#'hapdos2hapcomb(hapdos=haplodosg[, 1], ploidy=4)
#'@export
hapdos2hapcomb <- function(hapdos, ploidy) {
  fn <- function(x, hapnrs) unlist(mapply(rep, hapnrs, x))

  if (missing(ploidy)) stop("ploidy not specified")
  vec <- !is.matrix(hapdos)
  if (vec) hapdos <- as.matrix(hapdos) #names become rownames
  result <- matrix(NA_integer_, nrow=ploidy, ncol=ncol(hapdos),
                   dimnames=list(NULL, colnames(hapdos)))
  cs <- colSums(hapdos) #all sum to ploidy or NA
  if (nrow(hapdos) > 0 && !all(is.na(cs))) {
    if (!all(cs %in% c(NA, ploidy)))
      stop("not all columns of hapdos sum to ploidy")
    if (is.null(rownames(hapdos))) {
      # we include this to allow more general use of hapdos2hapcomb
      hapnrs <- seq_len(nrow(hapdos))
    } else hapnrs <- split_hapnames(rownames(hapdos))$hapnrs
    NAcols <- is.na(cs)
    tmp <- apply(hapdos[, !NAcols, drop=FALSE], MARGIN=2, FUN=fn, hapnrs)
    result[, !NAcols] <- tmp
  }
  if (vec) result <- result[, 1, drop=TRUE] #rownames become names
  result
} #hapdos2hapcomb

#'@title convert haplotype combinations to haplotype dosages
#'@description converts matrices that contain haplotype combinations in columns
#'(as the matrices in ahccompletelist or ahclist) to matrices with the haplotype
#'dosages
#'@usage hapcomb2hapdos(hapcomb, nhap)
#'@param hapcomb matrix with <ploidy> rows and any number of columns. Each column
#'contains a set of haplotypes (numbers in 1 ... nhap). NAs allowed but no
#'values outside this range (not checked). E.g. a column with numbers
#'1-1-3-12 means two copies of haplotype 1 and one of haplotypes 3 and 12 each.
#'@param nhap the total number of possible haplotypes: 2^nmrk where nmrk is the
#'number of bi-allelic markers in the haploblock
#'@return matrix with nhap rows (one row for each possible haplotype) and as many
#'columns as in hapmat, giving the dosages of each haplotype in each column
#'of hapmat.
#'@examples
#'# specify haplotype combinations of 2 individuals:
#'haplocomb <- matrix(c(1,5,5,6, 5,6,6,6), ncol=2,
#'                    dimnames=list(NULL, c("indiv1", "indiv2")))
#'# convert to dosage matrix,
#'# assuming haploblock has 3 markers, so 8 possible haplotypes:
#'hapcomb2hapdos(hapcomb=haplocomb, nhap=8)
#'@export
hapcomb2hapdos <- function(hapcomb, nhap) {
  if (!is.matrix(hapcomb)) hapcomb <- matrix(hapcomb, ncol=1)
  cs <- colSums(hapcomb)
  hapdos <- matrix(NA_integer_, nrow=nhap, ncol=ncol(hapcomb),
                   dimnames=list(1:nhap, colnames(hapcomb)))
  for (col in which(!is.na(cs)))
    hapdos[, col] <- tabulate(hapcomb[, col], nbins=nhap)
  # version with apply is about 5% slower:
  # hapcombs <- as.integer(hapcomb) #prevent tabulate error if all are NA (logical)
  # hapdos <- apply(hapcombs, MARGIN=2, FUN=tabulate, nbins=nhap)
  # also this version has 0's instead of NA's in columns with missing data
  return(hapdos)
} #hapcomb2hapdos

hapindex2hapcomb <- function(hapindex, nhap, ploidy) {
  # hapindex: the index number of the desired haplotype combination (hapcomb)
  #           in a sorted sequence of all possible hapcombs at this ploidy
  #           and nhap
  # nhap: the number of possible haplotypes: 2^nmrk, with nmrk the number of
  #       markers in the haploblock
  # ploidy: the ploidy
  # return: an ordered vector of length ploidy with all haplotype numbers
  #         present, e.g. c(1, 5, 8, 8) : 1 copy of 1 and 5, 2 of 8
  # See function findcomb_wReplacement for details
  findcomb_wReplacement(n=nhap, k=ploidy, p=hapindex, check=FALSE)
}

hapcomb2hapindex <- function(hapcomb, nhap) {
  # hapcomb: an ordered vector of length ploidy with all haplotype numbers
  #          present, e.g. c(1, 5, 8, 8) : 1 copy of 1 and 5, 2 of 8
  # nhap: the total number of possible haplotypes: 2^nmrk
  # return: the index number of this hapcomb in a sorted sequence of all
  #         possible hapcombs at this ploidy and nmrk. See the
  #         findpos_wReplacement function
  findpos_wReplacement(n=nhap, comb=hapcomb, check=FALSE)
}

##!title Find a specific combination-with-replacement
##!description Find a specific combination-with-replacement assuming
##!the combinations are ordered
##!usage findcomb_wReplacement(n, k, p, check=TRUE)
##!param n a single integer > 0: the number of different object types
##!(which are numbered 1 to n) available
##!param k a single integer >= 0: the number of objects in each combination
##!param p a single integer: the 1-based index of the combination (see details)
##!param check if TRUE (default) the input data are checked against
##!incorrect length or values. If FALSE and n<=0 or k<0 or p outside the range
##!of possible combinations an incorrect value (even of incorrect length) may be
##!returned without warning.
##!details Assume that all possible combinations-with-replacement are ordered like
##!(for n=3, k=3) 111, 112, 113, 122, 123, 133, 222, 223, 233, 333 (there are
##!choose(n+k-1, k) such combinations). p indices one of these combinations
##!and this is returned as an integer vector with k elements.\cr
##!If k == 0 there is one combination: integer(0).
##!return an integer vector with the selected combination; if p does not index
##!a combination (p <= 0 or p > total number of combinations) NA is returned
##!(in that case the length of the returned vector is 1, not k)
findcomb_wReplacement <- function(n, k, p, check=TRUE) {

  findcomb_as_Delimiterpos <- function(n, k, p) {
    # n = nr of object types (colors, letters etc)
    # k = number of objects (balls) to select
    # p = 1-based index of target combination
    # return = positions of delimiters at index p
    p <- p - 1
    if (n <= 1 || k < 1) return(integer(0))
    if (n == 2) return(k+1-p)
    nd <- n-1 #nr of delimiters: 1 - nr of colors
    pos <- seq(n+k-nd, n+k-1) #original positions of delimiters, all at right
    for (j in 1:(nd-1)) {
      s <- 0 #cumulative nr of accounted-for combinations with this delimiter
      while (TRUE) {
        N <- choose(nd+k-pos[j], nd-j)
        if (s + N <= p) {
          pos[j] <- pos[j] - 1
          s <- s + N
        } else break
      }
      p <- p - s
    }
    #last delimiter:
    pos[nd] <- pos[nd] - p
    pos
  } #findcomb_as_Delimiterpos

  delimiterpos2comb <- function(pos, k) {
    # pos = vector of delimiter positions
    # k = number of objects (balls) to select
    # return = selected balls (ordered by increasing number, 1-based)
    if (length(pos) == 0) return(rep(1, k))
    pos <- pos - seq_along(pos)
    result <- integer(k)
    last <- 0
    for (j in 1:length(pos)) if (pos[j] > last) {
      result[(last+1):pos[j]] <- j
      last <- pos[j]
    }
    if (k > last) result[(last+1):k] <- length(pos) + 1
    result
  } # delimiterpos2comb

  if (check) {
    n <- as.integer(n); k <- as.integer(k); p <- as.integer(p)
    if (length(n) != 1 || n <= 0) stop ("n must be a single integer > 0")
    if (length(k) != 1 || n < 0) stop ("k must be a single integer >= 0")
    if (length(p) != 1) stop("p must be a single integer")
    combcount <- choose(n+k-1, k)
    if (p <= 0 || p > combcount) return(NA)
  }
  if (k == 0) {
    if (p == 1) return(integer(0)) else return(NA)
  }
  delimpos <- findcomb_as_Delimiterpos(n, k, p)
  delimiterpos2comb(delimpos, k)
} # findcomb_wReplacement

# don't export this function
#_#'@title Find the index number of a combination-with-replacement
#_#'@description Find the index number of a combination-with-replacement assuming
#_#'the combinations are ordered
#_#'@usage findpos_wReplacement(n, comb, check=TRUE)
#_#'@param n a single integer > 0: the number of different object types
#_#'(which are numbered 1 to n) available
#_#'@param comb an integer vector representing the combination of objects;
#_#'sorted from small to large; all items must be in 1:n
#_#'@param check if TRUE (default) the input data are checked against
#_#'incorrect length or values and an eror is generated if needed; also
#_#'comb is sorted first. If FALSE and n<=0 or comb not sorted correctly
#_#'or any elements of comb not in 1:n an incorrect value
#_#'may be returned without warning.
#_#'@details Assume that all possible combinations-with-replacement are ordered like
#_#'(for n=3, k=3) 111, 112, 113, 122, 123, 133, 222, 223, 233, 333 (there are
#_#'choose(n+k-1, k) such combinations). This function returns the index number
#_#'of a given combination in this sequence.\cr
#_#'If comb == integer(0), i.e. an empty combination, the index number is 1
#_#'as there is just one way of selecting 0 elements from any collection.\cr
#_#'This function relies on integer accuracy. Above 2^31-1 (2.1e9) R uses doubles
#_#'instead of integers and at some point exact
#_#'accuracy is lost. This can be overcome by using integer64 from package bit64
#_#'for the intermediate (ix) and returned results (and an integer64 - adapted
#_#'version of choose) - the maximum value is then 2^63-1 or 9.2e18.
#_#'@return an integer: the position (index number) of the given combination,
#_#'1-based.
findpos_wReplacement <- function(n, comb, check=TRUE) {
  #this function is used to assign an ID (index nr) to haplotype combinations
  #for easy comparing and countig in solveOneBlock
  if (check) {
    if (length(n) != 1 || n <= 0) stop("n must be a single integer > 0")
    if (anyNA(comb) || any(!(comb %in% 1:n)))
      stop("all elements in comb must be in 1:n")
    comb <- sort(comb)
  }
  k <- length(comb)
  if (k == 0) return(1)
  if (k == 1) return(comb)
  ix <- 1
  for (p in seq_len(length(comb)-1)) {
    #p is the current position within the (ordered) combination
    #find the index ix of the first combination with comb[p] at position p:
    #preceding colors at pos p to calculate nr of combs for:
    if (p == 1) start <- 1 else start <- comb[p-1]
    end <- comb[p] - 1
    x <- start # x loops through the preceding colors at position p
    while (x <= end) {
      tmpn <- n - x + 1
      tmpk <- k - p
      ix <- ix + choose(tmpn + tmpk - 1, tmpk)
      x <- x + 1
    }
  }
  # last position:
  ix <- ix + comb[length(comb)] - comb[length(comb)-1]
  ix
} #findpos_wReplacement

mrkdidfun <- function(dos, ploidy) {
  # for use within mrkdos2mrkdid and getAllHapcomb
  # dos: vector with marker dosages (length is nr of markers in the haploblock)
  # Fun fact: by using ploidy==1 and only dosages 0 and 1, this converts
  # a haplotype to its hapnr
  # return: the mrkdid (marker dosage combination ID)
  id <- 0
  fac <- 1
  for (i in length(dos):1) {
    id <- id + fac * dos[i]
    fac <- fac * (ploidy + 1)
  }
  id +1
} #mrkdidfun

#'@title get marker dosage IDs from marker dosages
#'@description get marker dosage IDs (mrkdid) from marker dosages
#'@usage mrkdos2mrkdid(mrkDosage, indiv=NULL, ploidy, check=TRUE)
#'@param mrkDosage matrix or data.frame. Markers are in rows, individuals in
#'columns, each cell has a marker dosage. Names of individuals are the column
#'names, marker names are the row names or (if a data.frame) in a column named
#'MarkerNames. All marker dosages must be in 0:ploidy or NA. If a data.frame,
#'additional columns may be present.
#'@param indiv NULL (default) or a character vector with names of individuals
#'to be selected. If NULL, all columns are selected;
#'if mrkDosage is a data.frame, that is probably not what is intended.
#'@param ploidy all marker dosages are checked to be in 0:ploidy or NA
#'@param check if TRUE (default) checkmrkDosage is called. If FALSE it is
#'assumed that mrkDosage is a matrix (not a data.frame) and it is not checked.
#'@details with ploidy==1 and (of course) all dosages 0 or 1 this function
#'returns the haplotype numbers for the haplotype specified by each column
#'@return a vector of marker dosage IDs, one for each column of mrkDosage:
#'each a number in 1:((ploidy+1)^nrow(mrkDosage)), NA for each column in
#'dosages where any of the dosages are NA
#'@examples
#'# dosages of 3 markers in 3 tetraploid individuals:
#'mrkdosg <-
#'  matrix(c(1,2,2, 4,0,0, 3,0,2), nrow=3,
#'         dimnames=list(c("mrkA", "mrkB", "mrkC"), c("indiv1", "indiv2", "indiv3")))
#'# get the "marker dosage IDs":
#'dids <- mrkdos2mrkdid(mrkDosage=mrkdosg, ploidy=4)
#'# convert dids back to marker dosages:
#'mrkdid2mrkdos(dosageIDs=dids, nmrk=3, ploidy=4, mrknames=c("mrkA", "mrkB", "mrkC"))
#'@export
mrkdos2mrkdid <- function(mrkDosage, indiv=NULL, ploidy, check=TRUE) {
  if (check) {
    mrkDosage <- checkmrkDosage(mrkDosage=mrkDosage, ploidy=ploidy,
                                indiv=indiv)
    if (is.character(mrkDosage)) stop(mrkDosage)
  }
  apply(mrkDosage, MARGIN=2, FUN=mrkdidfun, ploidy=ploidy)
} #mrkdos2mrkdid

#'@title get the marker dosages from mrkdids (marker dosage IDs)
#'@description get the marker dosages from mrkdids (marker dosage IDs)
#'@usage mrkdid2mrkdos(dosageIDs, nmrk, ploidy, mrknames=NULL)
#'@param dosageIDs vector of marker-dosage-combination IDs (mrkdid)
#'@param nmrk character vector of marker names in the haploblock
#'@param ploidy the ploidy level, a single positive integer
#'@param mrknames a vector of nmrk marker names (default NULL): if not NULL
#'these are used as rownames of the returned matrix
#'@return a matrix with in columns the marker dosages corresponding to the
#'marker dosageIDs, with these mrkdids as colnames, and one row per marker,
#'with marker names as rownames if mrknames are specified
#'
#'@examples
#'# dosages of 3 markers in 3 tetraploid individuals:
#'mrkdosg <-
#'  matrix(c(1,2,2, 4,0,0, 3,0,2), nrow=3,
#'         dimnames=list(c("mrkA", "mrkB", "mrkC"), c("indiv1", "indiv2", "indiv3")))
#'# get the "marker dosage IDs":
#'dids <- mrkdos2mrkdid(mrkDosage=mrkdosg, ploidy=4)
#'# convert dids back to marker dosages:
#'mrkdid2mrkdos(dosageIDs=dids, nmrk=3, ploidy=4, mrknames=c("mrkA", "mrkB", "mrkC"))
#'@export
mrkdid2mrkdos <- function(dosageIDs, nmrk, ploidy, mrknames=NULL) {
  mrkdos <- matrix(NA_integer_, nrow=nmrk, ncol=length(dosageIDs))
  if (!is.null(mrknames)) {
    if (length(mrknames) != nmrk) stop ("nmrk and mrknames don't match")
    rownames(mrkdos) <- mrknames
  }
  colnames(mrkdos) <- dosageIDs # no problem with NA
  if (is.character(dosageIDs)) dosageIDs <- as.integer(dosageIDs)
  uID <- unique(dosageIDs)
  pp1 <- ploidy + 1
  for (i in seq_along(uID)) {
    if (!is.na(uID[i]) && uID[i] >= 1 && uID[i] <= pp1^nmrk) {
      mdos <- integer(nmrk)
      di <- uID[i] - 1
      fac <- as.integer(round(pp1 ^ (nmrk-1), 0))
      m <- 1
      while (m <= nmrk) {
        mdos[m] <- di %/% fac
        di <- di %% fac
        fac <- fac %/% pp1
        m <- m + 1
      }
      mrkdos[,colnames(mrkdos)==uID[i]] <- mdos
    }
  }
  mrkdos
} #mrkdid2mrkdos

##!title get the min and max dosage of each haplotype over all
##!haplotype combinations
##!description get the min and max dosage of each haplotype over all
##!haplotype combinations
##!usage haplofrqMinMax(hapdos)
##!param hapdos a matrix with one row for each haplotype and one column per
##!combination of haplotypes (each column summing to ploidy);
##!a vector is interpreted as a one-column matrix
##!return a matrix with haplotypes in rows, and columns min and max with the
##!minimum and maximum times each haplotype occurs over all different
##!haplotype combinations
haplofrqMinMax <- function(hapdos) {
  if (is.vector(hapdos)) hapdos <- as.matrix(hapdos)
  minmax <- matrix(NA_integer_, nrow=nrow(hapdos), ncol=2)
  colnames(minmax) <- c("min", "max")
  minmax[,1] <- apply(hapdos, 1, min)
  minmax[,2] <- apply(hapdos, 1, max)
  minmax
} #haplofrqMinMax

#'@title pad an integer (prefix with zeroes to a fixed length)
#'@description pad an integer (prefix with zeroes to a fixed length)
#'@usage padded(x, maxx=0)
#'@param x vector of non-negative integers
#'@param maxx a single integer to whose nchar all x elements will be padded;
#'if 0 (default) the largest value in x will be used
#'@return a character vector representing the values of x left-padded with 0's
#'to the length of integer maxx or of max(x)
#'@examples
#'padded(c(21, 1, 121, NA, 0))
#'padded(c(21, 1, 121, NA, 0), maxx=1000)
#'@export
padded <- function(x, maxx=0) {
  formatC(x, width=nchar(max(x, maxx, na.rm=TRUE)), flag="0")
} #padded

getHaplotypeNames <- function(haploblock, hapcount, sep="_") {
  # haploblock: the name of the haploblock
  # hapcount: the number of possible haplotypes for this haploblock (= the
  #           number of the last haplotype)
  # result: a character vector with the haplotype names: all of the same
  #         length, with the appended haplotype number padded with zeroes
  paste0(haploblock, sep, padded(1:hapcount))
}

##!title Assign haplotype combinations based on output of inferHaps_noFS
##!description Assign haplotype combinations based on output of inferHaps_noFS
##!usage hapcomb_from_IHlist(ihl, mrkdids, ploidy, nocombs=FALSE)
##!param haploblockname Prefix to which the (zero-padded) haplotype numbers
##!are appended with separator '_'.
##!param ihl A list as returned by inferHaps_noFS
##!param mrkdids the mrkdids of all individuals
##!param ahcinfo a list as returned by loadAllHapcombLists
##!param nocombs if FALSE (default) the returned list does not have the matrix
##!hapcombs, only the vector ahccols
##!details The results of inferHaps_noFS, as passed in parameter ihl, determine
##!the haplotype combinations that match each mrkdid. This function translates
##!these results in a set of haplotype combinations for each individual.
##!return a list with
##!2 element: $hapcombs is the matrix as before, $ahccols is an integer vector
##!with for each indiv the column number in its mrkdid element of
##!ahc(complete)list (or NA if more than 1 ahccol possible).
hapcomb_from_IHlist <- function(ihl, mrkdids, ahcinfo, nocombs=FALSE) {
  #haplotypeNames <- getHaplotypeNames(haploblockname, nrow(ihl$allhap))
  nind <- sum(ihl$mrkdidsTable)
  hapcombs <- matrix(NA_integer_, nrow=ahcinfo$ploidy, #length(haplotypeNames),
                   ncol=length(mrkdids),
                   dimnames=list(NULL, names(mrkdids)))
  # we also return for each indiv the column nr in ahc(complete)list
  # for its mrkdid
  ahccols <- rep(NA_integer_, ncol(hapcombs))
  names(ahccols) <- colnames(hapcombs)
  for (i in seq_along(ihl$mrkdidsTable)) {
    did <- as.integer(names(ihl$mrkdidsTable)[i])
    mrkdidind <- which(mrkdids == did)
    #if (ncol(ihl$hclist[[i]]) == 1) {
    if (length(ihl$hclist[[i]]) == 1) {
      #we don't assign a haplotype combination if there is more than
      #one combination possible
      ahccols[mrkdidind] <- ihl$hclist[[i]]
      if (!nocombs) hapcombs[, mrkdidind] <-
         getAllHapcomb(mrkdid=did, nmrk=ncol(ihl$allhap),
                       ahcinfo=ahcinfo)[,ihl$hclist[[i]]]
    }
  }
  res <- list(ahccols=ahccols)
  if (!nocombs) res$hapcombs <- hapcombs
  res
} #hapcomb_from_IHlist

#'@title check a marker dosages matrix or data.frame
#'@description check a marker dosages matrix or data.frame, select columns and rows,
#'convert to matrix
#'@usage checkmrkDosage(mrkDosage, ploidy, indiv=NULL, markers=NULL,
#'generateMarkernames=TRUE)
#'@param mrkDosage matrix or data.frame. Markers are in rows, individuals in
#'columns, each cell has a marker dosage. If mrkDosage is a matrix,
#'the colnames are the individual names and the rownames are the marker names.
#'If mrkDosage is a data.frame and the name of the first column starts with
#'"marker" (upper/lowercase not relevant) the contents of this column are used
#'as marker names, else the rownames are used as marker names. The (other)
#'column names are the individual names.
#'All marker dosages must be in 0:ploidy or NA.
#'@param ploidy an integer value; all dosages are checked to be in 0:ploidy or NA
#'@param indiv NULL (default) or a character vector with names of individuals
#'to be selected. If NULL, all individuals are selected
#'@param markers NULL (default) or a character vector with names of markers
#'to be selected. If NULL, all markers are selected.
#'@param generateMarkernames if TRUE (default) and mrkDosage has no markernames
#'specified, markernames are generated automatically. Markernames
#'must be either present or be generated, and none may be missing.
#'@details This function is called by inferHaplotypes and by mergeReplicates, so
#'normally there is no need for the user to call this function directly.
#'@return a matrix with the selected columns in the order of indiv and the
#'selected rows in order of markers (or all columns or rows in the original
#'order if indiv or markers are not specified), with
#'names of individuals as column names, marker names as row names
#'@examples
#'mrkdos <- data.frame(
#'  marker=paste0("mrk", 1:3),
#'  indiv1=c(1, 2, 2),
#'  indiv2=c(4, 0, 0),
#'  indiv3=c(3, 0, 2))
#'# use all rows and columns:
#'checkmrkDosage(mrkDosage=mrkdos, ploidy=4)
#'# use only first and last row and column and change the order:
#'checkmrkDosage(mrkDosage=mrkdos, ploidy=4, indiv=c("indiv3", "indiv1"),
#'               markers=c("mrk3", "mrk1"))
#'@export
checkmrkDosage <- function(mrkDosage, ploidy, indiv=NULL, markers=NULL,
                           generateMarkernames=TRUE) {

  checkvector <- function(v, vname) {
    # v is not NULL
    if (anyNA(v)) stop(paste(vname, "may not contain missing values"))
    if (is.factor(v)) v <- as.character(v)
    if (is.character(v)) {
      v <- gsub("[[:blank:]]", "", v)
      if (any(v == "")) stop(paste(vname, "may not contain blank names"))
    }
    v
  }

  if (!is.null(indiv)) indiv <- checkvector(indiv, "indiv")
  if (!is.null(markers)) markers <- checkvector(markers, "markers")
  if (is.vector(mrkDosage)) {
    mrkDosage <- as.matrix(mrkDosage)
    #             1 col; names -> rownames
    colnames(mrkDosage) <- "1" #name of one individual
  }
  if (nrow(mrkDosage) == 0 || ncol(mrkDosage) == 0)
    stop("No data in mrkDosage")
  if (is.data.frame(mrkDosage)) {
    mrkcol <- which(tolower(substr(names(mrkDosage), 1, 6)) == "marker")
    if (length(mrkcol) == 1 && mrkcol==1) {
      rownames(mrkDosage) <- checkvector(mrkDosage[, 1], "markernames column")
      mrkDosage <- mrkDosage[, -1]
    }
    if (is.null(names(mrkDosage))) {
      names(mrkDosage) <- seq_len(length(mrkDosage))
    } else {
      names(mrkDosage) <- checkvector(names(mrkDosage), "column names of mrkDosage")
      if (anyDuplicated(names(mrkDosage)))
      stop("mrkDosage contains duplicated individual names")
    }
    if (is.null(indiv)) indiv <- names(mrkDosage)
    indcol <- match(indiv, names(mrkDosage))
    if (anyNA(indcol)) stop("Not all indiv occur in mrkDosage")
    isnum <- sapply(mrkDosage[,indiv], is.numeric)
    if (!all(isnum))
      stop("Not all selected columns of mrkDosage are numeric")
    mrkDosage <- as.matrix(mrkDosage[,indcol])
    #             rownames (markernames) and colnames (indiv) are retained
  }
  if (!is.matrix(mrkDosage))
    stop("mrkDosage is not a data.frame or matrix")
  # mrkDosage is (now) a numeric matrix
  # we will check it, possibly duplicating some earlier checks if mrkDosage
  # originally was a data.frame
  if (is.null(colnames(mrkDosage))) {
    colnames(mrkDosage) <- seq_len(ncol(mrkDosage))
  } else {
    colnames(mrkDosage) <-
      checkvector(colnames(mrkDosage), "column names of mrkDosage")
    if (anyDuplicated(colnames(mrkDosage)))
      stop("mrkDosage contains duplicated individual names")
  }
  if (is.null(indiv)) indiv <- colnames(mrkDosage)
  indcol <- match(indiv, colnames(mrkDosage))
  if (anyNA(indcol)) stop("Not all indiv occur in mrkDosage")
  mrkDosage <- mrkDosage[, indcol, drop=FALSE]
  if (is.null(rownames(mrkDosage))) {
    if (generateMarkernames) {
      rownames(mrkDosage) <- paste0("mrk", padded(seq_len(nrow(mrkDosage))))
    } else stop("no marker names specified")
  }
  rownames(mrkDosage) <- checkvector(rownames(mrkDosage), "marker names")
  if (is.null(markers)) markers <- rownames(mrkDosage)
  mrkrow <- match(markers, rownames(mrkDosage))
  if (anyNA(mrkrow)) stop("Not all markers occur in mrkDosage")
  mrkDosage <- mrkDosage[mrkrow,, drop=FALSE]
  if (!missing(ploidy) && !all(mrkDosage %in% c(NA, 0:ploidy)))
    stop("mrkDosage contains values not in 0:ploidy")
  mrkDosage
} #checkmrkDosage


single_cycle_infer <- function(allhap, mrkdidsTable, ahcinfo, knownHap,
                               useKnownDosage, progress) {
  #function used only by inferHaps_noFS
  #For each mrkdid, select the hapcombs that require the least new haplotypes
  #(not present in knownHap); if useKnownDosages is TRUE, among these select the
  #hapcombs with the maximum dosage of known haplotypes.
  #return: a list with 3 elements:
  # - nPresent: a vector with for each possible haplotype the minimum number of
  #             individuals in which it is inferred to be present (based on the
  #             selected hapcombs for each mrkdid)
  # - nAbsent: a vector with for each possible haplotype the minimum number of
  #            individuals in which it is inferred to be absent (based on the
  #            selected hapcombs for each mrkdid)
  # - hclist: a list with for each mrkdid the selected hapcombs as column
  #           numbers in ahc(complete)list
  abpr <- array(NA_integer_, dim=c(length(mrkdidsTable), 2, nrow(allhap)),
                dimnames=list(names(mrkdidsTable), c("nabs", "npres"),
                              1:nrow(allhap)))
  #       abpr stands for absent-present; contains for each mrkdid the min and
  #       max dosages of each haplotype, taken over all (remaining) haplotype
  #       dosage combinations
  hclist <- vector(length=length(mrkdidsTable), mode="list")
  for (i in seq_along(mrkdidsTable)) {
    #   for the current mrkdid, get the matrix with all possible
    #   haplotype combinations (different combinations in columns):
    ahc <- getAllHapcomb(mrkdid=names(mrkdidsTable)[i],
                        nmrk=ncol(allhap), ahcinfo=ahcinfo)
    ahccols <- 1:ncol(ahc)
    if (length(knownHap) > 0) {
      # Based on comparisons of 20171023, testscheme 1 is clearly worse, so:
      testscheme <- 2 # for now
      if (testscheme==1) {
        # versions before 20171006:
        # select the haplotype combination(s) with the highest total dosage
        # of known haplotypes:
        knowndos <- colSums(ahc[knownHap,, drop=FALSE])
        ahccols <- which(knowndos == max(knowndos))
        #ahc <- ahc[, knowndos==max(knowndos), drop=FALSE]
        ahc <- ahc[, ahccols, drop=FALSE]
      } else {
        #testscheme==2: version 20171006 (minimize the nr of haplotypes,
        #instead of maximize the dosage of known haplotypes, and if
        #useKnownDosage is TRUE, maximize the dosage of known haplotypes
        #among the selected hapcombs)
        ahc_new <- ahc
        ahc_new[ahc_new %in% knownHap] <- NA
        fn <- function(x) {length(unique(x[!is.na(x)]))}
        newcount <- apply(ahc_new, MARGIN=2, FUN=fn)
        ahccols <- which(newcount == min(newcount)) #least nr of new haplotypes
        if (useKnownDosage & length(ahccols) > 1) {
          subahc <- ahc_new[, ahccols, drop=FALSE]
          knowndos <- colSums(apply(subahc, MARGIN=2, FUN=is.na))
          ahccols <- ahccols[knowndos == max(knowndos)] #max dosage of known haplotypes
        }
        ahc <- ahc[, ahccols, drop=FALSE]
      }
      # (a different approach could be: for each possible set of haplotypes in
      # addition to the known ones, from small to large, see how many
      # individuals have one or more solutions, and stop adding haplotypes at
      # some point (when extra haplotypes would probably only be used to
      # fit errors in the SNP dosages);
      # but then, no selection is made between solutions for a specific did)
    }
    hclist[[i]] <- ahccols #was: ahc; now a vector of column nrs rather than the entire matrix
    mima <- haplofrqMinMax(hapdos=hapcomb2hapdos(ahc, nhap=nrow(allhap)))
    abpr[i,1,] <- mrkdidsTable[i] * (mima[,2] == 0) #nr of ind with this mrkdid
    #             that cannot have the haplotypes
    abpr[i,2,] <- mrkdidsTable[i] * (mima[,1] > 0) # nr of ind with this mrkdid
    #             that must have the haplotypes
  } # for i in mrkdidsTable
  nPresent <- apply(abpr[,2,,drop=FALSE], MARGIN=3, FUN=sum)
  #           vector with the nr of indiv in which each haplotype must be present
  nAbsent <- apply(abpr[,1,,drop=FALSE], MARGIN=3, FUN=sum)
  #           vector with the nr of indiv in which each haplotype must be absent
  names(hclist) <- names(mrkdidsTable)
  list(nPresent=nPresent,
       nAbsent=nAbsent,
       hclist=hclist)
} #single_cycle_infer


##!title infer haplotypes from marker dosages
##!@description infer haplotypes from marker dosages for a single haploblock,
##!treating all individuals as unrelated
##!usage inferHaps_noFS(mrkdids, ahcinfo, markers,
##minfrac=c(0.1, 0.01), knownHap=integer(0), useKnownDosage=TRUE, progress=TRUE)
##!param mrkdids a named integer vector with for each indiv its mrkdid; the
##names are the individual names.
##!param ahcinfo a list produced by loadAllHapcombFiles
##!param markers a character vector with the names of the markers in the
##haploblock, in the same order as used for calculating the mrkdids
##!param minfrac a vector of 1 or 2 fractions, the second smaller than the
##!first. A haplotype is considered to be certainly present if it must occur
##!in at least a fraction minfrac[1] of all individuals; default 0.1. For the
##!meaning of the optional second value in minfrac see Details.
##!param knownHap selected haplotypes (haplotypes that must be present according
##!to prior inference or knowledge, numbers refer to rows of matrix produced by
##!allHaplotypes); default integer(0), i.e. no known haplotypes.
##!param useKnownDosage a single logical, intended for internal use only.
##!; if TRUE (default), haplotype combinations for a mrkdid are selected not
##!only based on the smallest number of new haplotypes but also on the highest
##!total dosage of known haplotypes
##!param progress if TRUE (default), and new haplotype combinations need to be
##!calculated, and the number of markers and the ploidy are both >= 6, progress
##!is indicated by printed messages
##!details This function is for internal use; it is called only from
##!inherHaplotypes. It performs haplotype inference ignoring the presence of
##!FS families.
##The returned list contains some general calculations and statistics
##!based on mrkdids and ploidy, and some parts that are the result of the
##!inference. The primary of these is hclist: this contains for each mrkdid in
##!the population the most likely combination of haplotypes (sometimes
##!more than one, if several are equally likely = all have the maximum dosage
##!of haplotypes inferred to be present).\cr
##!Principle: first the haplotypes are derived that must be present in at least
##!a fraction minfrac of the individuals (because those haplotypes occur in all
##!possible haplotype configurations that result in the observed marker dosages).
##!In subsequent iterations, for individuals not completely explained by these
##!known haplotypes we select the hapcombs that require the smallest number of
##!extra haplotypes and we determine which haplotypes are required in all of
##!these solutions. For each haplotype we count in how many individuals it is
##!required, and all haplotypes that are required in a minimum fraction of
##individuals (minfrac), together with any haplotypes known from other
##!information (knownHap) become the new set of known haplotypes. We repeat this
##!step until the set of known haplotypes does not change any more.\cr
##!During this process there may be additional haplotypes needed for some
##!mrkdids, that are not frequent enough to reach the minfrac[1] criterion. If
##!there is a second (smaller) fraction in minfrac, in a final round we also consider
##!those extra haplotypes as must-be-present, provided they occur in at least
##!a fraction minfrac[2] of all individuals (minimum 2 individuals). If that
##!leads to more mrkdids having a unique haplotype combination AND not having
##!other mrkdids getting more combination the new assignments are used.\cr
##!If knownHap in the function call already contains haplotypes, these are
##!considered to be certainly present, even if they don't meet the minfrac
##!criterion. This may help to reduce the number of haplotype configurations
##!for some individuals, but may also increase them.
##!return a list with elements:
##!\itemize{
##!\item{nPresent: a vector with for each haplotype the number of
##!individuals in which it must occur}
##!\item{nAbsent: a vector with for each haplotype the number of
##!individuals in which it cannot occur}
##!\item{minfrac: same as parameter minfrac}
##!\item{allhap: a matrix as returned by allHaplotypes}
##!\item{mrkDosage: a  matrix with marker dosages for all individuals in indiv
##!(produced by calling checkmrkDosage on mrkDosage)}
##!\item{mrkdids: a vector of the mrkdid (marker dosage ID) for each individual in
##!mrkDosage (each combination of marker dosages has its own ID; if any of the
##!markers has an NA dosage the corresponding mrkdid is also NA)}
##!\item{mrkdidsTable: a table of the counts of all non-NA mrkdids in the
##!population}
##!\item{hclist: a list in which each element has the name of one of the mrkdids
##!in mrkdidsTable, and contains a matrix with <ploidy> rows and one column
##!for each of the remaining haplotype combinations: (a subset of the columns
##!of) the matrix returned by getAllHapcomb for that mrkdid}
##!}
inferHaps_noFS <- function(mrkdids, ahcinfo, markers,
                            minfrac=c(0.1, 0.01),
                            knownHap=integer(0), useKnownDosage=TRUE,
                            progress=TRUE) {
  if (anyNA(knownHap) || !all(knownHap %in% 1:(2^length(markers))) ||
      anyDuplicated(knownHap)) stop("invalid knownHap")
  if (useKnownDosage) useKnownDosage <- c(FALSE, TRUE) #else just FALSE
  origHap <- sort(knownHap)
  mrkdidsTable <- table(mrkdids)
  allhap <- allHaplotypes(markers)
  # haplocombs are already calculated in inferHaplotypes
  nind <- sum(mrkdidsTable) #includes only individuals with no missing marker data
  prev_knownHap <- list() #stores the knownHap after each cycle
  knownHap_ukd <- list(NULL)
  cycle <- 0
  for (ukd in seq_along(useKnownDosage)) {
    while (TRUE) {
      cycle <- cycle + 1
      res <- single_cycle_infer(allhap=allhap, mrkdidsTable=mrkdidsTable,
                                ahcinfo=ahcinfo, knownHap=knownHap,
                                useKnownDosage=useKnownDosage[ukd],
                                progress=progress)
      knownNew <- sort(union(origHap, which(res$nPresent >= minfrac[1] * nind)))
      haploLost <- setdiff(knownHap, knownNew)
      if (length(haploLost) > 0) {
        #The initial expectation was that within these cycles known haplotypes can
        #only be gained, not lost (except the origHap which need not be supported
        #by the current data, but these have been added back to knownNew).
        #Therefore the following warning was generated if haplotypes were lost:
        #warning(paste0("inferHaps_noFS: lost in cycle ", cycle, ": ",
        #              paste(haploLost, collapse=" ")))
        #However it seems that it is possible to lose previous must-have
        #haplotypes, although this does not happen often.
        #We must prevent infinite looping, so we store the knownHap and
        #if the situation occurs again with the same knownHap we break with
        #the original knownHap and the result of one single_cycle_infer with that
        #knownHap
        if (length(prev_knownHap) == 0 ||
            !any(sapply(prev_knownHap, identical, knownHap))) {
          prev_knownHap[[length(prev_knownHap) + 1]] <- knownHap
        } else {
          #recalculate the result of the first cycle - very fast and avoids
          #having to store it:
          res <- single_cycle_infer(allhap=allhap, mrkdidsTable=mrkdidsTable,
                                    ahcinfo=ahcinfo, knownHap=origHap,
                                    useKnownDosage=useKnownDosage[ukd],
                                    progress=FALSE)

          knownHap <- origHap
          break #break out of while loop
        }
      }
      if (identical(knownNew, knownHap)) {
        knownHap_ukd[[ukd]] <- knownHap
        break
      }
      knownHap <- knownNew
    } #while TRUE
  } #for ukd
  # if useKnownDosage has length 2 (FALSE, TRUE) and we had convergence after
  # the first ukd but not the second we should fall back on the first:
  if (length(useKnownDosage) == 2 && length(knownHap_ukd) == 1 &&
      !is.null(knownHap_ukd[[1]])) {
    res <- single_cycle_infer(allhap=allhap, mrkdidsTable=mrkdidsTable,
                              ahcinfo=ahcinfo, knownHap=knownHap_ukd[[1]],
                              useKnownDosage=FALSE,
                              progress=FALSE)
    knownHap <- sort(union(origHap, which(res$nPresent >= minfrac[1] * nind)))
  }
  #Now we have haplotype combinations selected for all mrkdids,
  #stored in res$hclist, based on the final knownHap set of
  #haplotypes that must be present.
  #The selected hapcombs for some mrkdids may include additional haplotypes
  #that did not make it into knownHap because they did not reach the minfrac
  #criterion.
  #We may now consider also those extra haplotypes as must-be-present
  #(or some of them, above a certain threshold: occurring in at least 2 and
  #at least 1% of the individuals) and see if that leads to more mrkdids
  #having a unique hapcomb (and not having other mrkdids getting additional
  #hapcombs)
  if (length(minfrac) > 1 && (minfrac[2] < minfrac[1])) {
    mrkdidHaccount <- vapply(res$hclist[names(mrkdidsTable)],
                             FUN=length, FUN.VALUE=0)
    onecomb <- mrkdidHaccount == 1
    if (any(!onecomb) && any(onecomb)) {
      # any(!onecomb): else all mrkdid have already one hapcomb, no need
      # for further selection
      # any(onecomb): else there are no mrkdid to generate additional knownHap
      hac1comb <- unlist(res$hclist[names(mrkdidsTable)[onecomb]])
      names(hac1comb) <- names(mrkdidsTable)[onecomb]
      #           a vector with the mrkdids as names and the ahc column nr as values
      #           for the mrkdids with just 1 hapcomb
      hapcomb <- matrix(NA_integer_, nrow=ahcinfo$ploidy, ncol=length(hac1comb))
      for (d in seq_along(hac1comb)) {
        hapcomb[, d] <-
          getAllHapcomb(mrkdid=names(hac1comb)[d], nmrk=ncol(allhap),
                        ahcinfo=ahcinfo)[, hac1comb[d]]
      }
      #calculate the nr of individuals in which each haplotype occurs:
      #select the mrkdids for which only 1 haplotype combination is possible,
      #and count the nr of indiv in which they occur:
      hapfrq <- colSums(as.vector(mrkdidsTable[onecomb]) *
                          t(hapcomb2hapdos(hapcomb, nhap=nrow(allhap)) > 0))
      final_hap <-
        sort(union(origHap,
                   which (hapfrq >= 2 & hapfrq > minfrac[2]*sum(mrkdidsTable))))
      resfinal <-
        single_cycle_infer(allhap=allhap, mrkdidsTable=mrkdidsTable,
                           ahcinfo=ahcinfo, knownHap=final_hap,
                           useKnownDosage=useKnownDosage[length(useKnownDosage)],
                           progress=progress)
      mrkdidHaccountfinal <-
        vapply(resfinal$hclist[names(mrkdidsTable)], FUN=length, FUN.VALUE=0)
      #keep resfinal instead of res only if all mrkdids that had only one
      #haplotype combi still have only one (and no dids have more columns
      #than before)
      if (all(mrkdidHaccountfinal <= mrkdidHaccount)) {
        res <- resfinal
        if (FALSE) { # set to TRUE for checking this
          #print comparison:
          lesscol <- mrkdidHaccountfinal < mrkdidHaccount
          m <- matrix(NA_integer_, nrow=2, ncol=sum(lesscol))
          colnames(m) <- names(mrkdidsTable)[lesscol]
          rownames(m) <- c("before", "after")
          m[1,] <- mrkdidHaccount[lesscol]
          m[2,] <- mrkdidHaccountfinal[lesscol]
          print(m)
        }
        # perhaps now there are more mrkdids with a unique haplotype combination
        # (which is the reason for this final test). If so, these may again have
        # extra haplotypes so we could do the final step again. We will not do
        # that because these extra haplotypes are selected on a weak basis
        # (weaker than minfrac[1])
      }
    }
  }

  #finally we add some extra data to the output list:
  res$allhap <- allhap
  res$mrkdidsTable <- mrkdidsTable
  res
} #inferHaps_noFS


chisq.probtest <- function(x, p=rep(1/length(x), length(x)),
                           correct=TRUE, rescale.p = FALSE) {
  #version of the standard chisq.test function, only for given probabilities,
  #that applies the Yates continuity correction in this case if correct=TRUE.
  #The original function applies Yates only in the case of a contingency test,
  #not for given probabilities.
  DNAME <- deparse(substitute(x))
  if (!is.null(dim(x))) {
    if (length(dim(x)) == 1) x <- as.vector(x) else
      stop("'x' must be a vector")
  }
  if (any(x < 0) || anyNA(x))
    stop("all entries of 'x' must be nonnegative and finite")
  if ((n <- sum(x)) == 0)
    stop("at least one entry of 'x' must be positive")
  if (length(x) == 1L)
    stop("'x' must at least have 2 elements")
  if (length(x) != length(p))
    stop("'x' and 'p' must have the same number of elements")
  if (any(p < 0))
    stop("probabilities must be non-negative.")
  if (abs(sum(p) - 1) > sqrt(.Machine$double.eps)) {
    if (rescale.p)
      p <- p/sum(p)
    else stop("probabilities must sum to 1.")
  }
  METHOD <- "Chi-squared test for given probabilities"
  E <- n * p
  if (correct) {
    YATES <- min(0.5, abs(x - E))
    if (YATES > 0)
      METHOD <- paste(METHOD, "with Yates' continuity correction")
  } else YATES <- 0
  V <- n * p * (1 - p)
  STATISTIC <- sum((abs(x - E) - YATES)^2/E) #old: sum((x - E)^2/E)
  names(E) <- names(x)
  PARAMETER <- length(x) - 1
  PVAL <- pchisq(STATISTIC, PARAMETER, lower.tail = FALSE)

  names(STATISTIC) <- "X-squared"
  names(PARAMETER) <- "df"
  if (any(E < 5) && is.finite(PARAMETER))
    warning("Chi-squared approximation may be incorrect")
  structure(list(statistic = STATISTIC, parameter = PARAMETER,
                 p.value = PVAL, method = METHOD, data.name = DNAME,
                 observed = x, expected = E,
                 residuals = (x - E)/sqrt(E), stdres = (x - E)/sqrt(V)),
            class = "htest")
} #chisq.probtest

testFSseg <- function(parhac, DRrate, errfrac,
                      FSmrkdidsTable, allhap) {
  # This function performs a test of the observed segregation of mrkdids in a
  # FS family (supplied in FSmrkdidsTable) against the segregation expected
  # with parents having the haplotype combinations (given in parhac) under
  # assumed DRrate and errfrac.
  # This function does most of the work in solveOneFS::calcParcombStats().
  # parameters:
  # parhac: a matrix of 2 columns and ploidy rows with the haplotype combinations
  #         of the parents. Does not contain NAs!
  # DRrate, errfrac: the assumed frequency of DR and of SNP genotyping errors
  # FSmrkdidsTable: a table of the *observed* counts of each mrkdid in the FS
  # allhap: matrix of nmrk columns and 2^nmrk rows with all possible haplotypes
  # return value: a list with 2 components:
  # $stats: list with statistics for the given parental combination (parhac):
  #   $nhap: the number of haplotypes
  #   $P: P-value of the (chi-squared ...) test of the observed segregation
  #   $toosmall: logical vector with one value per pco, TRUE if the FS is
  #              too small and/or too fragmented to calculate a P-value
  # $didNhapcombLst: a list with the following components:
  #   $mrkdid: vector with names of expected mrkdids in the FS, in order of
  #            hapcomb (multiple hacs may have the same mrkdid!)
  #   $hapcomb: matrix of hacs expected in the FS (1 col per hac); with
  #             order of columns sorted so that all hacs with the same mrkdid
  #             are adjacent
  #   $expHCfreq: vector of the expected hac frequencies in order of hapcomb
  #   $dupdid: logical vector, TRUE where mrkdid[i] == mrkdid[i-1], else FALSE

  calcErrprob <- function(origP, ninv, invObs, FSsize, maxErrprob) {
    # Function within testFSseg
    # origP: vector, the error-free probabilities of all the non-invalid
    #        genotypes; the last value is the error-free total probability of
    #        the Invalid group (for the segregation check we may add some
    #        valid but low-probability genotypes to the Invalid group,
    #        hence it can have a non-zero probability). origP must sum to 1
    # ninv: scalar, the nr of genotypes in the Invalid group
    # invObs: the observed nr of individuals in the Invalid group
    # FSsize: the total nr of (fully genotyped) FS indiv
    # maxErrprob: the specified (maximum) probability of marker genotype
    #             errors (the probability that at least one of the markers in
    #             the haploblock is mis-genotyped)
    # Return: the inferred error probability: the errorprob that best
    #         explains the observed invObs, with the following restrictions:
    #         errprob<=maxErrprob, errprob >=
    #         the value leading to an expected invObs=1, except if that
    #         is higher than maxErrprob
    # The calculation is based on the same principles as in
    # solveOneFS::redistErr()
    if (invObs == 0) invObs <- 1
    invFrac <- invObs / FSsize
    ntot <- ninv + length(origP) - 1
    invP <- origP[length(origP)]
    errprob <- (invFrac - invP) * (ntot-1) / (ninv - ntot*invP)
    errprob <- max(errprob, 0)
    errprob <- min(errprob, maxErrprob)
    errprob
  } #calcErrprob in testFSseg

  redistErr <- function(origP, ninv, errP) {
    # Function within testFSseg
    # calculate the resulting probabilities from the original probabilities,
    # taking the error rate and the number of genotypes in the Invalid group
    # into account.
    # origP: vector, the error-free probabilities of all the non-invalid
    #        genotypes; the last value is the error-free total probability of
    #        the Invalid group (for the segregation check we may add some
    #        valid but low-probability genotypes to the Invalid group,
    #        hence it can have a non-zero probability). origP must sum to 1
    # ninv: scalar, the nr of different genotypes (*not individuals*) in
    #       the Invalid group
    # errP: the probability of genotyping errors
    # Principle: the errors are uniformly distributed over all genotypes except
    # the correct one. Depending on the number of genotypes in the Invalid group
    # more or less of the errors will go to that group.
    if (length(origP) < 2) return(origP)
    origPinv <- origP[length(origP)]
    origP <- origP[-length(origP)]
    nval <- length(origP) # nr of valid genotypes
    ntot <- ninv + nval
    Plost <- origP * errP
    PlostInv <- origPinv * errP * (nval / (ntot-1))
    Pgained <- matrix(0, nrow=nval, ncol=nval)
    # rows: valid genotypes to contribute those errors
    # cols: valid genotypes to gain from errors;
    PgainedInv <- rep(0, nval)
    for (rw in 1:nval) {
      Pgained[rw, ] <- Plost[rw] / (ntot-1)
      Pgained[rw, rw] <- 0
      PgainedInv[rw] <- Plost[rw] * ninv / (ntot-1)
    }
    c(origP - Plost + colSums(Pgained) + PlostInv/nval,
      origPinv - PlostInv + sum(PgainedInv))
  } #redistErr within testFSseg

  nmrk <- ncol(allhap)
  ploidy <- nrow(parhac)
  FScount <- sum(FSmrkdidsTable) # count of fully genotyped FS indiv
  stats <- list()
  stats$nhap <- length(unique(as.vector(parhac))) #length(table(parhac)) nr of different haplotypes
  FSseg <- getFSfreqs(parhac=parhac, DRrate=DRrate)
  #we will now get the expected freq of unique mrkDosage in the FS (note that
  #(FSseg$FShac has unique *haplotype* combinations, but that different
  #haplotype combinations may result in the same *marker dosage*
  #combinations)
  FSmrkdos <- hapcomb2mrkdos(FSseg$FShac, allhap=allhap)
  #find the order to sort the FSmrkdos columns (on all rows) = in order of mrkdid
  #(in this way columns with the same *marker dosages* are next to each other):
  o <- do.call(order, lapply(1:nrow(FSmrkdos), function(i) FSmrkdos[i,]))
  FSmrkdos <- FSmrkdos[, o, drop=FALSE]
  expFSfreq <- FSseg$freq [o]
  du <- duplicated(FSmrkdos, MARGIN=2)
  didNhapcombLst <- list(mrkdid=colnames(FSmrkdos), #expected mrkdids (may contain duplicates)
                         hapcomb=FSseg$FShac[, o, drop=FALSE], #expected hacs in same order
                         expHCfreq=expFSfreq, #expected hac freqs in same order
                         dupdid=du) #whether each hapcomb col has same did as previous
  uniq <- which(!du)
  #sum the freqs per set of duplicates:
  uniqfreq <- numeric(length(uniq)); names(uniqfreq) <- colnames(FSmrkdos)[uniq]
  #uhx <- c(uniq, length(du)+1)
  #for (u in seq_along(uniqfreq))
  #  uniqfreq[u] <- sum(expFSfreq[uhx[u]:(uhx[u+1]-1)])
  uniqfreq <- tapply(expFSfreq, as.integer(didNhapcombLst$mrkdid), FUN=sum) # expected frequencies of mrkdids in FS
  FSmrkdos <- FSmrkdos[, uniq, drop=FALSE]
  expFSmrkdids <- colnames(FSmrkdos)
  # the expected expFSmrkdids are now unique,
  # their (error-free) probabilities are in uniqfreq

  # The chisquare test of the segregation is very sensitive to mrkdids with
  # a low probability where still a plant is observed. Therefore we group
  # those mrkdids with the invalid mrkdids and test the total nr of FS indiv
  # in that group against the total expectation. So:
  # - group all the hapcombs with an expected nr of individuals < 1 together
  #   with all invalid hapcombs into one group and get its total probability
  # - adjust the probabilities of the valid mrkdids and the invalid group
  #   with the error rate.
  # - Test the observed segregation (incl the invalid group) against the
  #   expected segregation
  lowfrqMrkdids <- uniqfreq * FScount < 1
  # and those that are observed but unexpected (invalid):
  invalidMrkdids <- which(!(names(FSmrkdidsTable) %in% expFSmrkdids))
  # calculate the probabilities:
  invalidProb <- sum(uniqfreq[lowfrqMrkdids]) #the really invalid have prob=0
  validProbs <- uniqfreq[!lowfrqMrkdids]
  # get the observed frequencies in the same order:
  invalidObs <- sum(FSmrkdidsTable[invalidMrkdids]) +
    sum(FSmrkdidsTable[names(FSmrkdidsTable) %in% expFSmrkdids[lowfrqMrkdids]])
  validObs <- FSmrkdidsTable[match(expFSmrkdids[!lowfrqMrkdids],
                                   names(FSmrkdidsTable))]
  validObs[is.na(validObs)] <- 0
  # redistribute the error probability over the mrkdids:
  ninv <- (ploidy+1)^nmrk - length(validProbs) #the number of
  #       mrkdids in the Invalid group
  errP <- calcErrprob(origP=c(validProbs, invalidProb), ninv=ninv,
                      invObs=invalidObs, FSsize=FScount, maxErrprob=errfrac)
  probs <- redistErr(origP=c(validProbs, invalidProb),
                     ninv =ninv, errP = errP)
  # We always have at least one expected mrkdid (in case of
  # no segregation) and a category Invalid
  # Four situations:
  # (1) no valid mrkdid remain (small FS family, highly segregating, large
  #     errfrac, so that no mrkdid has an expectation of >= 1 FS indiv):
  #     toosmall <- TRUE
  # (2) at least one invalid FS expected (including low-freq mrkdids and the
  #     errfrac): P <- chisquare test on valid mrkdids + invalid group
  # (3) less than 1 invalid expected:
  #     calculate Pinvalid <- binomial P for invalidObs
  # (3a) only one valid mrkdid: Pseg <- 1.0
  # (3b) more than one valid mrkdid: Pseg <- chisquare test on valid mrkdids
  # (3a and 3b) P <- Pinvalid * Pseg
  if (length(probs) == 1) {
    # (1): only the invalid group, no valid mrkdids with >= 1 expected FS indiv
    stats$toosmall <- TRUE
    stats$P <- NA
  } else {
    stats$toosmall <- FALSE
    if (probs[length(probs)] * FScount >= 1) {
      # (2): invalid group large enough for inclusion in chiquared test
      stats$P <- suppressWarnings({
        #suppress warnings about approximation for small values
          chisq.probtest(c(validObs, invalidObs), p=probs)$p.value
      })
    } else {
      # (3): less than 1 "invalid" FS expected, don't include invalid group
      #      in chisquared test:
      Pinvalid <- 1 - pbinom(invalidObs, FScount, prob=probs[length(probs)])
      if (length(validObs) <= 1) {
        Pseg <- 1.0 # (3a): only one valid mrkdid
      } else {
        # (3b): more than one valid mrkdid
        suppressWarnings({
          #suppress warnings about approximation for small values
          p <- probs[-length(probs)]; p <- p / sum(p)
          Pseg <- chisq.probtest(validObs, p=p)$p.value
        })
      }
      stats$P <- Pinvalid * Pseg
    }
  }
  list(stats=stats, didNhapcombLst=didNhapcombLst)
} #testFSseg

##!title find the optimal combination of parental haplotype combinations
##!description For one FS family and one haploblock, find the optimal
##!combination of parental haplotype combinations
##!usage solveOneFS(mrkDosage, ihl, FS, P, priorPhac, hacXmat, maxparcombs,
##!ploidy, minfrac, errfrac, DRrate, minPseg)
##!param mrkDosage dosage matrix with only the markers in the current haploblock
##!param ihl a list returned by inferHaps_noFS
##!param FS character vector with sample names of the current FS
##!param P vector with 2 sample names, one for parent1 and one for parent2.
##!There should be corresponding columns in mrkDosage , and orig_ihl should be
##!based on that mrkDosage
##!param knownPhac: a vector with two numbers (or NA) indexing the column in
##ahc(complete)list for the Pdids that has already be inferred to be the solution
##for that parent
##!triedPhacs: a list of two vectors of numbers (possibly empty) with the Phacs
##!already tried with no success for firstcombs in an earlier call for this FS,
##!before the FS was skipped because of too many extracombs. Numbers are colnrs
##!in ahclist
##!param maxparcombs Parent 1 and 2 both may have multiple possible haplotype
##!combinations. For each pair of haplotype combinations (one from P1 and one
##!from P2) the expected FS segregation must be checked against the observed.
##!This may take a long time if many such combinations need to be checked.
##!This parameter sets a limit to the number of allowed combinations.
##!param ploidy all marker dosages should be in 0:ploidy or NA
##!param minfrac vector of two fractions, default 0.1 and 0.01. A haplotype is
##!considered to be certainly present if it occurs in at least a fraction
##!minfrac[1] of all individuals; in the final stage for the "other"
##!individuals (those that do not belong to the FS or its parents) this fraction
##!is lowered to minfrac[2]; see also inferHaps_noFS
##!param errfrac the assumed fraction marker genotypes with an error (over all
##!markers in the haploblock). The errors are assumed to be uniformly distributed
##!over all except the original marker dosage combinations (mrkdids)
##!param DRrate the rate of double reduction per meiosis (NOT per allele!); e.g.
##!with a DRrate of 0.04, a tetraploid parent with genotype ABCD will produce
##!a fraction of 0.04 of DR gametes AA, BB, CC and DD (each with a frequency of
##!0.01), and a fraction of 0.96 of the non-DR gametes AB, AC, AD, BC, BD, CD
##!(each with a frequency of 0.16)
##!param minPseg candidate solutions with a chisquared P-value for segregation
##!smaller than this threshold are rejected; if none of the candidates remain
##!no solution is found (and in the return list parcombok has 0 rows and
##!NoSolution is TRUE)
##!return a list with two items\cr
##! $FSdat: itself a list with items
##!   $message\cr
##!   $parcombok: a matrix with 0 or more rows and two columns: each row represents
##!             one solution, with the two numbers on that row the hapixes
##!             of the parental hacs (Phacs). parcombok has 0 rows if
##!             no solution found. NEW: stores the ahclist column numbers, not hapixes
##!   $didNhapcomb: a list with one item per selected solution; each item is itself
##!               a list with (a.o.) expected FS frequencies of mrkdids and the
##!               corresponding haplotype combinations\cr
##!   $NoSolution: TRUE or FALSE: TRUE if one or more solutions selected,
##!              FALSE if no solution attempted (one or both parent(s) or all
##!              FS indiv with missing mrk data, or no fitting
##!              solutions found)\cr
##!              If NoSolution is TRUE didNhapcomb and Pseg may not be
##!              present
##!   $Pseg: a vector of P-values, one for each row of parcombok, or one if
##!             no solutions (P of the best rejected solution)
##! and a second item:
##! $hacXmat: matrix as parameter hacXmat, possibly extended with additional hacs
##! NEW: haxXmat not used or output anymore
# solveOneFS is not exported: internal function
solveOneFS <- function(mrkdids, ihl, FS, P, knownPhac, triedPhacs, maxparcombs,
                       ahcinfo, minfrac, errfrac, DRrate, minPseg) {

  calcParcombok <- function(potPhacs, oldPhacs, allhap, Pdids, FSfrac,
                            P.reject=c(0.05, 0.001), ahcinfo) {
    # Function within solveOneFS
    # Calculates the combinations of parental haplotype combinations
    # that can explain (almost) all FS individuals
    # potPhacs: list of 2 vectors with the columns of potential haplotype
    #           combinations for parents P1 and P2 (indexing columns in the
    #           ahclist items for their mrkdids)
    # oldPhacs: list of 2 vectors with columns of parental haplotype combinations
    #           that were already calculated and whose P1/P2 combinations were
    #           no solutions (may have length 0 if none)
    # allhap:   a matrix with all possible haplotypes with the nr of markers
    #           in the currect haploblock
    # FSfrac:   the fraction of FS individuals for which it is accepted
    #           that they are not explained by a parental haplotype combination
    # Pdids:    vector of 2 integers: the parental mrkdids
    # P.reject: two P values, decending: used in a binomial test for the
    #           nr of unexplained FS indiv
    # ahcinfo: a list returned by loadAllHapcombLists
    #TODO: parallellize; see also getAllHapcomb
    # Phacs are column numbers in ahc(complete)list
    if (all(potPhacs[[1]] %in% oldPhacs[[1]]) &&
        all(potPhacs[[2]] %in% oldPhacs[[2]])) {
       #no new parental combinations, so no solutions
      parcombok <- matrix(integer(0), ncol=2)
      setaside_pcs <- parcombok
    } else {
      missedFSs <- matrix(NA_integer_,
                          nrow=length(potPhacs[[1]]), ncol=length(potPhacs[[2]]))
      # missedFSs is the number of FS indivs with a mrkdid that cannot be
      # produced with each parental combination (assuming no DR)
      for (pc1 in seq_along(potPhacs[[1]])) {
        for (pc2 in seq_along(potPhacs[[2]])) {
          isnewpc <- !(potPhacs[[1]][pc1] %in% oldPhacs[[1]] &&
                         potPhacs[[2]][pc2] %in% oldPhacs[[2]])
          if (isnewpc) {
            #this is a new combination, check:
            FScombs <- getFScombs(cbind(
              getAllHapcomb(mrkdid=Pdids[1], nmrk=ncol(allhap),
                            ahcinfo=ahcinfo)[, potPhacs[[1]][pc1]],
              getAllHapcomb(mrkdid=Pdids[2], nmrk=ncol(allhap),
                            ahcinfo=ahcinfo)[, potPhacs[[2]][pc2]]))
            FSmrkdos <- hapcomb2mrkdos(FScombs, allhap=allhap)
            FScombmrkdids <- colnames(FSmrkdos)
            missedmrkdids <- setdiff(names(FSmrkdidsTable), FScombmrkdids)
            missedFSs[pc1, pc2] <-
              sum(FSmrkdidsTable[names(FSmrkdidsTable) %in% missedmrkdids])
          } else {
            #this is an old parcomb, either already rejected or set aside.
            #in both cases we don't need to check it again, so have it discarded:
            missedFSs[pc1, pc2] <- length(FS)
          }
        }
      }
      # explanation of the two levels of P.reject: see large comment at
      # the start of the main code of solveOneFS
      FSsize <- sum(FSmrkdidsTable) #nr of non-missing FS indiv
      T.reject <- qbinom(1-P.reject[1], FSsize, FSfrac) #reject  if missedFSs >= T.reject
      if (T.reject <= 1) {
        #always allow 1 missedFS:
        T.reject <- 2
      }
      parcombok <- missedFSs < T.reject
      parcombok <- which(parcombok, arr.ind=TRUE) #now a matrix with one row for
      # each valid combination of potPhacs[[1]] and potPhacs[[2]]
      # (indexing the PotPhacs). Change to indexing in ahclist:
      for (p in 1:2) parcombok[, p] <- potPhacs[[p]][parcombok[, p]]
      # now the indices to the ahc(complete)list columns for the Pdids
      # instead of indices to the potPhacs
      colnames(parcombok) <- c("P1col", "P2col")
      # also produce data for the parcombs passing P.reject[2]
      # but not P.reject[1]:
      T.rej2 <- qbinom(1-P.reject[2], FSsize, FSfrac) #reject  if missedFSs >= T.reject
      if (T.rej2 <= 1) {
        #always allow 1 missedFS:
        T.rej2 <- 2
      }
      # get the setaside_pcs analogous to parcombok
      setaside_pcs <- missedFSs >= T.reject & missedFSs < T.rej2
      setaside_pcs <- which(setaside_pcs, arr.ind=TRUE) #indexing PotPhacs
      for (p in 1:2) setaside_pcs[, p] <- potPhacs[[p]][setaside_pcs[, p]] #indexing ahclist
      colnames(setaside_pcs) <- c("P1col", "P2col")
    }
    list(parcombok=parcombok, setaside_pcs=setaside_pcs)
  } #calcParcombok within solveOneFS

  calcParcombstats <- function(pco, parcombok, Pdids, parcombstats,
                               didNhapcombLst, allhap, ahcinfo) {
    # pco: number, parental combination (index to parcombok rows)
    # Pdids: vector of 2 integers: the parental mrkdids
    # parcombstats: list with statistics for each parental combination, see
    #               return value
    # didNhapcombLst: list as in return value
    # return value: a list with 2 components:
    # $parcombstats: list with statistics for each parental combination:
    #   $nhap: a vector with the number of haplotypes for each pco
    #   $P: vector with the P-value  for each pco
    #   $toosmall: logical vector with one value per pco, TRUE if the FS is
    #              too small and/or too fragmented to calculate a P-value
    # $didNhapcombLst: a list with one component for each pco; each of these
    #   has the following components:
    #   $mrkdid: vector with names of mrkdids in the FS, in order of hapcomb
    #            (multiple hacs may have the same mrkdid!)
    #   $hapcomb: matrix of hacs expected in the FS (1 col per hac); with
    #             order of columns sorted so that all hacs with the same mrkdid
    #              are adjacent
    #   $expHCfreq: vector of the expected hac frequencies in order of hapcomb
    #   $dupdid: logical vector, TRUE where mrkdid[i] == mrkdid[i-1], else FALSE
    parhac <- cbind(
      getAllHapcomb(mrkdid=Pdids[1], nmrk=ncol(allhap),
                    ahcinfo=ahcinfo)[, parcombok[pco, 1]],
      getAllHapcomb(mrkdid=Pdids[2], nmrk=ncol(allhap),
                    ahcinfo=ahcinfo)[, parcombok[pco, 2]])
    # parhac does not contain NAs!
    segdat <- testFSseg(parhac=parhac, DRrate=DRrate, errfrac=errfrac,
                        FSmrkdidsTable=FSmrkdidsTable, allhap=allhap)
    parcombstats$nhap[pco] <- segdat$stats$nhap
    parcombstats$toosmall[pco] <- segdat$stats$toosmall
    parcombstats$P[pco] <- segdat$stats$P
    didNhapcombLst[[pco]] <- segdat$didNhapcombLst
    list(parcombstats=parcombstats, didNhapcombLst=didNhapcombLst)
  } #calcParcombstats in solveOneFS

  findPthreshold <- function(minP, Pvalues) {
    # this function sets a minimum for the P-value to accept a parental hac
    # combination. The aim is to keep only a limited set of possibilities,
    # if there are many above minP (default 1e-8)
    # minP: the hard minimum threshold
    # Pvalues: all P values from which to select
    # version:
    # "simple": set threshold to max(P)/1000 if that is above minP, else to minP
    #           (this is the method used from version 1.0.1)
    # "orig": as in version 1.0.0, i.e. apply a threshold of 1e-4 if there are
    #        solutions above this, and else minP;
    # "gap": find the largest gap between the log10(Pvalues); if that gap is >=2
    #        select only the P-values above it (and above minP)
    version <- "simple" # for version 1.0.1; other options are orig and gap
    if (version=="simple") {
      return(max(minP, max(Pvalues/1000)))
    } else  if (version=="orig") {
      if (max(Pvalues) > 1e-4) return(1e-4) else return(minP)
    } else if (version=="gap") {
      if (length(Pvalues) == 1) return(minP)
      Pvalues <- sort(Pvalues, decreasing=TRUE)
      gaps <- diff(log10(Pvalues))
      g <- which.max(gaps)
      if (g<2 || gaps[g] < 2) return(minP) else return(max(minP, Pvalues[g]))
    } else {
      # default
      return(minP)
    }
  } # findPthreshold

  selFromParcombok <- function(parcombok, allhap, minPseg) {
    # parcombok: matrix with 2 columns (1 for each parent) and one row
    #      per parental combination; numbers index the columns of the
    #      parental mrkdid entry in ahc(complete)list
    # allhap: matrix of all possible haplotypes
    # minPseg: minimum chi-squared P-value to test the observed segregation
    # return: list of
    # $parcombstats for all parcombs in parcombok
    # $didNhapcombLst for all parcombs in parcombok
    # $sel: vector of indices to selected rows of parcombok (and to
    #       $parcombstats and $didNhapcombLst)
    if (nrow(parcombok) == 0) {
      return(list(sel=integer(0)))
    }
    parcombstats <-
      data.frame(P = rep(NA_real_, nrow(parcombok)),
                 nhap = rep(NA_integer_, nrow(parcombok)),
                 toosmall = rep(FALSE, nrow(parcombok)))
    didNhapcombLst <- list() #save calculated data that can be used later
    #                        to convert FS mrkdids into FS hapcombs
    for (pco in seq_len(nrow(parcombok))) {
      #calc parcombstats[pco,] and didNhapcombLst[[pco]] :
      tmp <- calcParcombstats(pco=pco, parcombok=parcombok, Pdids=Pdids,
                              parcombstats=parcombstats,
                              didNhapcombLst=didNhapcombLst,
                              allhap=allhap, ahcinfo=ahcinfo)
      parcombstats <- tmp$parcombstats
      didNhapcombLst <- tmp$didNhapcombLst
    } # for pco

    # now we have the parcombstats calculated for all pco.
    # we reject all that have P <= minPseg and
    # from the remaining we select the one with the "best" combination
    # of nhap and P
    Pvalues <- parcombstats$P[!is.na(parcombstats$P)]
    if (length(Pvalues) == 0) {
      sel <- integer(0)
    } else {
      Pthreshold <- findPthreshold(minP=minPseg, Pvalues=Pvalues)
      sel <- which(!is.na(parcombstats$P) & parcombstats$P >= Pthreshold)
    }
    list(sel=sel, parcombstats=parcombstats, didNhapcombLst=didNhapcombLst)
  } # selFromParcombok within solveOneFS

  # START of solveOneFS
  FSfrac <- errfrac + DRrate # the naive fraction "unexpected" FS individuals
  NoSolution <- FALSE
  maxP <- -1
  PahccolsTried <- list(integer(0), integer(0)) # to which oldPhacs must be added at the end
  FSmrkdidsTable <- table(mrkdids[names(mrkdids) %in% FS], useNA="no")
  FScount <- sum(FSmrkdidsTable)
  Pdids <- mrkdids[match(P, names(mrkdids))] #Pdids: Parental marker dosage IDs
  Phacs <- vector("list", 2) #Phacs: Parental haplotype combinations (column nrs in ahc(complete)list)
  for (p in 1:2) {
    if (is.na(Pdids[p])) {
      Phacs[[p]] <- integer(0)
    } else {
      #get the parental haplotype combinations based on ihl:
      if (is.na(knownPhac[p])) {
        i <- which(names(ihl$hclist) == Pdids[p])
        Phacs[[p]] <- ihl$hclist[[i]] # a subset of all possible hacs
      } else {
        Phacs[[p]] <- knownPhac[p]
      }
    }
  }
  sel <- integer(0)
  if (anyNA(Pdids) || FScount == 0) {
    message <- ("missing marker data in parents or no FS data; skipped")
    NoSolution <- TRUE
  } else {
    message <- ""
    # We consider two sets of parental hapcombs: first the
    # ones based on the ihl obtained ignoring any population structure:
    # this leads to a subset of possible hapcombs for each parent. And
    # second, if no suitable parental hapcombs are found in the first set,
    # we consider all possible parental hapcombs (from the ahc(complete)list);
    # of course we skip the ones that were already considered in the first set.
    # In the first and the second set, we first only check which fraction
    # of the FS cannot be explained by each parcomb (ignoring DR) - this
    # is a relatively fast check. We do a binomial test and only keep the
    # parcombs where the probability of that amount of "missingFS" is less
    # than P.reject[1] (=0.05). The parcombs that pass this first test
    # are next checked further; the segregation of mrkdids in the FS is
    # tested against the expected segregation using a chi-squared test, and
    # this is combined with a P-value for unexpected mrkdids. Here DR is
    # taken into account.
    # We use two P.reject levels: a strict (0.05 as before) and
    # a more lenient (0.001) in the first filtering. Now we consider
    # successively the segregation of the parcombs: (1) that result from
    # the restricted ihl set and pass P.reject 0.05, (2) that result from
    # the complete ahclist set and pass P.reject 0.05, (3) that result from the
    # restricted ihl set and pass P.reject 0.001, (4) that result from the
    # complete ahclist set and pass P.reject 0.001. As soon as one of these
    # 4 sets yields acceptable parcombs the next are all skipped.
    # This procedure is quite complex, but it will allow more often a solution
    # while not increasing the computations too much, and not accepting a bad
    # (incorrect) solution from the ihl set if a better one exists.
    firstcombs <- length(Phacs[[1]]) * length(Phacs[[2]]) # from ihl
    if (firstcombs == 0) stop("firstcombs==0, should not happen!")
    if (firstcombs > maxparcombs) {
      # too many potential combinations (it takes about 45 min to check
      # 150000 combinations)
      message <- paste("more than", maxparcombs, "parental combinations; skipped")
      parcombok <- matrix(integer(0), ncol=2)
    } else {
      parcombok <- calcParcombok(potPhacs=Phacs,
                                 oldPhacs=PahccolsTried,
                                 allhap=ihl$allhap, Pdids=Pdids,
                                 FSfrac=FSfrac, ahcinfo=ahcinfo)
      PahccolsTried <- Phacs
      step1pc <- parcombok$setaside_pcs
      parcombok <- parcombok$parcombok
      # note that in this first try the oldPhacs have 0 columns as
      # PahccolsTried is created empty at start of SolveOneFS
      tmp <- selFromParcombok(parcombok=parcombok, allhap=ihl$allhap,
                              minPseg=minPseg)
      sel <- tmp$sel; parcombstats <- tmp$parcombstats;
      didNhapcombLst <- tmp$didNhapcombLst
      maxP <- max(maxP, parcombstats$P, na.rm=TRUE)
      #print(paste("firstcombs, firstsel: P=", parcombstats$P, "length(sel)=", length(sel)))
      if (length(sel) == 0) {
        #No parental combinations from inferHaps_noFS, preselected with
        #stringent P.reject, fit
        #1. See how many parental combinations are possible without first
        #   inferHaps_noFS (i.e. with getAllHapcomb).
        #   If p1*p2 < maxparcombs we try them all.
        nwPhacs <- vector("list", 2)
        for (p in 1:2) {
          if (is.na(knownPhac[p])) {
            nwPhacs[[p]] <-
              1:ncol(getAllHapcomb(mrkdid=Pdids[p], nmrk=ncol(ihl$allhap),
                                   ahcinfo=ahcinfo))
          } else {
            nwPhacs[[p]] <- knownPhac[p]
          }
        }
        extracombs <-
          length(nwPhacs[[1]]) * length(nwPhacs[[2]]) - firstcombs
        if (extracombs > maxparcombs) {
          message <- paste("more than", maxparcombs,
                           "extra parental combinations; skipped")
          parcombok <- step2pc <- matrix(nrow=0, ncol=2,
                                         dimnames=list(NULL,c("col1", "col2")))
          sel <- integer(0)
        } else {
          # we try all possible parcombs
          # calculate for extra parcombs:
          parcombok <- calcParcombok(potPhacs=nwPhacs,
                                     oldPhacs=PahccolsTried,
                                     allhap=ihl$allhap, Pdids=Pdids,
                                     FSfrac=FSfrac, ahcinfo=ahcinfo)
          for (p in 1:2)
            PahccolsTried[[p]] <- sort(union(PahccolsTried[[p]], nwPhacs[[p]]))
          step2pc <- parcombok$setaside_pcs
          parcombok <- parcombok$parcombok
          tmp <- selFromParcombok(parcombok=parcombok, allhap=ihl$allhap,
                                  minPseg=minPseg)
          sel <- tmp$sel; parcombstats <- tmp$parcombstats;
          didNhapcombLst <- tmp$didNhapcombLst
          maxP <- max(maxP, parcombstats$P, na.rm=TRUE)
        }
        if (length(sel) == 0) {
          # now we add two more sets of candidate parcombs: those that were
          # set aside in the firstcombs and in the extracombs check.
          # (i.e. those that failed the preselection with P.reject[1] but
          # passed with P.reject[2])
          parcombok <- step1pc # the setaside_pcs from the firstcombs set
          tmp <- selFromParcombok(parcombok=parcombok, allhap=ihl$allhap,
                                  minPseg=minPseg)
          sel <- tmp$sel; parcombstats <- tmp$parcombstats;
          didNhapcombLst <- tmp$didNhapcombLst
          maxP <- max(maxP, parcombstats$P, na.rm=TRUE)
          if (length(sel) > 0) {
            # remove a possible message about too many extracombs,
            # else the FS will be considered as not fitted:
            message <- ""
          } else {
            # length(sel) == 0
            parcombok <- step2pc
            tmp <- selFromParcombok(parcombok=parcombok, allhap=ihl$allhap,
                                    minPseg=minPseg)
            sel <- tmp$sel; parcombstats <- tmp$parcombstats;
            #print(paste("extracombs, secondsel: P=", parcombstats$P, "length(sel)=", length(sel)))
            didNhapcombLst <- tmp$didNhapcombLst
            maxP <- max(maxP, parcombstats$P, na.rm=TRUE)
          } # no firstcombs remain if preselected with P.reject[2]
        } # no extracombs remain if preselected with P.reject[1]
      } # no firstcombs remain if preselected with P.reject[1]
    } # not too many firstcombs
  } # parental and FS data available

  # now we have a parcombok with matching sel, parcombstats and didNhapcombLst
  if (maxP == -1) maxP <- NA
  if (length(sel) == 0) {
    if (message=="") message <- "no valid FS solutions found"
    NoSolution <- TRUE #after analyzing all possible parental combinations
  }
  if (message != "") {
    return(list(message=message,
                Pdids=Pdids,
                parcombok=matrix(integer(0), ncol=2),
                NoSolution=NoSolution,
                SolFound=FALSE,
                Pseg=NA,
                maxPseg=maxP,
                PahccolsTried=PahccolsTried))
  }
  # now we have acceptable solutions: sel refers to the acceptable rows of
  # parcombok

  # SECOND STEP: find the best fitting of the acceptable parental haplotype
  # combinations.
  # parcombok indicates which combinations of these match (almost) all FS's.
  # Now, any of these combinations could be the correct one. We check all
  # and select the best, based on match with expected segregation (including
  # the expected error and DR rates) and the number of haplotypes.
  if (length(sel) == 1) {
    sel3 <- sel
  } else {
    #among these we need to select the best solution(s), based on P and nhap.
    #If we take one extra haplotype to be as bad as a 5-fold lower P,
    #we can use the selection criterion:
    selcrit <- parcombstats$nhap[sel] - log(parcombstats$P[sel], base=5)
    if (sum(!is.na(selcrit)) == 0) stop("error in selcrit")
    toler <- log(0.5, base=5) #tolerance: select all that have a P-value >= 0.5 * highest P
    # old: sel2 <- which.min(selcrit) # sel2 indexes the best among the sel
    sel2 <- which(selcrit >= min(selcrit) + toler) #multiple equivalent solutions possible
    sel3 <- sel[sel2] #sel3 indexes the best among the original parcombs
  }
  return(list(message="",
              Pdids=Pdids,
              parcombok=parcombok[sel3,, drop=FALSE],
              didNhapcomb=didNhapcombLst[sel3],
              NoSolution=FALSE,
              SolFound=TRUE,
              Pseg=parcombstats$P[sel3],
              maxPseg=max(parcombstats$P[sel3]),
              PahccolsTried=PahccolsTried))
} #solveOneFS

checkpops <- function(parents, FS, mrkDosage) {
  message <- ""
  if (missing(parents) || is.null(parents) ||
      (is.matrix(parents) && nrow(parents)==0))
    parents <- matrix(character(0), ncol=2)
  if (missing(FS) || is.null(FS)) FS <- list()
  if (!is.list(FS))
    message <- "FS must be a list"
  if (!is.matrix(parents) || ncol(parents) != 2)
    message <- "parents must be a matrix with 2 columns"
  if (length(FS) != nrow(parents))
    message <- "FS does not match parents"
  FS <- lapply(FS, as.character)
  allFSind <- unlist(FS)
  if (anyNA(allFSind) || !all(allFSind %in% colnames(mrkDosage)))
    message <- "All FS indiv must occur in mrkDosage"
  parents <- matrix(as.character(parents), ncol=2)
  if (anyNA(parents) || !all(parents %in% colnames(mrkDosage)))
    message <- "All parents must occur in mrkDosage"
  list(parents=parents, FS=FS, message=message)
} #checkpops

checkHaploblock <- function(haploblock, mrkDosage) {
  # haploblock is a list of haploblocks, or a vector of marker names in
  # a single haploblock.
  # Returns the haploblock list with empty haploblocks deleted and with
  # names added to nameless haploblocks.
  # Produces an error if not all markers occur in matrix mrkDosage (the
  # reverse is allowed: mrkDosage may contain markers not in haploblock)
  if (is.list(haploblock)) {
    haploblock <- haploblock[sapply(haploblock, length) > 0]
    mrknames <- unlist(haploblock)
    if (anyNA(mrknames) || !all(mrknames %in% rownames(mrkDosage)))
      stop("Not all markers in haploblock occur in mrkDosage")
  } else {
    #haploblock not a list: then must be the name of the only
    #haploblock present, which contains all markers in mrkDosage
    if (!is.vector(haploblock) || length(haploblock) != 1)
      stop("Haploblock invalid")
    hbname <- as.character(haploblock)
    haploblock <- list(rownames(mrkDosage))
    names(haploblock) <- hbname
  }
  if (is.null(names(haploblock)))
    names(haploblock) <- paste0("hb", padded(seq_len(length(haploblock))))
  nonames <- which(is.na(names(haploblock)) | names(haploblock)=="")
  if (length(nonames) > 0)
    names(haploblock)[nonames] <- paste0("hb",padded(nonames))
  haploblock
} #checkHaploblock

# hapOneBlock approach
# A - For all FSs calculate the possible solutions, but dont apply any of them
#     From large to small, without grouping
# B - use all possible parental haplotypes as "known" to re-do
#     the skipped FSs
# C - Per group of related FSs with one or more solutions:
#     loop: find and resolve conflicts between FSs:
#     where both FSs have one or more solutions but the parent(s) cannot have
#     the same hac
#     For each parental solution, see for how many total FS progeny (over
#     multiple FSs) it is acceptable. Discard the one with the smallest total
#     FS progeny;
#     check for which FSs this was the only solution for that parent and
#     remove these FSs from further consideration (i.e. treat as unrelated).
#     Continue until no further conflicts: all remaining parental solutions
#     fit all remaining FSs where they are a parent
#     We do this over all parents at the same time, i.e. each time for
#     all parents calculate the solution applicable to the smallest total
#     FS progeny and reject that
# D - While there are parents with multiple solutions left (these fit all
#     remaining FSs of which they are parents, so these FSs also have multiple
#     solutions):
#     For each combination of remaining parental solutions over all remaining
#     parents calculate an overall P-value (by multiplying the P-values of all
#     remaining FSs for that combination).
#     If any single one of these combinations of two parental hacs has the
#     largest P-value, select that and apply it to all these remaining FSs;
#     else treat all remaining FSs as unrelated.
# E - solve all unrelated material (incl FSs for which no solutions found,
#     skipped FSs and FSs set aside during C, including their parents if not
#     solved yet;
#     for this use the original knownhap and the haplotypes of the ACCEPTED
#     FS solutions (i.e. not all the ones present at the end of step C,
#     which might be conflicting between FSs in the same group)

hapOneBlock <- function(mrkDosage, hbname, ahcinfo, parents, FS,
                        minfrac, errfrac, DRrate,
                        dropUnused, maxparcombs, minPseg,
                        knownHap, progress) {
  # called by inferHaplotypes
  # mrkDosage is now checked, a matrix with only the rows for the markers
  #           in the current haploblock
  # hbname: the name of the haploblock
  # all other parameters as in inferHaplotypes, plus ahcinfo: a list
  # returned by loadAllHapcombLists at start of inferHaplotypes
  # parents and FS are checked, and FS has length 0 if no FS families defined
  # Return value is a list with elements:
  # $hapdos: a matrix with one column per individual and one row per
  #          haplotype, with the dosages och each haplotypes in each indiv
  #          (if dropUnused, only the rows for the haplotypes that occur
  #           at least once in the population are retained). The columns are
  #           named as the individuals, the rows have the haplotype names.
  # $mrkdids: the observed mrkdids for all individuals, in same order as the
  #           columns of $hapdos; names are individual names
  # $allhap: a matrix with the composition of all haplotypes
  # $FSfit: logical vector with one element per FS family; TRUE if a (or
  #          more than one) acceptable solution for the FS is found (although
  #          if more than one solution they might not be used if unclear
  #          which is the best solution)
  # $FSmessages: character vector with one message per FS. If not FSfit then
  #          an error message; if FSfit either "" or a report of mrkdids that
  #          were not expected or that were not assigned a haplotype combination
  #          because multiple possible (and likely) combinations would give the
  #          same mrkdid
  # $FSpval: numeric vector with a P value for each FS. If one solution was
  #         chosen, the chi-squared P-value of that solution; else the maximum
  #         P-value over all considered solutions

  getFSgroupStructure <- function(FSgrouping, grp, FSdata, pcos, parents) {
    # FSgrouping: a list as produced by
    # grp: index to one group in FSgrouping
    # FSdata: list in hapOneBlock with one element per FS, we use here
    #         the $parcombok
    # pcos: a list with one element per FS, each of which is a vector indexing
    #       the still valid rows of the parcombok for that FS
    # parents: a matrix with the nams of the two parents of each FS
    # for one group of related FSs produce a list with item:
    # $availFSs: integer vector with FS numbers:
    #         which FSs belong to the group and still have at least
    #         1 solution (at least 1 row in FSsof[[fs]]$parcombok)
    # $availParents: character vector with the names of all parents of the
    #         availFSs
    # $FSmatlist: list of matrices, one for each availFS, with one row for each
    #         parent occurring in any of the availFSs, and one column for each
    #         parcomb (combination of 2 parental hacs).
    #         The matrices are filled with the ahccols of the hacs.
    #         (i.e. in each matrix only the 2 rows of the parents of that FS
    #         are not NAs)
    # $parhapcombs: list of vectors, one for each availParent, with all
    #         hacs (haplotype combinations) occurring in any of the
    #         solutions for any availFS for that parent.
    availFSs <- integer(0) #all FSs in the grp that still have some solutions
    # for (fs in FSgrouping$FSgroups[[grp]]) {
    #   #if (nrow(FSdata[[fs]]$parcombok) > 0) availFSs <- c(availFSs, fs)
    #   if (length(pcos[[fs]]) > 0) availFSs <- c(availFSs, fs)
    # }
    pcocounts <- sapply(pcos, length)
    availFSs <- intersect(FSgrouping$FSgroups[[grp]], which(pcocounts > 0))
    #           all FSs in group with some solutions left
    # the rest also works if no FSs left:
    availparents <- sort(unique(as.vector(parents[availFSs,]))) # names of
    #         parents of available (non-conflicting) FSs
    FSmatlist <- vector("list", length(availFSs)) #for each FS still
    #         available in the group, we make a matrix with one row for each
    #         parent occurring in any of them, and one column for each
    #         parcomb (combination of 2 parental hacs).
    #         The matrices are filled with the ahccols of the hacs,
    for (avfs in seq_along(availFSs)) {
      fs <- availFSs[avfs]
      #pccount <- nrow(FSdata[[fs]]$parcombok) # count of parental
      #       combinations for the current fs
      FSmatlist[[avfs]] <- matrix(integer(0), nrow=length(availparents),
                                  ncol=pcocounts[fs])
      rownames(FSmatlist[[avfs]]) <- availparents
      for (pc in seq_len(pcocounts[fs])) {
        prows <- match(parents[fs,], availparents)
        FSmatlist[[avfs]][prows, pc] <-
          FSdata[[fs]]$parcombok[pcos[[fs]][pc],] #already ahccols
      }
    } #for avfs

    # now we have the FSmatlist filled and we can directly
    # compare the ahccols for each parent between FSs
    # the FSmatlist matrices have NAs for the parents that don't occur
    # in that FS (i.e. only 2 rows non-NA)
    # Now we create a list with for each availparent the unique hapcombs
    # over all availFSs:
    parhapcombs <- vector("list", length(availparents))
    for (prow in seq_along(availparents)) {
      parhapcombs[[prow]] <- integer(0)
      for (avfs in seq_along(availFSs)) {
        parhapcombs[[prow]] <-
          c(parhapcombs[[prow]], FSmatlist[[avfs]][prow,])
      }
      parhapcombs[[prow]] <-
        sort(unique(parhapcombs[[prow]][!is.na(parhapcombs[[prow]])]))
    } # for prow
    names(parhapcombs) <- availparents
    list(availFSs=availFSs, availparents=availparents,
         FSmatlist=FSmatlist, parhapcombs=parhapcombs)
  } #getFSgroupStruct within hapOneBlock

  getAllParCombs <- function(parents, FSdata, FSgrpstruct) {
    # produce a list with
    # $allparcombs: a matrix with 1 parhapcomb per row with ahccols of all
    #               available parents, one column per available parent
    # $combPval: ("combined P-value") the product of the P-values over all FSs
    #            for the corresponding row of allparcomb
    allparcombs <- as.matrix(expand.grid(FSgrpstruct$parhapcomb)) #matrix with one row
    #              for each possible combination of ahccols of all parents;
    #              parents in columns, in order of parhapcomb
    Pval <- rep(1.0, nrow(allparcombs))
    for (fs in FSgrpstruct$availFSs) {
      ok <- rep(FALSE, nrow(allparcombs))
      parcols <- match(parents[fs,], colnames(allparcombs))
      for (cmb in seq_len(nrow(FSdata[[fs]]$parcombok))) {
        parahccols <- FSdata[[fs]]$parcombok[cmb,]
        okrow <- which(allparcombs[, parcols[1]] == parahccols[1] &
                       allparcombs[, parcols[2]] == parahccols[2])
        ok[okrow] <- TRUE
        Pval[okrow] <- Pval[okrow] * FSdata[[fs]]$Pseg[cmb]
      }
      allparcombs <- allparcombs[ok,, drop=FALSE]
      Pval <- Pval[ok]
    }
    list(allparcombs=allparcombs, Pval=Pval)
  } #getAllParCombs within hapOneBlock

  findParcombokRow <- function(allparcombs, selcomb, parents, FSsof, fs) {
    # allparcombs: matrix with all parents in group in columns and
    #             group solutions in rows, containing the parental hapixes
    # selcomb: one number: a row in allparcombs
    # parents: the 2-column matrix defining the parents of each FS
    # FSsof: the list with all solveOneFS results
    # fs: the fs we want
    # return value: the row number of FSsof$parcombok corresponding to
    #               the parental solution selcomb
    apccols <- match(parents[fs,], colnames(allparcombs))
    # for debugging:
    if (length(apccols) != 2 || anyNA(apccols)) stop("apccols error")
    pcorow <- which(FSsof[[fs]]$parcombok[, 1] ==
                      allparcombs[selcomb, apccols[1]]
                       &
                      FSsof[[fs]]$parcombok[, 2] ==
                      allparcombs[selcomb, apccols[2]])
    # for debugging:
    if (length(pcorow) != 1) stop ("pcorow error")
    pcorow
  } #findParcombokRow within hapOneBlock

  selOneMrkdidHac <- function(FSsof, pco, expdid) {
    # FSsof: list returned by solveOneFS
    # pco: the selected solution (row in FSsof$parcombok)
    # expdid: vector of all unique expected mrkdids in this FS
    # For the indicated solution FSsof$didNhapcomb[[pco]],
    # find out which hacs (haplotype combinations) correspond to the
    # given mrkdid (names of marker dosage combinations). If more than one,
    # select the one with the highest probability if much more probable than
    # the others, else none
    # return value: a matrix with ploidy rows and one column for each did in
    # expdid, will contain the unique solution (hac) for that did, or NA if
    # not one most likely hac. The names are the expdid.
    # TODO debug check:
    if (anyNA(expdid) || length(unique(expdid)) != length(expdid))
      stop("invalid expdid")
    hacmatrix <- matrix(integer(0),
                         nrow=nrow(FSsof$didNhapcomb[[1]]$hapcomb),
                         ncol=length(expdid),
                         dimnames=list(NULL, expdid))
    for (md in seq_along(expdid)) {
      haccols <- which(FSsof$didNhapcomb[[pco]]$mrkdid == expdid[md])
      if (length(haccols) == 0) {
        #results$didproblems[1, md] <- TRUE
      } else if (length(haccols) == 1) {
        #only one hapcomb corresponds to mrkdid, assign:
        hacmatrix[, md] <- FSsof$didNhapcomb[[pco]]$hapcomb[, haccols]
      } else { # length(haccols) > 1
        #multiple hapcomb correspond to same mrkdid. Assign only if one
        #hapcomb has a very high probability compared with the rest (to avoid
        #e.g. that DR solutions get in the way of assigning a good solution)
        didprobs <- FSsof$didNhapcomb[[pco]]$expHCfreq[haccols]
        didprobs <- didprobs / sum(didprobs)
        srtprobs <- sort(didprobs, decreasing=TRUE)
        if ((srtprobs[1] >= 0.8 && srtprobs[2] <= 0.05) ||
            (srtprobs[1] >= 0.5 && srtprobs[2] <= 0.01)) {
          # one most probable hac for this mrkdid, select:
          sel <- which(didprobs >= 0.5)
          hacmatrix[, md] <- FSsof$didNhapcomb[[pco]]$hapcomb[, haccols[sel]]
        } else {
          # not one single most likely hapcomb for this did,
          # leave result$hacmatrix column as NAs
        }
      } # multiple hacs for mrkdid
    } # for md
    hacmatrix
  } # selOneMrkdidHac within hapOneBlock

  newSolstats <- function(knownHap, FS) {
    stats <- list( # the initial "results" before cycle 1
      knownHap=knownHap, #the haplotypes present in the assigned parental
      # hapcombs and/or in the specified knownhaps
      Pahccols=matrix(NA_integer_, nrow=nrow(parents), ncol=2), # the assigned
      # parental hapcombs that are identical in all solutions, as ahccols -
      # these are considered certain and used as basis for next cycle
      selPahccols=matrix(NA_integer_, nrow=nrow(parents), ncol=2), # as
      # Pahccols, but now including the selected solutions for those that
      # were not the same over all solutions - these are not necessarily
      # certain and only used if this cycle is chosen as bestcycle
      pcos=vector("list", length(FS)), # for each FS a vector with the remaining
      # pcos (parental haplotype combinations), indexing FSdata[[fs]]$parcombok
      # and $Pseg
      selpco=rep(NA_integer_, length(FS)), # the selected pco, if any
      solvedFSindiv=0 # the number of FS indiv with mrkdids (i.e. with complete
      # marker data) in the FSs for which a solution is assigned
      # TODO: consider if we can use an extra statistic to select between
      # cycles with the same maximum solvedFSindiv, if the cycles don't converge
      # to a single solution
    )
    for (fs in seq_along(FS)) stats$pcos[[fs]] <- integer(0)
    stats
  } # function newSolstats

  resolveGroupConflicts <- function(FSgroup, FSgrpstruct, FSgrouping, FSdata,
                                    parents, solstats, cycle, ihl) {
    #This function considers all FS solutions for all FS in a group of
    #linked FS families and selects the best overall solution: a combination
    #of parental hapcombs that explains the largest number of FS progeny; and
    #FS and parent(s) that conflict with the rest of the group are removed
    #from the group (and will be considered as unrelated material later)
    #output: list with (possibly) modified FSgrpstruct, allparcombs, solstats
    FSconflicts <- TRUE
    while (FSconflicts) {
      FSconflicts <- FALSE
      FSgrpstruct <-
        getFSgroupStructure(FSgrouping=FSgrouping, grp=grp, FSdata=FSdata,
                            pcos=solstats[[cycle]]$pcos, parents=parents)
      # It is possible that no FSs are left in a group, if none of the FSs
      # has at least one solution. In that case we skip the rest of the loop
      # and leave FSconflicts FALSE to exit the loop.
      if (length(FSgrpstruct$availFSs) > 1) {
        # find and resolve conflicts between FSs:
        # where both FS have one or more solutions but the shared parent
        # cannot have the same hac.
        # For each parental solution, see for how many total FS progeny (over
        # multiple FSs) it is acceptable.
        # Discard the one with the smallest total FS progeny;
        # check for which FSs this was the only solution for that parent and
        # remove these FSs from further consideration (i.e. treat as unrelated).
        # Continue until no further conflicts: all remaining parental solutions
        # fit all remaining FSs where they are a parent
        miscount <- vector("list", length(FSgrpstruct$availparents)) # for each
        #           availparent a vector with for each hac the nr of FS progeny
        #           it does not match
        names(miscount) <- FSgrpstruct$availparents
        for (prow in seq_along(FSgrpstruct$availparents)) {
          # prow indicates the parent, row number in the FSgrpstruct$FSmatlist's
          # (and in allparcombs, but that doesn't exist at this stage)
          miscount[[prow]] <- rep(0, length(FSgrpstruct$parhapcombs[[prow]]))
          for (cix in seq_along(FSgrpstruct$parhapcombs[[prow]])) {
            # cix is the index of the current parental ahccol in
            # FSgrpstruct$parhapcombs[[prow]]
            phcmb <- FSgrpstruct$parhapcombs[[prow]][cix]
            # phcmb is the actual parental ahccol indicated by cix
            # now, for all FSs in grp:
            for (avfs in seq_along(FSgrpstruct$availFSs)) {
              fs <- FSgrpstruct$availFSs[avfs]
              if (FSgrpstruct$availparents[prow] %in% parents[fs,]) {
                # if the prow parent is a parent of this FS ...
                if(!(phcmb %in% FSgrpstruct$FSmatlist[[avfs]][prow,])) {
                  # ... and this ahccol for this parent is not in one of the
                  #     solutions of this FS, then this parental solution
                  #     "misses" all GENOTYPED indivs of the entire FS
                  miscount[[prow]][cix] <-
                    miscount[[prow]][cix] + FSdata[[fs]]$mrksize
                }
              }
            } # for avfs
          } # for cix
        } # for prow
        #now for each parental combination we know how many fully genotyped
        # FS progeny it misses
        #find the (any) maximum:
        maxmissedprogcount <- sapply(miscount, max)
        FSconflicts <- any(maxmissedprogcount > 0)
        if (!FSconflicts) break #ends the "while FSconflicts" loop
        maxprow <- which.max(maxmissedprogcount) #the parent with the most
        #                                missed progeny in its worst solution
        maxcix <- which.max(miscount[[maxprow]]) #the index to the solution itself
        maxphcmb <- FSgrpstruct$parhapcombs[[maxprow]][maxcix] #the ahccol
        #now we find in which availFSs this maxhcmb occurs for this parent;
        #the corresponding column is deleted from the FSmatlist matrices,
        #and for each FS in which this ahccol is a solution:
        #the index to the corresponding row in FSdata[[fs]]$parcombok
        #(and to the corresponding $Pseg) is deleted from
        #solstats[[cycle]]$pcos[[fs]]
        for (avfs in seq_along(FSgrpstruct$availFSs)) {
          cix <- which(FSgrpstruct$FSmatlist[[avfs]][maxprow,] == maxphcmb)
          if (length(cix) > 0) {
            # the maxprow parent occurs in this FS and the hapcomb to be deleted
            # is one of its possibilities in this FS
            fs <- FSgrpstruct$availFSs[avfs]
            solstats[[cycle]]$pcos[[fs]] <-
              solstats[[cycle]]$pcos[[fs]][-cix]
          }
        } # for avfs
      } else if(length(FSgrpstruct$availFSs) == 1) {
        # exactly 1 FS in this group, just apply its best solution:
        fs <- FSgrpstruct$availFSs
        solstats[[cycle]]$pcos[[fs]] <- which.max(FSdata[[fs]]$Pseg)
      } else {
        # length(FSgrpstruct$availFSs) == 0, no FS's left in this group
        # nothing to do ?
      }
    } # while FSconflicts
    # If there are parents with multiple solutions left (these fit all
    # remaining FSs of which they are parents, so these FSs also have
    # multiple solutions):
    # For each combination of remaining parental solutions over all
    # remaining parents calculate an overall P-value (by multiplying the
    # P-values of all remaining FSs for that combination).
    # (TODO: perhaps use selcrit instead of Pval? Then selcrit must also
    #        be saved in FSresults)
    # If any single one of these parental haplotype combinations has the
    # largest P-value, select that and apply it to all these remaining FSs;
    # else treat all remaining FSs as unrelated.

    FSgrpstruct <- getFSgroupStructure(FSgrouping, grp, FSdata,
                                       solstats[[cycle]]$pcos, parents)
    allparcombs <- getAllParCombs(parents, FSdata, FSgrpstruct)
    combinedPval <- allparcombs$Pval
    allparcombs <- allparcombs$allparcombs

    # debug check:
    if ((length(FSgrpstruct$availFSs) == 0 && nrow(allparcombs) > 0) ||
        (length(FSgrpstruct$availFSs) > 0 && nrow(allparcombs) == 0))
      stop("mismatch availFSs and allparcombs")
    # All FSs in a group that have 0 rows in FSdata[[fs]]$parcombok are not
    # among the availFSs anymore. This may be because there was no solution
    # in the first place (conflict parent/offspring, missing data parents,
    # too many missing data offspring, skipped because of too many parcombs)
    # or because the solutions all conflicted with other FSs in the group.
    # These FSs will not be solved but their FSdata$messages must be
    # filled (if that hasn't been done yet).
    # If there are FSs left over (still available), allparcombs has 1 valid
    # combination per row with ahccols of all available parents
    #
    # first treat all FSs in grp that are not in availFSs:
    # (including the case that there are no availFSs left over)
    for (fs in FSgrouping$FSgroups[[grp]]) {
      if (!(fs %in% FSgrpstruct$availFSs)) {
        #check for debugging:
        #if (nrow(FSsof[[fs]]$parcombok) > 0)
        if (length(solstats[[cycle]]$pcos[[fs]]) > 0)
          stop("error: length(pcos) should be 0")
      }
    }
    if (nrow(allparcombs) == 0) {
      # no solutions remaining for this group
      solstats[[cycle]]$knownHap <- knownHap # the original, specified knownhap
      # we leave solstats[[cycle]]$Pahccols and $solvedFSindiv
      # unchanged; $pcos should now all be integer(0) for the group FSs
    } else {
      # at least one solution for this group, and FSgrpstruct$availFSs
      # not empty (but perhaps not including all FSs)
      # keep the combination(s) of parental ahccols that has/have the
      # highest combinedPval:
      maxP <- max(combinedPval)
      bestcomb <- which(combinedPval > 0.99 * maxP) #1% tolerance for rounding errors
      # Now we first do the things that use all best solutions:
      # - mark the haplotypes that occur in all solutions of each parent as
      #   knownHap for this cycle
      # - mark the hacs of the parent(s) with the same hac over all
      #   solutions as "known"
      # - mark all FSs in group as FSfit

      tmpapc <- allparcombs[bestcomb,, drop=FALSE]
      # limit the solstats[[cycle]]$pcos to those allowed in any
      # of the bestcomb group solutions:
      for (fs in FSgrpstruct$availFSs) {
        p1ahccols <- tmpapc[, colnames(tmpapc)==parents[fs, 1]]
        p2ahccols <- tmpapc[, colnames(tmpapc)==parents[fs, 2]]
        pcos <- integer(0)
        for (i in seq_along(p1ahccols)) {
          pco <- which(FSdata[[fs]]$parcombok[, 1] == p1ahccols[i] &
                         FSdata[[fs]]$parcombok[, 2] == p2ahccols[i])
          if (length(pco) == 1 && !(pco %in% pcos)) {
            pcos <- c(pcos, pco)
          }
        }
        names(pcos) <- NULL # all added pco have name P1col, confusing
        # debug check:
        if (!all(pcos %in% solstats[[cycle]]$pcos[[fs]]))
          stop("mismatch in pcos")
        solstats[[cycle]]$pcos[[fs]] <- pcos
      } # for fs

      parmrkdids <- allmrkdids[match(colnames(tmpapc), names(allmrkdids))]
      # for each parent we determine the haplotypes common to all selected
      # solutions and add these to the knownHap for this cycle:
      for (p in seq_len(ncol(tmpapc))) {
        phacs <- getAllHapcomb(mrkdid=parmrkdids[p], nmrk=ncol(ihl$allhap),
                               ahcinfo=ahcinfo)[, tmpapc[, p], drop=FALSE]
        for (i in seq_len(ncol(phacs))) {
          if (i == 1) {
            phaps <- unique(phacs[, 1])
          } else phaps <- intersect(phaps, phacs[, i])
        }
        solstats[[cycle]]$knownHap <-
          union(solstats[[cycle]]$knownHap, phaps)
      }
      solstats[[cycle]]$knownHap <- sort(solstats[[cycle]]$knownHap)

      # for the parents with the same solution (same ahccol in all group
      # solutions), set that solution in solstats[[cycle]]$Pahscols
      uniqPar <- which(apply(tmpapc, MARGIN=2, max) ==
                         apply(tmpapc, MARGIN=2, min) )
      uniqPar <- colnames(tmpapc)[uniqPar] # names of the parents with a
      # unique solution
      for (fs in FSgrpstruct$availFSs) {
        for (p in 1:2) {
          up <- which(uniqPar == parents[fs, p])
          if (length(up) == 1) {
            solstats[[cycle]]$Pahccols[fs, p] <-
              tmpapc[1, colnames(tmpapc) == uniqPar[up]]
          }
        }
        # add the number of genotyped FS indivs if this FS has
        # a unique solution:
        if (!anyNA(solstats[[cycle]]$Pahccols[fs,])) {
          solstats[[cycle]]$solvedFSindiv <-
            solstats[[cycle]]$solvedFSindiv + FSdata[[fs]]$mrksize
        }
      }
      # debug check: the number of rows for the FSs with one single
      # Pahccol for both parents should be 1:
      for (fs in FSgrpstruct$availFSs) {
        if (!anyNA(solstats[[cycle]]$Pahccols[fs,])) {
          if (length(solstats[[cycle]]$pcos[[fs]]) != 1) {
            stop("both parents one solution but multiple parcombok rows")
          }
        }
      }


      # Next we decide which one (or no) solution to apply
      selectAmongBest <- 2
      # selectAmongBest: if there are <= selectAmongBest equally good
      # solutions we select the first one. If > selectAmongBest we don't
      # select any (but we already marked these FSs as "fitted" so they
      # also won't be solved later as unrelated material, which would
      # be an even worse solution than a random one among the best).
      #TODO?: we could then still check which solution requires the smallest
      #number of haplotypes; or we could check for each solution how many of
      #the "rest" individuals can be assigned and select the maximum.
      if (length(bestcomb) == 1) {
        selbestcomb <- bestcomb
      } else if (length(bestcomb) <= selectAmongBest) {
        selbestcomb <- bestcomb[1] #so we take the first of a small number
        # (2) of equally good solutions
      } else {
        selbestcomb <- NA # too many equally good solutions, cannot select
      }
      # now length(selbestcomb) == 1

      if (is.na(selbestcomb)) {
        # we can't assign more parental genotypes than those that are the
        # same in all solutions:
        solstats[[cycle]]$selPahccols[FSgrpstruct$availFSs,] <-
          solstats[[cycle]]$Pahccols[FSgrpstruct$availFSs,]
      } else {
        # we selected one solution (possible the first among several) and
        # assign the selPahccol for that solution:
        # note that we keep all solstats[[cycle]]$pcos[fs], i.e. also
        # for the non-selected bestcomb solutions
        tmpapc <- allparcombs[selbestcomb,, drop=FALSE] # 1 row
        for (fs in FSgrpstruct$availFSs) {
          for (p in 1:2) {
            pcol <- which(colnames(tmpapc) == parents[fs, p])
            #if (!is.na(pcol) && (pcol %in% uniqParAhccol))
            solstats[[cycle]]$selPahccols[fs, p] <- tmpapc[1, pcol]
            #debug check: Pahccol should be all equal to selPahccol or NA
            if (!is.na(solstats[[cycle]]$Pahccol[fs, p]) &&
                (is.na(solstats[[cycle]]$selPahccol[fs, p]) ||
                 solstats[[cycle]]$selPahccol[fs, p] !=
                 solstats[[cycle]]$Pahccol[fs, p]))
              stop("Pahccol and selPahccol don't match")
          } # for p
        } # for fs
      }
      for (fs in FSgrpstruct$availFSs) {
        selpco <-
          which(FSdata[[fs]]$parcombok[, 1] ==
                  solstats[[cycle]]$selPahccol[fs, 1] &
                  FSdata[[fs]]$parcombok[, 2] ==
                  solstats[[cycle]]$selPahccol[fs, 2])
        # may be integer(0) if one or both selPahccol[fs,] are NA
        # debug checkS:
        if (!(length(selpco) %in% 0:1))
          stop("selpco incorrect length")
        if (xor(length(selpco) == 0,
                anyNA(solstats[[cycle]]$selPahccol[fs,])))
          stop("selpco and selPahccol don't match NA")
        if (length(selpco) == 1) {
          # more debug checks:
          if (!all(FSdata[[fs]]$parcombok[selpco,] ==
                   solstats[[cycle]]$selPahccol[fs,]))
            stop("selpco and selPahccol don't match non-NA")
          if (!(selpco %in% solstats[[cycle]]$pcos[[fs]]))
            stop("selpco and pcos don't match")
          # all ok:
          solstats[[cycle]]$selpco[fs] <- selpco
        } else {
          # length(selpco)==0, no remaining solution for this fs
          solstats[[cycle]]$selpco[fs] <- NA
        }
      } # for fs
    } # nrow(allparcombs) > 0, at least one group solution
    list(FSgrpstruct=FSgrpstruct, allparcombs=allparcombs, solstats=solstats)
  } # resolveGroupConflicts

  imputeFSindiv <- function(mrkDosage, FS, fs, expdid, hac,
                              allmrkdids, imputedgeno, ahcinfo) {
    # the specified FS[[fs]] family has only one acceptable
    # parental hac combination. For partially (SNP-)genotyped
    # FS individuals where only one possibility exists for the
    # missing SNPs (given the parental hacs) we infer the missing
    # SNP dosages. These individuals are are also returned separately
    # in imputedGeno
    # export: allmrkdids. imputedGeno
    NAind <- which(is.na(colSums(mrkDosage)) &
                     colnames(mrkDosage) %in% FS[[fs]])
    if (length(NAind) > 0 && length(NAind) < 0.5 * length(FS[[fs]])) {
      expmrkdos <- mrkdid2mrkdos(dosageIDs=expdid, nmrk=nrow(mrkDosage),
                                 ploidy=ahcinfo$ploidy)
      impdids <-
        matrix(c(NAind, rep(NA, length(NAind))), nrow=2, byrow=TRUE,
               dimnames=list(c("NAind", "mrkdid"), NULL))
      for (i in seq_along(NAind)) {
        nai <- NAind[i]
        if (!all(is.na(mrkDosage[,nai]))) {
          md <- 1; matches <- integer(0)
          while (md <= ncol(expmrkdos) && length(matches) < 2) {
            if (all(is.na(mrkDosage[,nai]) |
                    mrkDosage[,nai] == expmrkdos[, md]))
              matches <- c(matches, md)
            md <- md + 1
          }
          if (length(matches) == 1)
            impdids[2, i] <- as.integer(colnames(expmrkdos)[matches])
        }
      }
      # impdids (imputed dids) calculated; check if the
      # segregation still fits parents:
      impdids <- impdids[, !is.na(impdids[2,]), drop=FALSE]
      if (ncol(impdids) > 0) {
        parhac <- hac[, colnames(hac) %in% parents[fs, ], drop=FALSE]
        # perhaps in reverse order; this doesn't matter
        # debug check:
        if (ncol(parhac) != length(unique(parents[fs,])) ||
            anyNA(parhac)) stop("error in parhac")
        totFSmrkdids <- c(allmrkdids[names(allmrkdids) %in% FS[[fs]]],
                          impdids[2,]) # includes NAs
        newFSdidstable <- table(totFSmrkdids, useNA="no")
        impstats <-
          testFSseg(parhac=parhac, DRrate=DRrate, errfrac=errfrac,
                    FSmrkdidsTable=newFSdidstable, allhap=ihl$allhap)
        #TODO: remove debug check:
        # the imputed mrkdids should be among the expected mrkdids, check:
        if (anyNA(impdids[2,]) || !all(impdids[2,] %in% expdid))
          stop("error in impdids")
        if (impstats$stats$P > 0.1 * FSpval[fs]) {
          # with the extra imputed FS mrkdids the fit of the FS family
          # is not too much worse, so add the imputed mrkdids
          allmrkdids[impdids[1,]] <- impdids[2,]
          # and we include the imputed mrk genotypes as a component
          # of the return value:
          FSimpGeno <-
            mrkdid2mrkdos(dosageIDs=impdids[2,], nmrk=nrow(mrkDosage),
                          ploidy=ahcinfo$ploidy)
          rownames(FSimpGeno) <- rownames(mrkDosage)
          colnames(FSimpGeno) <- colnames(mrkDosage)[impdids[1,]]
          imputedGeno <- cbind(imputedGeno, FSimpGeno)
        }
      }
    }
    return(list(allmrkdids=allmrkdids, imputedGeno=imputedGeno))
  } # imputeFSindiv

  # hapOneBlock START ####
  allmrkdids <- mrkdos2mrkdid(mrkDosage, ploidy=ahcinfo$ploidy, check=FALSE)
  if (length(FS) == 0) {
    ihl <- inferHaps_noFS(mrkdids=allmrkdids, ahcinfo=ahcinfo,
                           markers=rownames(mrkDosage),
                           minfrac=minfrac,
                           knownHap=knownHap, useKnownDosage=TRUE,
                           progress=progress)
    hapdos <- hapcomb2hapdos(
      hapcomb=hapcomb_from_IHlist(ihl=ihl, mrkdids=allmrkdids,
                                  ahcinfo=ahcinfo, nocombs=FALSE)$hapcomb,
      nhap=nrow(ihl$allhap))
    result <- list(
      hapdos=hapdos, mrkdids=allmrkdids, markers=colnames(ihl$allhap))
  } else {
    # one or more FS
    #first calculate ihl without using FS info:
    # - for elements of the returned list
    # - for a pre-selection of likely parental hapcomb
    ihl <- inferHaps_noFS(mrkdids=allmrkdids, ahcinfo=ahcinfo,
                           markers=rownames(mrkDosage),
                           minfrac=minfrac[1], #without final inference
                           knownHap=knownHap, useKnownDosage=FALSE,
                           progress=progress)
    ahccols <- hapcomb_from_IHlist(ihl=ihl, mrkdids=allmrkdids,
                                   ahcinfo=ahcinfo, nocombs=TRUE)$ahccols
    ihl <- ihl[names(ihl) != "hclist"] #sometimes big, not used here, delete
    nhap <- nrow(ihl$allhap)
    # we calculate the FS sizes per haploblock because of differences
    # in missing data:
    indiv <- colnames(mrkDosage)
    mrkcomplete <- !is.na(colSums(mrkDosage)) #for each indiv: complete marker dosage info?
    FSgrouping <- getLinkedFSgroups(parents=parents)

    # We perform a looping where in each cycle we find the optimal solution
    # per group of linked FSs, which yields
    # - the haplotypes present in the parents of the known FSs
    # - the solutions for the parents of the solved FSs
    # - the number of FS indiv with complete marker genotypes over all solved
    #   FSs
    FSdata <- vector(mode="list", length(FS)) #some accumulated data for each FS
    # over all cycles, as follows:
    for (fs in seq_along(FSdata)) {
      FSdata[[fs]]$mrksize <-
        sum(!is.na(allmrkdids[names(allmrkdids) %in% FS[[fs]]]))
      #  the number of indiv in the FS with complete marker info
      FSdata[[fs]]$message <- "" # will be about skipped or about FS-mrkdids with
      #  multiple possible hapcombs
      FSdata[[fs]]$NoSolution <- FALSE # will only be TRUE
      #  when all possible parcombs were checked and failed, i.e. when
      #  firstcombs and extracombs not skipped but no acceptable solution
      FSdata[[fs]]$SolFound <- FALSE # will remain FALSE until at least one
      #  acceptable solution found for this FS. Will then always remain TRUE
      #  even if all solutions in parcombok are discarded later due to conflicts
      #  with other FSs in the group
      FSdata[[fs]]$parcombok <- matrix(NA_integer_, nrow=0, ncol=2) # will remain
      #  empty (0 rows) until at least one acceptable solution found; and then
      #  not recomputed or changed anymore in subsequent cycles
      FSdata[[fs]]$Pseg <- numeric(0) # the chi-squared P-values for the mrkdid
      #  segregation for each row of parcombok; will be filled in at the same
      #  time as parcombok.
      FSdata[[fs]]$maxPseg <- NA # the maximum of $Pseg or if no solutions found
      #  the maximum of the discarded solutions (insufficient for acceptance).
      #  NA as long as no solutions considered
      FSdata[[fs]]$PahccolsTried <- list(P1=integer(0), P2=integer(0)) # the
      #  Pahccols of which the parcombs were checked; empty except when no
      #  acceptable solution found and some but but not all possible parhapcomb
      #  checked (i.e. firstcombs done but extracombs skipped; can be extended
      #  over multiple cycles but emptied when NoSolution TRUE or parcombok not
      #  empty anymore
    }

    solstats <- list() # for each cycle this will store the solutions and their
    # statistics over all FSs. "cycle 1" stores the initial data.
    solstats[[1]] <- newSolstats(knownHap, FS) # the initial "results" are
    # stored as the results of cycle 1; the first true cycle with computations
    # is cycle 2
    cycle <- 1 # "cycle 1" is the start situation
    while (TRUE) {
      cycle <- cycle + 1
      # did we converge to a stable solution or to a repeating set of different
      # solutions?
      if (cycle > 2) {
        bestcycle <- NA
        if (identical(solstats[[cycle-1]], solstats[[cycle-2]])) {
          bestcycle <- cycle - 1
        } else {
          oldcycle <- cycle - 3
          while (oldcycle > 0 &&
                 !identical(solstats[[cycle - 1]], solstats[[oldcycle]]))
            oldcycle <- oldcycle - 1
          if (oldcycle > 0) {
            # oldcycle had same solution as cycle-1: find the best cycle
            # in [oldcycle ...cycle-2]:
            maxSolvedFSindiv <- solstats[[oldcycle]]$solvedFSindiv
            minhapcount <- length(solstats[[oldcycle]]$knownhap)
            bestcycle <- oldcycle
            while (oldcycle < cycle - 2) {
              oldcycle <- oldcycle + 1
              if (solstats[[oldcycle]]$solvedFSindiv > maxSolvedFSindiv ||
                  (solstats[[oldcycle]]$solvedFSindiv == maxSolvedFSindiv &&
                   length(solstats[[oldcycle]]$knownhaps) < minhapcount)) {
                # we find a cycle with a better solution in terms of
                # solvedFSindiv and haplotype count
                # if two of the repeating cycles have the same solvedFSindiv
                # and hapcount we use the first (arbitrarily).
                # TODO: we could also discard the entire FS-based solution of
                # this haploblock but that seems a waste. Or we may think of
                # an additional statistic. How often will this occur anyway?
                maxSolvedFSindiv <- solstats[[oldcycle]]$solvedFSindiv
                minhapcount <- length(solstats[[oldcycle]]$knownhap)
                bestcycle <- oldcycle
              }
            }
          }
        }
        if (!is.na(bestcycle)) break
      }
      # No break, go on with new cycle
      solstats[[cycle]] <- newSolstats(knownHap, FS) #only the specified haplotypes
      # 1: for each FS find a set of acceptable solutions
      ih <- inferHaps_noFS(mrkdids=allmrkdids, ahcinfo=ahcinfo,
                            markers=rownames(mrkDosage),
                            minfrac=minfrac[1],
                            knownHap=solstats[[cycle-1]]$knownHap,
                            useKnownDosage=FALSE)
      # find acceptable solutions per FS in this cycle:
      for (fs in seq_along(FS)) {
        if (!FSdata[[fs]]$NoSolution && !FSdata[[fs]]$SolFound) {
          sof <- solveOneFS(mrkdids=allmrkdids, ihl=ih, FS=FS[[fs]],
                            P=parents[fs,],
                            knownPhac=solstats[[cycle-1]]$Pahccols[fs,],
                            triedPhacs=FSdata[[fs]]$PahccolsTried,
                            maxparcombs=maxparcombs,
                            ahcinfo=ahcinfo, minfrac=minfrac, errfrac=errfrac,
                            DRrate=DRrate, minPseg=minPseg)
          # we update FSdata[[fs]]:
          FSdata[[fs]]$message <- sof$message
          FSdata[[fs]]$NoSolution <- sof$NoSolution
          FSdata[[fs]]$SolFound <- sof$SolFound
          FSdata[[fs]]$parcombok <- sof$parcombok
          FSdata[[fs]]$Pseg <- sof$Pseg
          FSdata[[fs]]$maxPseg <- sof$maxPseg #different from max(Pseg) if no solution
          #FSdata[[fs]]$didNhapcombLst <- sof$didNhapcombLst #20151115 incorrect? sof has didNhapcomb, not List
          # a list with for each pco of this fs a list with for each FS mrkdid
          # a vector of the possible ahccols
          solstats[[cycle]]$pcos[[fs]] <- seq_len(nrow(FSdata[[fs]]$parcombok))
          for (p in 1:2) FSdata[[fs]]$PahccolsTried[[p]] <-
            sort(union(FSdata[[fs]]$PahccolsTried[[p]],
                       sof$PahccolsTried[[p]]))
        } else {
          solstats[[cycle]]$pcos[[fs]] <- solstats[[cycle-1]]$pcos[[fs]]
        }
      } # for fs
      # find the best solution over all FSs in this cycle:

      # Per group of linked FSs with one or more solutions:
      # loop: find and resolve conflicts between FSs:
      # where both FS have one or more solutions but the parent(s) cannot
      # have the same haplotype combination
      # For each parental solution, see for how many total FS progeny
      # (over multiple FSs) it is acceptable. Discard the one with the
      # smallest total FS progeny; check for which FSs this was the only
      # solution for that parent and remove these FSs from further
      # consideration (i.e. treat as unrelated).
      # Continue until no further conflicts: all remaining parental
      # solutions fit all remaining FSs where they are a parent.
      # We do this over all parents at the same time, i.e. each time
      # for all parents calculate the solution applicable to the smallest
      # total FS progeny and reject that.
      for (grp in seq_along(FSgrouping$FSgroups)) {
        rgc <- resolveGroupConflicts(FSgroup=grp, FSgrpstruct=FSgrpstruct,
                              FSgrouping=FSgrouping, FSdata=FSdata,
                              parents=parents, solstats=solstats, cycle=cycle,
                              ihl=ihl)
        FSgrpstruct <- rgc$FSgrpstruct
        allparcombs <- rgc$allparcombs
        solstats <- rgc$solstats
        rm(rgc)
      } # for grp
    } # while(cycle)

    # hapOneBlock after cycles ####

    # now we know which cycle (bestcycle) has the optimum solution,
    # and with FSdata and solstats[[bestcycle]] we will now get all the results.
    # We assign the parental hacs from solstats[[bestcycle]]$selPahccols
    # we assign the FS individuals from the FS calculations based on these Phacs
    # we assign the rest (including the non-fitted FSs and their parents) based
    # on the known haplotypes in solstats[[bestcycle]]$knownhap
    hac <- matrix(integer(0), nrow=ahcinfo$ploidy, ncol=ncol(mrkDosage),
                  dimnames=list(NULL, colnames(mrkDosage)))
    FSfit <- logical(length(FS)) # all FALSE
    FSpval <- rep(NA_real_, length(FS)) # all NA
    rest <- setdiff(indiv, c(parents, unlist(FS)))
    imputedGeno <- matrix(integer(0), nrow=nrow(mrkDosage),
                          dimnames=list(rownames(mrkDosage), NULL))
    for (fs in seq_along(FS)) {
      pcorows <- solstats[[bestcycle]]$pcos[[fs]] # even if one
      # group solution (bestcomb) was chosen among several (2) equally good
      # ones, solstats[[cycle]]$pcos[[fs]] is not reduced to 1 but keeps all
      # rows allowed by bestcomb (the one selected solution, if any, is in
      # solstats[[cycle]]$selpco[fs] ). Similarly, solstats[[cycle]]$Pahccol
      # contains only non-NA ahccols for parents that have the same solution
      # across all bestcomb, while solstats[[cycle]]$selPahccol has also
      # ahccols for parents based on the selected selbestcomb group solution.
      if (length(pcorows) == 0) {
        # should never happen?
        if ((nrow(FSdata[[fs]]$parcombok) > 0)) {
          FSdata[[fs]]$message <- "parental conflict with other FS(s)"
        } else if (FSdata[[fs]]$message == "") {
          FSdata[[fs]]$message <- "no solution found"
        }
        # else: length(pcorows)==0, nrow(parcombok)==0 and message not '';
        # message already contains "no solution found" or something more
        # informative(?), we don't overwrite it.
        FSpval[fs] <- FSdata[[fs]]$maxPseg
        rest <- union(rest, c(parents[fs,], FS[[fs]])) # note that some parents
        # may still be filled in from other FSs
      } else {
        # length(pcorows) > 0 :
        # there are 1 or more solutions (pcorows) for this fs in bestcycle
        # It is possible that there is just one parental combination valid
        # (if length(pcorows)==1; in that case
        # solstats[[bestcycle]]$selPahccols[fs, ] are both known), or that
        # there are multiple parental combinations possible; in that case
        # we don't want to assign the parent(s) with multiple solutions but
        # still we want to assign the FS indivs with the same solution in all
        # possible parental combinations.
        # we assign the parent(s) with the same solution over all, if any:
        for (p in 1:2) {
          if (!is.na(solstats[[bestcycle]]$selPahccols[fs, p])) {
            # one single selected hac for this parent, assign:
            Pdid <- allmrkdids[names(allmrkdids)==parents[fs, p]]
            hac[, colnames(hac)==parents[fs, p]] <-
              getAllHapcomb(mrkdid=Pdid, nmrk=ncol(ihl$allhap),
                            ahcinfo=ahcinfo)[, solstats[[bestcycle]]$selPahccols[fs, p]]
          }
        }
        if (!is.na(solstats[[bestcycle]]$selpco[fs])) {
          # one solution for this FS selected (but there might be more
          # than one solstats[[bestcycle]]$pcos[[fs]])
          FSfit[[fs]] <- TRUE
          FSpval[fs] <- FSdata[[fs]]$Pseg[solstats[[bestcycle]]$selpco[fs]]
        } else {
          # no selected pco for this FS but >=1 FS solutions (pcorows) found
          FSdata[[fs]]$message <- "multiple FS solutions"
          FSpval[fs] <- max(FSdata[[fs]]$Pseg[pcorows])
        }
        # Next we assign all FS indivs with a mrkdid that has a unique solution
        # over all (1 or more) possible parental combinations:
        if (!is.na(solstats[[bestcycle]]$selpco[fs]))
          pcorows <- solstats[[bestcycle]]$selpco[fs]
        # else (is.na(selpco)): keep on using all pcorows)
        unexp_dids <- integer(0)
        multisol_dids <- integer(0)
        for (pcor in seq_along(pcorows)) {
          pcr <- pcorows[pcor]
          sof <- solveOneFS(mrkdids=allmrkdids, ihl=ihl, FS=FS[[fs]],
                            P=parents[fs,],
                            knownPhac=FSdata[[fs]]$parcombok[pcr,], #known Phac both parents
                            triedPhacs=list(integer(0), integer(0)), #to have sof recomputed
                            maxparcombs=maxparcombs,
                            ahcinfo=ahcinfo, minfrac=minfrac, errfrac=errfrac,
                            DRrate=DRrate, minPseg=minPseg)
          # This sof has only one row in its parcombok because only one, valid,
          # combination of parents was tested! So sof$parcombok is different
          # from FSdata[[fs]]$parcombok, which may have more than 1 row
          if (pcor == 1) {
            expdid <- unique(sof$didNhapcomb[[1]]$mrkdid) #original may contain (consecutive) duplicates
            mdhacs <- selOneMrkdidHac(sof, pco=1, expdid=expdid) # mdhacs is matrix of hacs, not vector of ahccols
            if (length(pcorows) == 1) {
              #TODO remove debug check
              if (FSdata[[fs]]$message != "") stop("FS message not empty")
              # this FS has a unique solution, we can test for unexpected mrkdids
              fsmrkdids <- allmrkdids[names(allmrkdids) %in% FS[[fs]]]
              fsdidstable <- table(fsmrkdids)
              unexp_dids <- setdiff(names(fsdidstable), expdid) #colnames(somh$didproblems)[somh$didproblems[1,]]
              sum_unexp_dids <- sum(fsmrkdids %in% unexp_dids) #sof$didNhapcomb[[1]]$mrkdid %in% unexp_dids)
              if (sum_unexp_dids > 0) {
                FSdata[[fs]]$message <-
                  paste0("unexpected mrkdid(s) ",
                         paste(unexp_dids, collapse=", "),
                         ", n=", sum_unexp_dids)
              }
              multisol_dids <-
                intersect(colnames(mdhacs)[is.na(mdhacs[1,])], names(fsdidstable))
              sum_multisol_dids <- sum(fsmrkdids %in% multisol_dids)
              if (sum_multisol_dids > 0) {
                if (FSdata[[fs]]$message != "")
                  FSdata[[fs]]$message <- paste0(FSdata[[fs]]$message, "; ")
                FSdata[[fs]]$message <-
                  paste0(FSdata[[fs]]$message,
                         "multiple solutions for mrkdid(s) ",
                         paste(multisol_dids, collapse=", "),
                         ", n=", sum_multisol_dids)
              }
              # length(pcorows)==1, so a unique parental combination
              # we check if there are partially genotyped FS individuals
              # that can have only one expected mrkdid; these can be imputed.
              # (and later they will then also be haplotyped, if the imputed
              # mrkdid has only one allowed hac)
              impfs <- imputeFSindiv(mrkDosage=mrkDosage, FS=FS, fs=fs,
                                     expdid=expdid, hac=hac,
                                     allmrkdids=allmrkdids,
                                     imputedgeno=imputedGeno, ahcinfo=ahcinfo)
              allmrkdids <- impfs$allmrkdids
              imputedGeno <- impfs$imputedGeno
              rm(impfs)
            } # if (length(pcorows) == 1)
            mdhacs <- mdhacs[, !is.na(mdhacs[1,]), drop=FALSE]
            # end of pcor == 1
          } else if (ncol(mdhacs) > 0) {
            # already looked at pcorows[1 ... pcor-1]; now see if there are common
            # mdhacs between those and the current pcorow.
            # each solution has a different set of expected mrkdid;
            # delete the ones that are not common to all.
            # get the hacs from the new pcorow that also occurred in the earlier pcorows:
            tmphacs <- selOneMrkdidHac(sof, pco=1, #because this sof has only one row in parcombok
                                       expdid=colnames(mdhacs)) #$hacmatrix #the earlier mdhacs
            tmphacs <- tmphacs[, !is.na(tmphacs[1,])]
            # keep only the mdhacs from the earlier rows that also occur
            # in the current row:
            mdhacs <- mdhacs[, colnames(mdhacs) %in% colnames(tmphacs),
                             drop=FALSE]
            # keep only the mdhacs for mrkdids that HAVE THE SAME SELECTED HACs
            # in all solutions:
            id <- logical(ncol(mdhacs))
            for (md in seq_along(id)) {
              did <- colnames(mdhacs)[md]
              id[md] <- identical(mdhacs[, md], tmphacs[, colnames(tmphacs)==did])
            }
            mdhacs <- mdhacs[, id, drop=FALSE]
          }
        } # for pcor
        # mdhacs now has one ahccolnr for all FS mrkdids that
        # have the same selected hac over all pcorows.
        # we now assign all FS individuals that have one of these mrkdids:
        for (mdhcol in seq_len(ncol(mdhacs))) {
          did <- as.numeric(colnames(mdhacs)[mdhcol])
          ind <- intersect(FS[[fs]],
                           names(allmrkdids)[!is.na(allmrkdids) & allmrkdids==did])
          hac[, colnames(hac) %in% ind] <-
            mdhacs[, mdhcol]
          #TODO remove DEBUG CHECK:
          # if (length(ind) > 0) {
          #   nwdid <-
          #     mrkdos2mrkdid(
          #       mrkDosage=hapcomb2mrkdos(hapcomb=hac[, colnames(hac) %in% ind],
          #                                allhap=ihl$allhap),
          #       ploidy=ahcinfo$ploidy,
          #       check=FALSE)
          # }
        } # for mdhcol
      } #length(pcorows) > 0
    } # for fs

    # finally add the rest = the original rest plus the FSs for which no
    # solution or parental conflicts (but not the FSs that were left unsolved
    # because of too many equally good solutions)
    rest_ind <- intersect(rest, names(allmrkdids)[!is.na(allmrkdids)])
    rest_ihl <- inferHaps_noFS(mrkdids=allmrkdids[names(allmrkdids) %in% rest_ind],
                                ahcinfo=ahcinfo,
                                markers=rownames(mrkDosage),
                                minfrac=minfrac,
                                knownHap=solstats[[bestcycle]]$knownHap,
                                useKnownDosage=TRUE)
    for (md in seq_along(rest_ihl$mrkdidsTable)) {
      did <- names(rest_ihl$mrkdidsTable)[md]
      if (length(rest_ihl$hclist[[md]]) == 1) {
        ind <- names(allmrkdids)[(names(allmrkdids) %in% rest_ind) &
                                  allmrkdids==did]
        if (length(ind) > 0) {
          hac[, colnames(hac) %in% ind] <-
          getAllHapcomb(mrkdid=did, nmrk=ncol(ihl$allhap),
                        ahcinfo=ahcinfo)[, rest_ihl$hclist[[md]]]
          #TODO remove DEBUG CHECK:
          # nwdid <-
          #   mrkdos2mrkdid(
          #     mrkDosage=hapcomb2mrkdos(hapcomb=hac[, colnames(hac) %in% ind],
          #                              allhap=ihl$allhap),
          #     ploidy=ahcinfo$ploidy,
          #     check=FALSE)
        }
      }
    } # for md
    #for all indiv with a unique combination of haplotypes we now have
    #stored the solution (haplotype combination) in hac
    #
    #Finally, for parents that are among the rest (i.e. parents of FSs for which
    #no solution FS was found, and that were therefore solved as unrelated
    #material) we check if their solution conflicts with too many
    #of their offspring - the idea is that the parent may be mis-genotyped
    #which results in a failed FS solution.
    # There are 2 situations:
    # - both parents are in rest: then both parents must be checked separately
    #   against the FS progeny, and if both pass then the combination must be
    #   checked
    # - one of the parents is also parent of a non-rejected FS. If that parent
    #   has a final hac, the "rest" parent must be checked in combination with
    #   this; if the other-FS parent did not get a hac in the end, the "rest"
    #   parent is checked separately
    for (fs in seq_along(FS)) {
      restpar <- which(parents[fs,] %in% rest)
      FShac <- hac[, !is.na(hac[1,]) & colnames(hac) %in% FS[[fs]], drop=FALSE]
      if (length(restpar) > 0 && !all(is.na(hac[1, parents[fs, restpar]])) &&
          ncol(FShac[]) > 4) {
        #only in this case (some rest-parents with assigned hacs, and there
        #are "sufficient" (>4) haplotyped FS progeny to test against) we need
        #to do something.
        # first get a list of parents; gametes for each parent, with DR:
        pargamlist <- vector(mode="list", 2)
        for (p in 1:2) {
          if (is.na(hac[1, parents[fs, p]])) {
            pargamlist[[p]] <- "no data"
          } else {
            pargamlist[[p]] <- list()
            pargamlist[[p]]$DR <-
              getGameteFreqs(parhac=hac[, parents[fs, p]], DRrate=0.04)
            #  we only need the set of possible gametes; their freq (and so the exact
            #  DRrate) is not used. Only with DRrate=0 we would exclude all DR gametes
          }
        }
        # next get the FShac ordered so that identical columns are adjacent:
        o <- do.call("order", lapply(1:nrow(FShac), function(i) FShac[i,]))
        FShac <- FShac[, o, drop=FALSE]
        # is one of the parents also parent of a non-rest FS?
        if (length(restpar) == 1) {
          #one of the parents in rest (and we know it has a non-missing hac)
          if (parents[fs, 1] %in% rest) p <- 1 else p <- 2
          # p is the "rest" parent, 3-p the non-rest; we test only p separately
          if (is.na(hac[1, parents[fs, 3-p]])) {
            # the non-rest parent has missing hac
            chk <- checkFS_OneParent(pargamlist[[p]], FShac=FShac, nhap=nhap)
            if (nrow(chk) != 1) stop("error")
            nok <- rowSums(chk)
            if (nok/ncol(chk) < 0.9 && ncol(chk)-nok > 4) {
              # this parent has too many conflicts with its progeny: more than 10%
              # and more than 4 of the haplotyped progeny conflict.
              hac[, parents[fs, p]] <- NA
            }
          } else {
            # the non-rest parent has hac, we test p in combination with the other
            chk <- checkFS_TwoParents(pargamlist, FShac)
            if (nrow(chk) != 1) stop("error")
            nok <- rowSums(chk)
            if (nok/ncol(chk) < 0.9 && ncol(chk)-nok > 4) {
              hac[, parents[fs, p]] <- NA
              # parent 3-p might also not fit this FS but it is supported by
              # other FSs
            }
          }
        } else {
          # both parents in rest; first test separately:
          for (p in 1:2) if (!is.na(hac[1, parents[fs, p]])) {
            chk <- checkFS_OneParent(pargamlist[[p]], FShac=FShac, nhap=nhap)
            if (nrow(chk) != 1) stop("error")
            nok <- rowSums(chk)
            if (nok/ncol(chk) < 0.9 && ncol(chk)-nok > 4) {
              # this parent has too many conflicts with its progeny: more than 10%
              # and more than 4 of the haplotyped progeny conflict.
              hac[, parents[fs, p]] <- NA
            }
          }
          # next we test the parental combination if both still not NA
          if (!anyNA(hac[1, parents[fs,]])) {
            chk <- checkFS_TwoParents(pargamlist, FShac)
            if (nrow(chk) != 1) stop("error")
            nok <- rowSums(chk)
            if (nok/ncol(chk) < 0.9 && ncol(chk)-nok > 4) {
              hac[, parents[fs,]] <- NA
              # both parent hacs set to NA
            }
          }
        }
      }
    } # for fs

    hapdos <- hapcomb2hapdos(hapcomb=hac, nhap=nrow(ihl$allhap))
    rownames(hapdos) <-
      getHaplotypeNames(haploblock=hbname, hapcount=nrow(ihl$allhap))
    result <-
      list(hapdos=hapdos, mrkdids=allmrkdids, markers=colnames(ihl$allhap),
           FSfit=FSfit,
           FSmessages=sapply(FSdata, `[[`, "message"),
           FSpval=FSpval, imputedGeno=imputedGeno)
    # TODO?:(if rest_ihl with two minfrac values results in additional haplotypes,
    # see if these can help solve the remaining FSs? And then re-iterate?
  } # one or more FSs
  if (dropUnused) {
    result$hapdos <-
      result$hapdos[rowSums(result$hapdos, na.rm=TRUE) > 0, , drop=FALSE]
    #has 0 rows if no indiv has a unique solution
  }
  result
} # hapOneBlock

getLinkedFSgroups <- function(parents) {
  #parents: a matrix with 2 columns and one row for each FS family,
  #         with the names of the mother and father of each family.
  #Result: a list with 4 items:
  # $FSgroups: a list with one item for each group of linked FSs;
  #            each of the items is an integer vector with the row numbers
  #            of the FSs in the group
  # $FSgroupParents: a list with the same number of items as FSgroups,
  #            each item is a character vector with the names of all
  #            parents in that group (without duplicates)
  # $FSgroupnrs: an integer vector with for each FS the group it belongs to
  # $ParentGroupnrs: a named integer vector with for each parent the group
  #                  it belongs to
  # test with:
  # parents <- matrix(c("A","B","C","D","E","A","G","H","H","E"), ncol=2, byrow=TRUE)
  #
  #initially each FS gets its own item in both lists:
  FSgroups <- as.list(seq_len(nrow(parents)))
  FSgroupParents <- lapply(as.list(data.frame(t(parents), stringsAsFactors=FALSE)), sort)
  #merge groups based on shared parents:
  changes <- TRUE
  while (changes) {
    changes <- FALSE
    g <- 2
    while (!changes && g <= length(FSgroupParents)) {
      h <- 1
      while (!changes && h<g) {
        if (length(intersect(FSgroupParents[[g]], FSgroupParents[[h]])) > 0) {
          changes <- TRUE
          FSgroupParents[[h]] <- sort(unique(c(FSgroupParents[[g]], FSgroupParents[[h]])))
          FSgroups[[h]] <- sort(c(FSgroups[[g]], FSgroups[[h]]))
          FSgroupParents <- FSgroupParents[-g]
          FSgroups <- FSgroups[-g]
        }
        h <- h+1
      }
      g <- g+1
    }
  }
  FSgroupnrs <- integer(nrow(parents))
  uniparents <- sort(unique(as.vector(parents)))
  Parentgroupnrs <- integer(length(uniparents))
  names(Parentgroupnrs) <- uniparents
  for (g in seq_along(FSgroups)) {
    FSgroupnrs[FSgroups[[g]]] <- g
    Parentgroupnrs[names(Parentgroupnrs) %in% FSgroupParents[[g]]] <- g
  }
  list(FSgroups=FSgroups, FSgroupParents=FSgroupParents,
       FSgroupnrs=FSgroupnrs, Parentgroupnrs=Parentgroupnrs)
} #getLinkedFSgroups

getNewMrkdids <- function(mrkDosages, haploblock, maxmrk, ahcinfo) {
  # non-exported function only for use by function inferHaplotypes.
  # Parameters as the corresponding ones of inferHaplotypes but already
  # checked, and mrkDosages as matrix not data.frame with only the selected
  # indiv.
  # output: a list with one element for each haploblock size (nr of markers),
  # from 1 to the maximum haploblock size in haploblock (or to maxmrk if that
  # is smaller);
  # each element has all mrkdid (as integers) that occur over all haploblocks
  # of that size and for which no allhapcomb is yet available in
  # ahclist or ahccompletelist
  hbsizes <- sapply(haploblock, length)
  maxmrk <- min(maxmrk, max(hbsizes))
  result <- vector(mode="list", maxmrk)
  if (maxmrk <= ahcinfo$complete_nmrk) return(result)
  # we need to check all mrkdids for new ones
  knownmrkdid <- result # list of same length
  for (hbsize in (ahcinfo$complete_nmrk+1):maxmrk) {
    if (length(ahcinfo$ahclist) < hbsize) {
      knownmrkdid[[hbsize]] <- integer(0)
    } else knownmrkdid[[hbsize]] <- as.integer(names(ahcinfo$ahclist[[hbsize]]))
  }
  for (hb in seq_along(haploblock)) {
    if (hbsizes[[hb]] %in% (ahcinfo$complete_nmrk+1):maxmrk) {
      # mrkdids <-
      #   unique(mrkdos2mrkdid(mrkDosages[rownames(mrkDosages) %in%
      #                                     haploblock[[hb]],,drop=FALSE],
      #                        ploidy=ploidy, check=FALSE))
      uniqdos <-
        unique(mrkDosages[match(haploblock[[hb]], rownames(mrkDosages)),,
                          drop=FALSE], MARGIN=2)
      mrkdids <- mrkdos2mrkdid(uniqdos, ploidy=ahcinfo$ploidy, check=FALSE)
      result[[hbsizes[hb]]] <- c(result[[hbsizes[hb]]],
                                 setdiff(mrkdids, knownmrkdid[[hbsizes[hb]]]))
    }
  }
  for (hbsize in (ahcinfo$complete_nmrk+1):maxmrk) {
    if (!is.null(result[[hbsize]]))
        result[[hbsize]] <- setdiff(unique(result[[hbsize]]), NA)
  }
  result
} #getNewMrkdids

addNewMrkdids <- function(newMrkdids, ahcinfo, progress) {
  # non-exported function only for use by function inferHaplotypes.
  # newMrkdids: list as produced by getNewMrkdids
  # Parameters as the corresponding ones of inferHaplotypes but already
  # checked, and mrkDosages as matrix not data.frame with only the selected
  # indiv.
  # output: a list with one element for each haploblock size (nr of markers),
  # from 1 to the maximum haploblock size in haploblock (or to maxmrk if that
  # is smaller);
  # each element has all mrkdid (as integers) that occur over all haploblocks
  # of that size and for which no allhapcomb is yet available in
  # ahclist or ahccompletelist
  newdidcounts <- sapply(newMrkdids, length)
  totnew <- sum(newdidcounts)
  if (any(newdidcounts > 0)) {
    maxnmrk <- max(which(newdidcounts > 0))
    # add the haplotype combinations for the newdids to ahcinfo$ahclist
    # and save the new ahclist in ahcdir (overwriting any earlier version)
    if (progress && (maxnmrk + ahcinfo$ploidy >= 12 ||
                     sum(newdidcounts) > 100)) {
      cat(paste(sum(newdidcounts),
                "sets of haplotype combinations need to be calculated;\n"))
      cat("this may take quite a long time\n")
    }
    cum <- 0
    for (m in which(newdidcounts > 0)) {
      if (length(ahcinfo$ahclist) < m || !is.list(ahcinfo$ahclist[[m]])) {
        ahcinfo$ahclist[[m]] <- list()
        # initialize also all lower nmrk:
        for (i in seq_len(m)) {
          if (!is.list(ahcinfo$ahclist[[i]]))
            ahcinfo$ahclist[[i]] <- list()
        }
      }
      allhap <- allHaplotypes(1:m)
      for (i in seq_along(newMrkdids[[m]])) {
        did <- newMrkdids[[m]][i]
        if (progress) {
          cat(paste0("\rcalculation of new mrkdids: ", cum + i, " / ", totnew))
          flush.console()
        }
        ahcinfo$ahclist[[m]][[as.character(did)]] <-
          calcAllhapcomb4mrkdid(mrkdid=did, ploidy=ahcinfo$ploidy, allhap=allhap)
        # save after every 500 mrkdids so we can resume after crash:
        if (i %% 500 == 0) {
          ahclist <- ahcinfo$ahclist # because save will only save an entire R object, not a list element
          save(ahclist,
               file=paste0(ahcinfo$ahcdir, "ahclist_", ahcinfo$ploidy, "x.RData"),
               version=2)
        }
      }
      cum <- cum + newdidcounts[m]
    }
    cat("\n"); flush.console()
    ahclist <- ahcinfo$ahclist # because save will only save an entire R object, not a list element
    save(ahclist,
         file=paste0(ahcinfo$ahcdir, "ahclist_", ahcinfo$ploidy, "x.RData"),
         version=2)
  }
  ahcinfo
}

#'@title infer haplotypes for one or more haploblocks
#'@description infer haplotypes for one or more haploblocks, for all individuals,
#'using FS family(s) (with parents) if present, and infer haplotypes for
#'non-FS material as well
#'@usage inferHaplotypes(mrkDosage, indiv=NULL, ploidy, haploblock,
#'parents=NULL, FS=NULL, minfrac=c(0.1, 0.01), errfrac=0.025, DRrate=0.025,
#'maxmrk=0, dropUnused=TRUE, maxparcombs=150000, minPseg=1e-8,
#'knownHap=integer(0), progress=TRUE, printtimes=FALSE, ahcdir)
#'@param mrkDosage matrix or data.frame. Markers are in rows, individuals in
#'columns, each cell has a marker dosage. Names of individuals are the column
#'names, marker names are the row names or (if a data.frame) in a column named
#'MarkerNames. All marker dosages must be in 0:ploidy or NA.
#'@param indiv NULL (default) or a character vector with names of all individuals
#'to be considered. If NULL, all columns of mrkDosage are selected.\cr
#'All indivs that are not in parents or FS vectors (see below) are considered
#'unrelated, i.e. we have no implementation for pedigrees (yet).
#'@param ploidy all marker dosages should be in 0:ploidy or NA
#'@param haploblock a list of character vectors. The names are the names of the
#'haploblocks, the character vectors have the names of the markers in each
#'haploblock. Haplotype names are constructed from the haploblock names, which
#'are used as prefixes to which the (zero-padded) haplotype numbers are are
#'appended with separator '_'.
#'@param parents a matrix with one row for each FS family and two columns
#'for the two parents, containing the names of the female and male parent
#' of each family.
#'@param FS a list of character vectors. Each character vector has the names of
#'the individuals of one FS family. The items of the list should correspond
#'to the rows of the parents matrix, in the same order.
#'@param minfrac vector of two fractions, default 0.1 and 0.01. A haplotype is
#'considered to be certainly present if it must occur in at least a fraction
#'minfrac[1] of all individuals; in the final stage for the "other"
#'individuals (those that do not belong to the FS or its parents) this fraction
#'is lowered to minfrac[2]; see also inferHaps_noFS
#'@param errfrac default 0.025. The assumed fraction marker genotypes with an error
#'(over all markers in the haploblock). The errors are assumed to be uniformly
#'distributed over all except the original marker dosage combinations (mrkdids)
#'@param DRrate default 0.025. The rate of double reduction per meiosis (NOT
#'per allele!); e.g. with a DRrate of 0.04, a tetraploid parent with
#'genotype ABCD will produce a fraction of 0.04 of DR gametes AA, BB, CC and DD
#'(each with a frequency of 0.01), and a fraction of 0.96 of the non-DR gametes
#'AB, AC, AD, BC, BD, CD (each with a frequency of 0.16)
#'@param maxmrk Haploblocks with more than maxmrk markers will be skipped.
#'Default 0: no haploblocks are skipped
#'@param dropUnused TRUE (default) if the returned matrix should only contain
#'rows for haplotypes that are present; if FALSE matrix contains rows for all
#'possible haplotypes
#'@param maxparcombs Parent 1 and 2 both may have multiple possible haplotype
#'combinations. For each pair of haplotype combinations (one from P1 and one
#'from P2) the expected FS segregation must be checked against the observed.
#'This may take a long time if many such combinations need to be checked.
#'This parameter sets a limit to the number of allowed combinations per
#'haploblock; default 150000 takes about 45 min.
#'@param minPseg default 1e-8. The minimum P-value of a chisquared test for
#'segregation in FS families. The best solution for an FS family is selected
#'based on a combination of P-value and number of required haplotypes, among all
#'candidate solutions with a P-value of at least minPseg. If no such solution
#'is found the FS and its parents are treated as unrelated material
#'@param knownHap integer vector with haplotype numbers (haplotypes that must be
#'present according to prior inference or knowledge, numbers refer to rows of
#'matrix produced by allHaplotypes); default integer(0), i.e. no known
#'haplotypes
#'@param progress if TRUE, and new haplotype combinations need to be calculated,
#'and the number of markers and the ploidy are both >= 6, progress is indicated
#'by printed messages
#'@param printtimes if TRUE, the time needed to process each haploblock is
#'printed
#'@param ahcdir a single directory, or not specified.
#'inferHaplotypes uses lists that for each combination of marker
#'dosages give all possible combinations of haplotype dosages. These lists
#'(ahclist and ahccompletelist) are loaded and saved at the directory
#'specified by ahcdir. If no ahcdir is specified it is set to the current
#'working directory.\cr
#'If an ahclist or ahccompletelist for the correct ploidy is already in
#'GlobalEnv this is used and no new list is loaded, even if ahcdir is specified.
#'@details
#'First we consider the case where one or more FS families and their
#'parents are present in the set of samples. In that case, initially the
#'possible haplotype configurations of the parents are determined.
#'From that, all their possible gametes (assuming polysomic
#'inheritance) are calculated and all possible FS haplotype configurations.
#'Comparing this with the observed FS marker dosages the most likely parental
#'and FS configurations are found.\cr
#'It is possible that multiple parental combinations can explain the observed
#'marker dosages in the FS. In that case, if one is clearly more likely and/or
#'needs less haplotypes, that one is chosen. If there is no clear best solution
#'still the parents and FS individuals that have the same haplotype
#'configuration over all likely solutions are assigned that configuration.\cr
#'For FS where no good solution is found (because of an error in the marker
#'dosages of a parent, or because the correct solution was not considered) the
#'parents and individuals will be considered as unrelated material.\cr
#'If several FS families share common parents they are treated as a group,
#'and only solutions are considered that are acceptable for all families
#'in the group.\cr
#'Finally (or if no FS families are present, immediately) the other samples
#'are haplotyped, which are considered as unrelated material. If FS families
#'have been solved the haplotypes in their parents are considered "known",
#'and known haplotypes can also be supplied (parameter knownHap). For these
#'samples we consecutively add haplotypes that must be present in a minimum
#'number of individuals, always trying to minimize the number of needed
#'haplotypes.\cr
#'InferHaplotypes uses tables that, for each combination of dosages of the
#'markers in the haploblock, list all haplotype combinations (ahc) that
#'result in these marker dosages. In principle inferHaplotypes uses a list
#'(ahccompletelist) that, for a given ploidy, has all the haplotype combinations
#'for haploblocks from 1 up to some maximum number of markers. This list can be
#'computed with function build_ahccompletelist. If this list is not available
#'(or is some haploblocks contain more markers than the list), the ahc for
#'the (extra) marker.\cr
#'See the PolyHaplotyper vignette for an illustrated explanation.
#'@return a list with for each haploblock one item that itself is a list
#'with items:\cr
#'message; if this is "" the haploblock is processed and further
#'         elements are present; else this message says why the haploblock was
#'         skipped (currently only if it contains too many markers)\cr
#'hapdos: a matrix with the dosages of each haplotype (in rows) for each
#'        individual (in columns). For each individual the haplotype dosages
#'        sum to the ploidy. If dropUnused is TRUE Only the haplotypes that
#'        occur in the population are shown, else all haplotypes\cr
#'mrkdids: a vector of the mrkdid (marker dosage ID) for each individual
#'         (each combination of marker dosages has its own ID; if any of the
#'         markers has an NA dosage the corresponding mrkdid is also NA).\cr
#'         The mrkdids can be converted to the marker dosages with function
#'         mrkdid2mrkdos.\cr
#'markers: a vector with the names of the markers in the haploblock\cr
#'imputedGeno: a matrix in the same format as param mrkDosage, with one row
#'        for each marker in the haploblock and one column per imputed
#'        individual, with the dosages of the markers. These are the individuals
#'        that have incomplete data in mrkDosage but where the available marker
#'        dosages match only one of the expected marker genotypes in the FS
#'        family (only individuals in FS families are imputed). It is possible
#'        that an individual with imputed marker dosages is not haplotyped (as
#'        is the case for individuals with complete marker data) if the
#'        marker dosages match different possible haplotype combinations.
#'The next elements are only present if one or more FS families were
#'specified:\cr
#'FSfit: a logical vector with one element per FS family; TRUE if a (or
#'        more than one) acceptable solution for the FS is found (although
#'        if multiple solution are found they might not be used if unclear
#'        which one is the best solution). (Even if no
#'        solution was found for an FS, still its individuals may have a
#'        haplotype combination assigned ignoring their pedigree)\cr
#'FSmessages: a character vector with one item per FS family: any
#'        message relating to the fitting of a model for that FS,
#'        not necessarily an error\cr
#'FSpval: a vector of the chi-squared P-value associated with the selected
#'        FS model for each FS family, or the maximum P value over all
#'        models in case none was selected\cr
#'If for new combinations of marker dosages the possible haplotype combinations
#'have to be calculated, an ahclist file is written to ahcdir
#'@examples
#'\donttest{
#'# this example takes about 1 minute to run:
#'data(PolyHaplotyper_small)
#'results <- inferHaplotypes(mrkDosage=phdos, ploidy=6,
#'haploblock=phblocks, parents=phpar, FS=phFS)
#'names(results)
#'names(results[[1]])
#'}
#'@export
inferHaplotypes <- function(mrkDosage, indiv=NULL, ploidy,
                            haploblock, parents=NULL, FS=NULL,
                            minfrac=c(0.1, 0.01), errfrac=0.025, DRrate=0.025,
                            maxmrk=0,
                            dropUnused=TRUE, maxparcombs=150000,
                            minPseg=1e-8,
                            knownHap=integer(0), progress=TRUE,
                            printtimes=FALSE,
                            ahcdir) {

  mrkDosage <- checkmrkDosage(mrkDosage, ploidy=ploidy, indiv=indiv)
  tmp <- checkpops(parents, FS, mrkDosage)
  if (tmp$message != "") stop(tmp$message)
  parents <- tmp$parents; FS <- tmp$FS; rm(tmp) #a.o. all as character
  haploblock <- checkHaploblock(haploblock, mrkDosage)
  nmrk <- max(sapply(haploblock, length))
  if (maxmrk == 0) maxmrk <- nmrk else maxmrk <- min(maxmrk, nmrk)
  if (missing(ahcdir)) ahcdir <- NULL
  ahcinfo <- loadAllHapcombLists(ploidy=ploidy, nmrk=maxmrk, ahcdir)
  # see if there are new mrkdids for which all haplotype combinations
  # must be calculated:
  newdids <- getNewMrkdids(mrkDosage, haploblock, maxmrk, ahcinfo)
  ahcinfo <- addNewMrkdids(newdids, ahcinfo, progress)
  nmrk <- sapply(haploblock, length)
  result <- vector(mode="list", length(haploblock))
  for (hb in seq_along(haploblock)) {
    starttime <- proc.time()
    cat(paste0("haploblock ", hb, " of ", length(haploblock), ": ",
               names(haploblock)[hb])); flush.console() # no \n !
    result[[hb]] <- list()
    if (nmrk[hb] > maxmrk) {
      result[[hb]]$message <- paste("skipped:", nmrk[hb],
                                    "markers while the limit is", maxmrk)
    } else {
      result[[hb]] <-
        hapOneBlock(mrkDosage[match(haploblock[[hb]], rownames(mrkDosage)),,
                              drop=FALSE],
                    hbname=names(haploblock)[hb],
                    ahcinfo=ahcinfo,
                    parents=parents, FS=FS, minfrac=minfrac,
                    errfrac=errfrac, DRrate=DRrate,
                    dropUnused=dropUnused,
                    maxparcombs=maxparcombs,
                    minPseg=minPseg,
                    knownHap=knownHap, progress=progress)
    }
    if (is.null(result[[hb]]$message)) result[[hb]]$message <- ""
    if (printtimes) {
      cat(paste(" -", format((proc.time()-starttime)[3], nsmall=2), "sec\n"))
    } else cat("\n")
    flush.console()
  } # for hb
  names(result) <- names(haploblock)
  result
} #inferHaplotypes

#'@title show marker and haplotype dosages for one FS family
#'@description show marker and haplotype dosages for one FS family and its
#'parents
#'@usage showOneFS(FSnr, hbresults, mrkDosage, FS, parents)
#'@param FSnr the number of the FS family (indexes the FS list and parents)
#'@param hbresults a list with the haplotyping results for one haploblock:
#'one element of a list as returned by inferHaplotypes
#'@param mrkDosage a matrix of marker dosages, may contain rows for more
#'markers than only those in the current haploblock (the relevant markers are
#'specified in the hbresults list)
#'@param FS a list of which each item is a (character) vector with the
#'names of the individuals in that FS family
#'@param parents a (character) matrix with 2 columns and one row for each
#'FS family in FS, with the names of the two parents of each family
#'@return a list with 3 elements:\cr
#'$mrkdat: a matrix with info on the marker dosages distribution in the FS. The
#'first two columns are for the parents: the parent name is between (brackets)
#'in the column name, their mrkdid (marker dosage ID) in row 1 and their marker
#'dosages below that. The remaining columns are for the different mrkdids
#'observed in the FS: the mrkdid itself in the column name, its frequency in
#'row 1 and its marker dosages below that. The final column gives the frequency
#'of individuals with one or more missing marker dosages.\cr
#'$hapdat: a matrix with similar layout as mrkdat, but now with the haplotype
#'dosages rather than the marker dosages. Some mrkdids (columns) may not have
#'a haplotype dosage combination assigned (if multiple possible haplotype
#'combinations result in the same marker dosages)\cr
#'usedhap: a matrix with the dosage (0 or 1) of each marker in each of the
#'used haplotypes; haplotype nrs in columns, markers in rows
#'@examples
#'data(PolyHaplotyper_small)
#'# show the results of the first FS family in the first haploblock:
#'showOneFS(FSnr=1, hbresults=phresults[[1]], mrkDosage=phdos,
#'          FS=phFS, parents=phpar)
#'
#'@export
showOneFS <- function(FSnr, hbresults, mrkDosage, FS, parents) {
  if (!all(colnames(hbresults$hapdos) %in% colnames(mrkDosage)))
    stop("colnames of hbresults$hapdos and mrkDosage don't match")
  if (!identical(colnames(hbresults$hapdos), names(hbresults$mrkdids)))
    stop("colnames of hbresults$hapdos and hbresults$mrkdids don't match")
  markers <- hbresults$markers
  if (!all(markers %in% rownames(mrkDosage)))
    stop("not all markers from hbresults in mrkDosage")
  mrkDosage <- mrkDosage[rownames(mrkDosage) %in% markers,]
  #check parents, make sure FS and parents are character:
  if (!is.matrix(parents) || nrow(parents) != length(FS))
    stop("parents must be a matrix matching FS")
  parents <- matrix(as.character(parents), nrow=nrow(parents))
  if (length(FSnr) != 1 || !(FSnr %in% (1:length(FS))))
    stop("invalid FSnr")
  FS <- as.character(FS[[FSnr]])
  FSdidstable <-
    table(hbresults$mrkdids[names(hbresults$mrkdids) %in% FS],
          useNA="always")
  res <- list()

  # first matrix mrkdat: showing the frq of all mrkdids and their marker dosages
  res$mrkdat <- matrix(NA_integer_, nrow=nrow(mrkDosage)+1,
                       ncol=length(FSdidstable)+2)
  rownames(res$mrkdat) <- c("frq", rownames(mrkDosage))
  colnames(res$mrkdat) <- c(paste0("(",parents[FSnr, 1], ")"),
                            paste0("(",parents[FSnr, 2], ")"),
                            names(FSdidstable))
  res$mrkdat[1, 1:2] <- hbresults$mrkdids[match(parents[FSnr,],
                                            colnames(mrkDosage))]
  res$mrkdat[1, 3:ncol(res$mrkdat)] <- FSdidstable
  res$mrkdat[2:nrow(res$mrkdat), 1:2] <-
    mrkDosage[, match(parents[FSnr,], colnames(mrkDosage))]
  for (i in seq_along(FSdidstable)) {
    mrkDosagecol <- which.max(hbresults$mrkdids == names(FSdidstable)[i])
    if (length(mrkDosagecol) == 1) #i.e. not for mrkdid <NA>
      res$mrkdat[2:nrow(res$mrkdat), i+2] <- mrkDosage[,mrkDosagecol]
  }

  # second matrix hapdat: same, but with haplotype dosages
  res$hapdat <- matrix(NA_integer_, nrow=nrow(hbresults$hapdos)+1,
                       ncol=length(FSdidstable)+2)
  rownames(res$hapdat) <- c("frq", rownames(hbresults$hapdos))
  colnames(res$hapdat) <- c(paste0("(",parents[FSnr, 1], ")"),
                            paste0("(",parents[FSnr, 2], ")"),
                            names(FSdidstable))
  res$hapdat[1, 1:2] <- hbresults$mrkdids[match(parents[FSnr,],
                                            colnames(hbresults$hapdos))]
  res$hapdat[1, 3:ncol(res$hapdat)] <- FSdidstable
  res$hapdat[2:nrow(res$hapdat), 1:2] <-
    hbresults$hapdos[, match(parents[FSnr,], colnames(hbresults$hapdos))]
  for (i in seq_along(FSdidstable)) {
    matcol <- which.max(hbresults$mrkdids == names(FSdidstable)[i])
    if (length(matcol) == 1) #i.e. not for mrkdid <NA>
      res$hapdat[2:nrow(res$hapdat), i+2] <- hbresults$hapdos[,matcol]
  }
  res$hapdat <- res$hapdat[rowSums(res$hapdat, na.rm=TRUE) > 0,]

  # third matrix hapcomp: composition of all haplotypes in res$hapdat
  hapnrs <- split_hapnames(hapnames=rownames(res$hapdat)[-1])$hapnrs
  r <- t(allHaplotypes(markers))
  colnames(r) <- seq_len(ncol(r))
  res$usedhap <- r[, hapnrs, drop=FALSE]
  res
} #showOneFS

#TODO:
# - impute missing marker dosages (1) in FS (2) in other?

getHaploblockResults <- function(hapresults, haploblock) {
  # Internal function used by allhap and usedhap.
  # if hapresults is already the results of a single haploblock and
  # haploblock not specified, hapresults itself is resturned.
  # Else hapresults is the list of results for all haploblocks, and the results
  # of the one indicated by haploblock is returned
  if (missing(haploblock) || is.null(haploblock)) {
    if (names(hapresults)[1] != "hapdos") {
      stop("haploblock must be specified")
    } else {
      # hapresults is already the results of a single haploblock
      return(hapresults)
    }
  }
  #haploblock specified, hapresults must be a list of results for all haploblocks
  if (length(haploblock) != 1) {
    stop("only one haploblock allowed")
  }
  if (names(hapresults)[1] == "hapdos") {
    stop("hapresults must be a list as returned by inferHaplotypes")
  }
  if (is.numeric(haploblock)) {
    if (!(haploblock %in% seq_along(hapresults)))
      stop(paste("invalid haploblock number:", haploblock))
    hbnr <- haploblock
  } else {
    if (!(haploblock %in% names(hapresults)))
      stop(paste("haploblock", haploblock, "not found"))
    hbnr <- which(names(hapresults) == haploblock)
  }
  hapresults[[hbnr]]
} #getHaploblockResult


#'@title Find all possible haplotypes
#'@description Find all possible haplotypes for a haploblock from the
#'haplotyping result
#'@usage allhap(hapresults, haploblock)
#'@param hapresults list as returned by inferhaplotypes, or one element
#'of such a list (i.e. the results for one haploblock)
#'@param haploblock if hapresults is one element of the return value of
#'inferHaplotypes, haploblock should be missing of NULL; else haploblock is
#'a single value indicating the haploblock: either its name of its index
#'in hapresults
#'@details This function works with the results of inferHaplotypes; the setting
#'of dropUnused does not affect this function
#'@return an array with all possible haplotypes. The haplotypes are in columns,
#'with the haplotype numbers as colnames; the markers are in rows.
#'@examples
#'data(PolyHaplotyper_small)
#'# show the composition of all possible haplotypes with the markers
#'# in the first haploblock:
#'allhap(hapresults=phresults, haploblock=1)
#'@export
allhap <- function(hapresults, haploblock) {
  hapresults <- getHaploblockResults(hapresults, haploblock)
  r <- t(allHaplotypes(hapresults$markers))
  colnames(r) <- seq_len(ncol(r))
  r
} # allhap

#'@title Find all used (inferred) haplotypes
#'@description Find all haplotypes for a haploblock that were inferred to be
#'present in the population (i.e. all haplotypes used for haplotyping any of
#'the individuals)
#'@usage usedhap(hapresults, haploblock)
#'@param hapresults list as returned by inferhaplotypes, or one element
#'of such a list (i.e. the results for one haploblock)
#'@param haploblock if hapresults is one element of the return value of
#'inferHaplotypes, haploblock should be missing of NULL; else haploblock is
#'a single value indicating the haploblock: either its name of its index
#'in hapresults
#'@details This function works with the results of inferHaplotypes; the setting
#'of dropUnused does not affect this function
#'@return an array with the haplotypes that are used in the
#'population. The haplotypes are in columns, with the haplotype numbers as
#'colnames; the markers are in rows.
#'@examples
#'data(PolyHaplotyper_small)
#'# show the composition of haplotypes inferred to be present
#'# in the first haploblock:
#'usedhap(hapresults=phresults, haploblock=1)
#'@export
usedhap <- function(hapresults, haploblock) {
  hapresults <- getHaploblockResults(hapresults, haploblock)
  hapnrs <-
    split_hapnames(hapnames=rownames(hapresults$hapdos)[
                       rowSums(hapresults$hapdos, na.rm=TRUE) > 0])$hapnrs
  # this takes care of the situation with dropUnused FALSE
  allhap(hapresults)[, hapnrs]
} # usedhap

#'@title merge replicate samples in dosage matrix
#'@description merge replicate samples in dosage matrix
#'@usage mergeReplicates(mrkDosage, replist, solveConflicts=TRUE)
#'@param mrkDosage a dosage matrix with markers in rows and individuals in columns.
#'row names are marker names, column names are individual names.
#'@param replist a list of character vectors, each of which has the sample names
#'of a set of replicates
#'@param solveConflicts if TRUE (default) and there are conflicting dosage
#'assignments between replicates for the same marker, the one with highest
#'frequency is used, provided the total freq of other dosages = 1 OR
#'<= 10\% of the frequency of the most frequent dosage. If
#'solveConflicts is FALSE and there are conflicting dosages, the consensus for
#'that marker will be NA
#'@details This function merges all sets of replicates, each to one column.
#'The column name of the one retained column is the first one for that set
#'in replist.\cr
#'For each set of replicates it calls getConsensusmrkDosage
#'@return a version of mrkDosage in which only one column of each set of replicates
#'is retained; this column (the first in its set as specified in replist)
#'now has the consensus scores over all replicates. Also, if mrkDosage was a
#'data.frame, it is converted into a matrix.
#'@examples
#'# construct a dosage matrix with some missing data:
#'dosmat <-
#'  matrix(c(rep(c(3,0,1),3), rep(c(1,1,2),4)), nrow=3,
#'         dimnames=list(c("mrk1","mrk2","mrk3"),
#'                       c("a1","a2","a3", "b1","b2","b3","b4")))
#'ix <- matrix(c(1,1, 3,1, 2,2, 1,3, 2,4, 1,5, 2,6), ncol=2, byrow=TRUE)
#'dosmat[ix] <- NA
#'dosmat
#'# define 2 sets of replicates:
#'reps <- list(c("a1","a2","a3"), c("b1","b2","b3","b4"))
#'# merge:
#'mergeReplicates(mrkDosage=dosmat, replist=reps)
#'# introduce a conflicting dosage:
#'dosmat[3,2] <- 2
#'# merge:
#'mergeReplicates(mrkDosage=dosmat, replist=reps)
#'@export
mergeReplicates <- function(mrkDosage, replist, solveConflicts=TRUE) {
  mrkDosage <- checkmrkDosage(mrkDosage)
  if (is.null(colnames(mrkDosage)))
    colnames(mrkDosage) <- as.character(seq_len(ncol(mrkDosage)))
  if (!is.list(replist)) replist <- list(replist)
  for (p in seq_along(replist)) {
    replist[[p]] <- as.character(replist[[p]])
    consdos <- getConsensusmrkDosage(mrkDosage=mrkDosage,
                                      indiv=replist[[p]],
                                      solveConflicts=solveConflicts)
    mrkDosage[, replist[[p]][1]] <- consdos #1st sample gets the consensus
    delcol <- colnames(mrkDosage) %in% replist[[p]][-1]
    mrkDosage <- mrkDosage[, !delcol] #delete cols of all other samples in this set
  }
  mrkDosage
} #mergeReplicates

# not exported: user function is mergeReplicates
#_#'@title get consensus marker dosages from one or more samples
#_#'@description get consensus marker dosages from one or more samples
#_#'@usage getConsensusmrkDosage(mrkDosage, indiv, solveConflicts=TRUE)
#_#'@param mrkDosage a dosage matrix with markers in rows and individuals in columns.
#_#'row names are marker names, column names are individual names.
#_#'@param indiv character vector with the names of the samples from which to
#_#'obtain consensus scores
#_#'@param solveConflicts if TRUE (default) and there are conflicting dosage
#_#'assignments between the samples for the same marker, the one with highest
#_#'frequency is used, provided the total freq of other dosages  <= 1 OR
#_#'<= 10\% of the frequency of the maximum dosage. If
#_#'solveConflicts is FALSE and there are conflicting dosages, the consensus for
#_#'that marker will be NA
#_#'@details getConsensusmrkDosage is primarily used by mergeReplicates
#_#'but can also be used by itself
#_#'@return a vector with one consensus dosage for each row of mrkDosage, and the
#_#'row names of mrkDosage as names
getConsensusmrkDosage <- function(mrkDosage, indiv, solveConflicts=TRUE) {
  indiv <- as.character(indiv)
  if (is.null(colnames(mrkDosage)))
    colnames(mrkDosage) <- as.character(seq_len(ncol(mrkDosage)))
  if (is.vector(mrkDosage)) mrkDosage <- matrix(mrkDosage, ncol=1)
  if (!is.matrix(mrkDosage)) stop("mrkDosage must be a matrix")
  if (!all(indiv %in% colnames(mrkDosage)))
    stop("not all indiv in mrkDosage")
  mrkDosage <- mrkDosage[,colnames(mrkDosage) %in% indiv, drop=FALSE]
  suppressWarnings({ #suppress warnings about all dosages NA
    mindos <- apply(mrkDosage, MARGIN=1, FUN=min, na.rm=TRUE)
    maxdos <- apply(mrkDosage, MARGIN=1, FUN=max, na.rm=TRUE)
  })
  if (solveConflicts) {
    for (i in which(mindos != maxdos)) {
      tb <- table(mrkDosage[i,]) #integer(0) is all scores NA
      suppressWarnings({ maxtb <- which(tb == max(tb)) })
      if (length(maxtb) == 1) { # 0 if all scores NA
        othercount <- sum (!is.na(mrkDosage[i,])) - tb[maxtb]
        if (othercount > 1 && othercount > 0.1 * tb[maxtb]) {
          mindos[i] <- NA
        } else mindos[i] <- as.integer(names(tb)[maxtb])
      } else mindos[i] <- NA
    }
  } else {
    #not solve conflicts, set any marker with conflicting dosages to NA:
    mindos[mindos != maxdos] <- NA #includes rows with all NA because -Inf != Inf
  }
  names(mindos) <- row.names(mrkDosage)
  mindos
} #getConsensusmrkDosage

# not exported
#_#'@title get all FS haplotype combinations expected from two parental haplotype
#_#'combinations
#_#'@description get all FS haplotype combinations expected from two parental
#_#'haplotype combinations, no considering DR, and without their frequencies
#_#'@usage getFScombs(parcomb)
#_#'@param parcomb matrix with one column for each parent and <ploidy> rows,
#_#'giving the haplotype combinations of the parents
#_#'@return a matrix with <ploidy> rows and one column for each
#_#'haplotype combination that can be generated from these two parents,
#_#'assuming polysomic inheritance with no double reduction.\cr
#_#'The colnames are NULL: haplotype combinations are not named.
getFScombs <- function(parcomb) {
  gamP1 <-   unique(getAllGametes(parcomb[, 1]), MARGIN=2)
  gamP2 <-   unique(getAllGametes(parcomb[, 2]), MARGIN=2)
  FScomb <- matrix(NA_integer_, ncol=ncol(gamP1) * ncol(gamP2),
                   nrow=nrow(parcomb))
  colnames(FScomb) <- 1:ncol(FScomb)
  for (g1 in seq_len(ncol(gamP1))) for (g2 in seq_len(ncol(gamP2))) {
    FScol <- (g1-1) * ncol(gamP2) + g2
    FScomb[, FScol] <-
      sort(c(gamP1[,g1], gamP2[,g2]))
  }
  unique(FScomb, MARGIN=2)
} #getFScombs

getAllGametes <- function (hapcomb) {
  #hapcomb: vector of the haplotype combination of a parent with <ploidy>
  #        elements
  #result: matrix with (ploidy/2) rows and one column per gamete,
  #        each element is the number (ID) of a haplotype). Within columns the
  #        haplotypes are sorted from low to high; the columns are ordered
  #        from left to right, first on row 1, then on row 2 etc (therefore
  #        duplicated columns are consecutive)
  # get from hapcomb a vector of length ploidy with all haplotypes present:
  #get the matrix with gametes:
  ploidy <- length(hapcomb)
  combn(hapcomb, ploidy/2)
} #getAllGametes

getHapcombFreq <- function(HapcombMatrix) {
  #takes a matrix, sorts it such that all columns that are identical are
  #consecutive, and returns a list with 2 elements:
  #hapcomb: a matrix like HapcombMatrix, with only unique columns
  #hcfreq: the number of times each column occurs in HapcombMatrix
  #(utility function, works with all types of matrices)
  o <- do.call("order",
               lapply(1:nrow(HapcombMatrix), function(i) HapcombMatrix[i,]))
  HapcombMatrix <- HapcombMatrix[, o, drop=FALSE]
  du <- duplicated(HapcombMatrix, MARGIN=2)
  uniqcomb <- which(!du)
  hapcounts <- c(uniqcomb[-1], length(du)+1) - uniqcomb
  list(hapcomb=HapcombMatrix[, !du, drop=FALSE], freq=hapcounts)
} #getHapcombFreq

#'@title get all gametes and their frequencies for a parental
#'haplotype combination
#'@description get all gametes and their frequencies for a parental
#'haplotype combination
#'@usage getGameteFreqs(parhac, DRrate)
#'@param parhac vector of the parental haplotype combination of length <ploidy>,
#'giving the <ploidy> haplotype numbers present
#'per haplotype giving the dosage of that haplotype, summing to ploidy
#'@param DRrate the rate of double reduction per meiosis (NOT per allele!); e.g.
#'with a DRrate of 0.04, a tetraploid parent with genotype ABCD will produce
#'a fraction of 0.04 of DR gametes AA, BB, CC and DD (each with a frequency of
#'0.01), and a fraction of 0.96 of the non-DR gametes AB, AC, AD, BC, BD, CD
#'(each with a frequency of 0.16)
#'@details for hexaploids the DR gametes consist of a duplication of one of the
#'6 parental alleles, combined with one copy of one of the other 5 alleles.\cr
#'Calculation is faster if DRrate is 0.0
#'@return a list of 2 elements:\cr
#'hapcomb: a matrix with one column per unique gamete and ploidy/2 rows.
#'Each element is the number (ID) of a haplotype). Within columns the
#'haplotypes are sorted from low to high; the columns are ordered from left
#'to right, first on row 1, then on row 2 etc\cr
#'freq: a vector of length ncol(hapcomb), with for each gamete in hapcomb its
#'frequency (0.0 ... 1.0)
#'@examples
#'# specify combination of haplotypes in a tetraploid parent:
#'hapcomb <- c(2,2,5,6) # 2 copies of haplotype 2, 1 each of 5 and 6
#'# gamete frequencies without double reduction:
#'getGameteFreqs(parhac=hapcomb, DRrate=0)
#'# gamete frequencies with 5\% double reduction:
#'getGameteFreqs(parhac=hapcomb, DRrate=0.05)
#'@export
getGameteFreqs <- function(parhac, DRrate) {
  noDR <- getHapcombFreq(getAllGametes(parhac))
  noDR$freq <- (1.0-DRrate) * noDR$freq / sum(noDR$freq)
  #get DR gametes:
  if (DRrate == 0.0) noDR else {
    # calculate the gamete freqs in case of DR:
    # make a matrix in which each of the columns will be a gamete
    ploidy <- length(parhac)
    if (ploidy==4) {
      # all DR gametes are just 1 of the 4 alleles duplicated
      haps <- unique(parhac)
      hapfrq <- table(parhac)
      DR <- list(hapcomb=rbind(haps, haps), freq=hapfrq)
      rownames(DR$hapcomb) <- NULL
    } else {
      # In a hexaploid a DR gamete contains one duplicated allele
      # combined with any of the other 5 alleles.
      # This does not generalize to ploidy 8 and higher because then there
      # may be more than one quadrivalent, and each quadrivalent can produce
      # a DR, i.e. each gamete may have more than one duplicated haplotype.
      # However, in an octo- or decaploid the chances of having two
      # quadrivalents, each producing a DR are small (about DRrate squared)
      # so we ignore that.
      x <- choose((ploidy-1), (ploidy/2-2)) #nr of gametes with one specific
      #                                      allele DR
      HMx <- matrix(integer(0), nrow=ploidy/2,
                    ncol=ploidy*x)
      for (d in 1:ploidy) {
        cols <- ((d-1) * x + 1) : (d * x)
        HMx[1:2, cols] <- parhac[d]
        HMx[3:(ploidy/2), cols] <- combn(parhac[-d], ploidy/2-2)
      }
      HMx <- apply(HMx, 2, sort) #sort each column separately
      HMx <- HMx[, order(HMx[1,], HMx[2,], HMx[3,])] #reorder all columns
      # HMx is now sorted, and comparable to that from getAllGametes
      DR <- getHapcombFreq(HMx)
    }
    # combine the noDR and DR gametes:
    DR$freq <- DRrate * DR$freq / sum(DR$freq)
    freq <- c(noDR$freq, DR$freq)
    hapcomb <- cbind(noDR$hapcomb, DR$hapcomb)
    hapcomb <- hapcomb[, freq>0, drop=FALSE]
    freq <- freq[freq>0]
    #find the order to sort the hapcomb columns (on all 2 or 3 rows):
    o <- do.call("order", lapply(1:nrow(hapcomb), function(i) hapcomb[i,]))
    hapcomb <- hapcomb[, o]
    freq <- freq [o]
    du <- duplicated(hapcomb, MARGIN=2)
    uniqcomb <- which(!du)
    #sum the freqs per set of duplicates:
    uniqfreq <- numeric(length(uniqcomb))
    uhx <- c(uniqcomb, length(du)+1)
    for (u in seq_along(uniqfreq))
      uniqfreq[u] <- sum(freq[uhx[u]:(uhx[u+1]-1)])
    list(hapcomb=hapcomb[, uniqcomb, drop=FALSE], freq=uniqfreq)
  }
} #getGameteFreqs

#'@title get all FS haplotype combinations expected from two parental haplotype
#'combinations, with their frequencies
#'@description get all FS haplotype combinations expected from two parental
#'haplotype combinations, with their frequencies. Different from getFScombs in
#'that it returns the unique FS combinations as well as their frequencies, and
#'that it also considers Double Reduction
#'@usage getFSfreqs(parhac, DRrate)
#'@param parhac matrix with one column for each parent and <ploidy> rows,
#'giving the haplotype combinations for each parent
#'@param DRrate the rate of double reduction per meiosis (NOT per allele!); e.g.
#'with a DRrate of 0.04, a tetraploid parent with genotype ABCD will produce
#'a fraction of 0.04 of DR gametes AA, BB, CC and DD (each with a frequency of
#'0.01), and a fraction of 0.96 of the non-DR gametes AB, AC, AD, BC, BD, CD
#'(each with a frequency of 0.16)
#'@return a list of 2 elements:\cr
#'FShac: a matrix with one column per unique FS haplotype combination and
#'the same row count as parhac, giving the FS haplotype combinations. There are
#'no duplicated columns but several (not necessarily adjacent) columns
#'may correspond to the same mrkdid\cr
#'freq: a vector of length ncol(hapcomb), with for each FS haplotype combination
#'in hapcomb its frequency (0.0 ... 1.0). The colnames are NULL: haplotype
#'combinations are not named.
#'@examples
#'# specify combinations of haplotypes in two tetraploid parents:
#'hapcomb <- matrix(c(2,2,5,6, 1,2,5,5), ncol=2)
#'# FS frequencies without double reduction:
#'getFSfreqs(parhac=hapcomb, DRrate=0)
#'# FS frequencies with 5\% double reduction:
#'getFSfreqs(parhac=hapcomb, DRrate=0.05)
#'@export
getFSfreqs <- function(parhac, DRrate) {
  gamP1 <-   getGameteFreqs(parhac[, 1], DRrate)
  gamP2 <-   getGameteFreqs(parhac[, 2], DRrate)
  getFSfreqs_gamfrq(gamP1, gamP2)
}

getFSfreqs_gamfrq <- function(gamP1, gamP2) {
  #function that does most of the actual work of getFSfreqs;
  #also called by checkPedigree
  #gamP1, gamP2: result of getGameteFrqs for two parents and specified DRrate
  #return value: see getFSFreqs
  FScomb <- matrix(NA_integer_, ncol=ncol(gamP1$hapcomb) * ncol(gamP2$hapcomb),
                   nrow=nrow(gamP1$hapcomb) + nrow(gamP2$hapcomb))
  FSfreq <- numeric(ncol(FScomb))
  colnames(FScomb) <- 1:ncol(FScomb)
  for (g1 in seq_len(ncol(gamP1$hapcomb)))
    for (g2 in seq_len(ncol(gamP2$hapcomb))) {
    FScol <- (g1-1) * ncol(gamP2$hapcomb) + g2
    FScomb[, FScol] <-
      sort(c(gamP1$hapcomb[,g1], gamP2$hapcomb[,g2]))
    FSfreq[FScol] <- gamP1$freq[g1] * gamP2$freq[g2]
  }
  #find the order to sort the FScomb columns (on all rows):
  o <- do.call("order", lapply(1:nrow(FScomb), function(i) FScomb[i,]))
  FScomb <- FScomb[, o, drop=FALSE]
  FSfreq <- FSfreq [o]
  du <- duplicated(FScomb, MARGIN=2)
  uniqcomb <- which(!du)
  #sum the freqs per set of duplicates:
  uniqfreq <- numeric(length(uniqcomb))
  uhx <- c(uniqcomb, length(du)+1)
  for (u in seq_along(uniqfreq))
    uniqfreq[u] <- sum(FSfreq[uhx[u]:(uhx[u+1]-1)])
  #remove the duplicated columns and the column names:
  FScomb <- FScomb[, uniqcomb, drop=FALSE]
  colnames(FScomb) <- NULL
  list(FShac=FScomb, freq=uniqfreq)
} #getFSfreqs_gamfrq

checkFS_TwoParents <- function(pargamlist, FShac) {
  # pargamlist: a list with two elements, one for each parent. Each element
  #             is itseld a list with two elements named $noDR and $DR; each of
  #             these is a list as returned by getGameteFreqs, the noDR obtained
  #             with DRrate=0 and the other with DRrate>0 (say 0.04, we only
  #             need the set of possible gametes with DR; their freq (and so
  #             the exact DRrate) is not used).
  # FShac: matrix with ploidy rows and one column for each FS indiv, does
  #        not contain NAs: the haplotype combinations of the FS indiv.
  #        FShac is sorted such that identical columns are adjacent
  nrows <- sapply(pargamlist, length)
  if (max(nrows) != min(nrows)) stop("nrows differ")
  matchparents <- matrix(FALSE, nrow=nrows[1], ncol=ncol(FShac),
                         dimnames=list(NULL, colnames(FShac)))
  if (!is.null(names(pargamlist[[1]])))
    rownames(matchparents) <- names(pargamlist[[1]])
  # matchparents is the return value: for each FS indiv it is TRUE if it
  # is a possible progeny of the parents, without and with assuming DR,
  # and else FALSE. Cannot be NA as all FS indiv with missing hac were already
  # excluded from FShac
  du <- duplicated(FShac, MARGIN=2)
  uniqcomb <- which(!du)
  #sum the freqs per set of duplicates:
  uhx <- c(uniqcomb, length(du)+1)
  uniqfreq <- diff(uhx)
  #now we have the inferred haplotype combinations in FShac[, uniqcomb] and
  #their frequencies in uniqfreq
  #next we determine how many of these are expected without and with DR:
  for (m in seq_len(nrow(matchparents))) { #modes: 1 = no DR, 2 = with DR
    FSexp <- getFSfreqs_gamfrq(pargamlist[[1]][[m]],
                               pargamlist[[2]][[m]])
    for (hc in seq_along(uniqcomb)) {
      indiv <- colnames(FShac)[uhx[hc]:(uhx[hc+1]-1)]
      alleq <- function(x, y) { isTRUE(all.equal(x, y)) }
      #        identical returns FALSE, all.equal TRUE for a matching column,
      #        but all.equal returns a character for a non-matching column
      matchcols <- apply(FSexp$FShac, MARGIN=2,
                         FUN=alleq, FShac[, uniqcomb[hc]])
      # so, for each expected hac (each column of FSexp$FShac) we get
      # a TRUE or FALSE result on whether it matches the currently
      # studied observed hapcomb (FShac[, uniqcomb[hc]])
      if (any(matchcols)) {
        matchparents[m, indiv] <- TRUE
      }
    } #for hc
  } # for m
  matchparents
} # checkFS_TwoParents

checkFS_OneParent <- function(pargamlist, FShac, nhap) {
  # pargamlist: a list with two elements named $noDR and $DR; each a list
  #             as returned by getGameteFreqs, the noDR obtained with DRrate=0
  #             and the other with DRrate>0 (say 0.04, we only need the set of
  #             possible gametes with DR; their freq (and so the exact
  #             DRrate) is not used). Or a list with only one such element.
  # FShac: matrix with ploidy rows and one column for each FS indiv, does
  #        not contain NAs: the haplotype combinations of the FS indiv.
  #        FShac is sorted such that identical columns are adjacent
  # nhap: the total number of possible haplotypes in this haploblock
  matchparent <- matrix(FALSE, nrow=length(pargamlist), ncol=ncol(FShac),
                        dimnames=list(NULL, colnames(FShac)))
  if (!is.null(names(pargamlist)))
    rownames(matchparent) <- names(pargamlist)
  # matchparents is the return value: for each FS indiv it is TRUE if it
  # is a possible progeny of the parent, without and with assuming DR,
  # and else FALSE. Cannot be NA as all FS indiv with missing hac were already
  # excluded from FShac
  du <- duplicated(FShac, MARGIN=2)
  uniqcomb <- which(!du)
  uniqhad <- hapcomb2hapdos(FShac[, uniqcomb, drop=FALSE], nhap)
  #sum the freqs per set of duplicates:
  uhx <- c(uniqcomb, length(du)+1)
  uniqfreq <- diff(uhx) #uhx[-1] - uhx[-length(uhx)]
  #now we have the inferred haplotype combinations and their frequencies;
  #next we determine how many of these are expected without and with DR.
  #we have only the expected gametes of par1; a hapcomb in HS can only
  #occur if there is at least one gamete of par1 that has less than or
  #equal the nr of copies of all haplotypes
  for (m in seq_along(pargamlist)) { #modes: 1 = no DR, 2 = with DR
    expgamhad <- hapcomb2hapdos(pargamlist[[m]]$hapcomb, nhap)
    for (hc in seq_along(uniqcomb)) {
      indiv <- colnames(FShac)[uhx[hc]:(uhx[hc+1]-1)]
      smeq <- function(x, y) { all(x <= y) }
      matchcols <- apply(expgamhad, MARGIN=2,
                         FUN=smeq, uniqhad[, hc]) #0, 1 or more TRUE
      # so, for each expected gamete (each column of expgamhad) we get
      # a TRUE or FALSE result on whether it could have participated
      # in currently studied observed FShad (uniqhad[, hc])
      if (any(matchcols)) {
        #indivs with this uniqhad match with the parent
        matchparent[m, indiv] <- TRUE
      }
    } # for hc
  } # for m
  matchparent
} #checkFS_OneParent


##!title check the consistency of a single haploblock genotypes over a pedigree
##!description For one haploblock, check whether the inheritance of inferred
##!haplotypes over the pedigree is consistent (without or with DR, and assuming
##!polysomic inheritance)
##!usage pedigreeHapCheck_OneHaploblock(ped, parentnames, mrkDosage, hapcomb,
##!nhap)
##!param ped data.frame or matrix listing the pedigree. Column 1 has names of
##!all individuals (no duplicates, no NA), column 2 and 3 have the parents
##!(duplicates and NA allowed, also individuals with one known parent or
##!with two identical parents allowed). No sorting of the pedigree is needed but
##!all parents must also be listed as individuals
##!param parentnames names of all individuals that occur as parents in ped
##!(must be correct, not checked, serves to avoid double work)
##!param mrkDosage a matrix with one row for each marker in the haploblock
##!and one column for each individual, with the observed dosages. Individuals
##!with one or more NA dosages are considered to have no marker data. Individual
##!names are the colnames, no duplicates allowed; the set of individuals
##! need not be the same as those in ped; individual names common to mrkDosage and
##! ped must be identical
##!param hapcomb a matrix with <ploidy> rows and one column for each individual,
##!with the inferred haplotype combinations per individual. Individual names are
##!the colnames, the set of individuals need not be the same as those in ped;
##!individual names common to hapcomb and ped must be identical. hapcomb has
##!no missing values: all individuals with no assigned hapcomb were omitted.
##!return a list with two items:\cr
##! $ped: the same pedigree as the input, with two additional columns
##! matchParents_noDR and matchParents_withDR. FALSE indicates a conflict with
##! (one of) the parents, FALSE no conflict, NA if individual or both its
##! parents NA.
##! $parents: a data frame with one row for each individual that occurs as a
##! parent, with some statistics about all its progeny and its match with all
##! its progeny
# don't export
pedigreeHapCheck_OneHaploblock <- function(ped, parentnames, mrkDosage,
                                           hapcomb, nhap, nohapdata,
                                           imputedGeno) {
  # - for each parent in the pedigree generate all its gametes
  #   * without and with DR
  # - sort the pedigree into FS families
  # - for each (reciprocal) cross generate all its possible progeny
  #   * without and with DR
  # - tabulate the inferred haplotype combinations for each FS family
  #   and check for each if it is expected without and with DR
  # if nohapdata is TRUE the arrays in the result are filled with NAs or zeroes
  # as appropriate
  # Result: a list with 2 elements:
  # $ped: a logical matrix with one row per individual and 5 columns:
  #       mrk, imp, hap, noDR, withDR: is it genotyped, imputed, haplotyped,
  #       does it conflict with its parents without or with allowing DR
  # $parents: an integer matrix with one row per individuals that is a parent in
  #       ped, and columns par_mrkdata, par_hapdata, totprogeny, prog_mrkdata,
  #       prog_hapdata, nonDRmatch, DRmatch: first 2 are 0/1 (is parent
  #       genotyped and haplotyped), then the nr of direct progeny and the nr
  #       of progeny that is genotyped and haplotyped, and finally the nr of
  #       progeny that matches (can be produced) without or with allowing DR
  peddat <- matrix(NA, nrow=nrow(ped), ncol=5,
                   dimnames=list(ped[,1], c("mrk", "imp", "hap", "noDR", "withDR")))
  mode0_ped <- which(colnames(peddat) == "noDR") - 1
  peddat[, "mrk"] <- #mrk: complete marker dosage data for this indiv in this haploblock?
    !is.na(colSums(mrkDosage[, match(rownames(peddat), colnames(mrkDosage)), drop=FALSE]))
  if (is.null(imputedGeno)) peddat[, "imp"] <- FALSE else
  peddat[, "imp"] <- #imp: was the marker genotype of this indiv imputed for this block?
    rownames(peddat) %in% colnames(imputedGeno)
  parcols <- c("par_mrkdata", "par_hapdata", "totprogeny",
               "prog_mrkdata", "prog_hapdata", "nonDRmatch", "DRmatch")
  parents <- matrix(as.integer(0), nrow=length(parentnames), ncol=length(parcols),
                    dimnames=list(parentnames, parcols))
  mode0_par <- which(parcols == "nonDRmatch") - 1 # base column number for modes
  parents[, "par_mrkdata"] <-
    !is.na(colSums(mrkDosage[, match(parentnames, colnames(mrkDosage)), drop=FALSE]))

  if (nohapdata) {
    peddat[, "hap"] <- 0
    parents[, c("par_hapdata", "prog_hapdata")] <- 0
  } else {
    if (any(is.na(hapcomb[1,])))
        stop("hapcomb should not be empty and not contain missing data")
    # sort hapcomb such that identical columns are adjacent:
    # (hapcomb only contains individuals that have a haplotype combination
    # assigned, i.e. no NAs)
    o <- do.call("order", lapply(1:nrow(hapcomb), function(i) hapcomb[i,]))
    hapcomb <- hapcomb[, o, drop=FALSE]
    peddat[, "hap"] <- #hap: assigned haplotype combi for this indiv?
      rownames(peddat) %in% colnames(hapcomb) #no test for NAs needed: not present
    parentsNodata <- setdiff(parentnames, colnames(hapcomb))
    parents[, "par_hapdata"] <-
      parentnames %in% colnames(hapcomb) #hapcomb has only haplotyped individuals
    # first create a list with the possible gametes from each parent:
    pargamlist <- list()
    for (par in parentnames) {
      if (par %in% parentsNodata) {
        pargamlist[[par]] <- "nodata"
      } else {
        pargamlist[[par]] <- list()
        pargamlist[[par]]$noDR <-
          getGameteFreqs(parhac=hapcomb[, par], DRrate=0)
        pargamlist[[par]]$DR <-
          getGameteFreqs(parhac=hapcomb[, par], DRrate=0.04)
        #  we only need the set of possible gametes; their freq (and so the exact
        #  DRrate) is not used
      }
    }
    #for each fullsib family, get all possible progeny haplotype combinations
    #and check the inferred ones against these:
    for (p1 in seq_along(parentnames)) {
      par1 <- parentnames[p1]
      for (p2 in p1:nrow(parents)) {
        par2 <- parentnames[p2]
        FS <- ped[!is.na(ped[, 2]) & ped[, 2] %in% c(par1, par2) &
                    !is.na(ped[, 3]) & ped[, 3] %in% c(par1, par2),, drop=FALSE]
        if (par1 != par2) FS <- FS[FS[, 2] != FS[, 3],, drop=FALSE]
        if (nrow(FS) > 0) {
          # this parental combination occurs in ped
          parents[c(p1, p2), "totprogeny"] <-
            parents[c(p1, p2), "totprogeny"] + nrow(FS)
          parents[c(p1, p2), "prog_mrkdata"] <-
            parents[c(p1, p2), "prog_mrkdata"] +
            sum(!is.na(colSums(mrkDosage[, colnames(mrkDosage) %in% FS[, 1],
                                         drop=FALSE])))
          FShac <- hapcomb[, colnames(hapcomb) %in% FS[, 1], drop=FALSE]
          parents[c(p1, p2), "prog_hapdata"] <-
            parents[c(p1, p2), "prog_hapdata"] + ncol(FShac)
          if (all(parents[c(p1, p2), "par_hapdata"] > 0) && ncol(FShac) > 0) {
            #both parents have haplotypes and at least some FS indiv in hapcomb
            chk <- checkFS_TwoParents(pargamlist=pargamlist[c(p1,p2)],
                                      FShac=FShac)
            pedrows <- match(colnames(chk), rownames(peddat))
            peddat[pedrows, mode0_ped+(1:2)] <- t(chk)
            parents[c(p1, p2), mode0_par+(1:2)] <-
              parents[c(p1, p2), mode0_par+(1:2)] +
              rep(rowSums(chk, na.rm=TRUE), each=2)
          }
        } # nrow(FS) > 0
      } # for p2
      #and for the individuals of which only one parent is known in ped and has
      #been haplotyped:
      HS <- ped[(!is.na(ped[, 2]) & ped[, 2] == par1 &
                   (is.na(ped[, 3]) | ped[, 3] %in% parentsNodata)) |
                (!is.na(ped[, 3]) & ped[, 3] == par1 &
                   (is.na(ped[, 2]) | ped[, 2] %in% parentsNodata)),, drop=FALSE]
      # includes all indiv with par1 as one parent and the other parent
      # unknown or with missing hapcomb
      if (nrow(HS) > 0) {
        # there are individuals in ped with p1 as only parent
        #
        # only the HS indivs without known second parent must be added to
        # the totals, the others (where the second parent is known but
        # has no assigned hapcomb) were already counted above (because the
        # FS were defined on the ped, which also includes parents without data)
        OneParent <- HS[rowSums(is.na(HS[, 2:3, drop=FALSE])) == 1,, drop=FALSE]
        parents[p1, "totprogeny"] <- parents[p1, "totprogeny"] + nrow(OneParent)
        parents[p1, "prog_mrkdata"] <-
          parents[p1, "prog_mrkdata"] +
          sum(!is.na(colSums(mrkDosage[, colnames(mrkDosage) %in% OneParent[, 1],
                                       drop=FALSE])))
        parents[p1, "prog_hapdata"] <-
          parents[p1, "prog_hapdata"] +
          sum(!is.na(hapcomb[1, colnames(hapcomb) %in% OneParent[, 1]]))
        # for the whole HS (all indiv with only one known parent plus
        # all indiv with two parents where the second parent has missing haplodata)
        # we calculate the nr that match with this parent (which was not done
        # earlier)
        HShac <- hapcomb[, colnames(hapcomb) %in% HS[, 1], drop=FALSE]
        if (parents[p1, 2] && ncol(HShac) > 0 ) {
          chk <- checkFS_OneParent(pargamlist=pargamlist[[p1]],
                                    FShac=HShac, nhap=nhap)
          pedrows <- match(colnames(chk), rownames(peddat))
          peddat[pedrows, mode0_ped+(1:2)] <- t(chk)
          parents[p1, mode0_par+(1:2)] <-
            parents[p1, mode0_par+(1:2)] + rowSums(chk, na.rm=TRUE)
        }
      } # nrow(HS) > 0
    } # for p1
  } # at least some indiv with haplotypes
  list(ped=peddat, parents=parents)
} #pedigreeHapCheck_OneHaploblock

#'@title check the consistency of haploblock genotypes over a pedigree
#'@description For all haploblocks, check whether the inheritance of inferred
#'haplotypes over the pedigree is consistent without or with allowing for
#'double reduction (DR), and assuming polysomic inheritance.
#'@usage pedigreeHapCheck(ped, mrkDosage, haploblock, hapresults)
#'@param ped data.frame or matrix listing the pedigree. Column 1 has names of
#'all individuals (no duplicates, no NA), column 2 and 3 have the parents
#'(duplicates and NA allowed, also individuals with one known parent or
#'with two identical parents allowed). No sorting of the pedigree is needed.
#'All parents should also be listed as individuals; if that is not the case
#'the missing parents will be added as founders and a warning will be issued.
#'@param mrkDosage a matrix with one row for each marker
#'and one column for each individual, with the observed dosages. Individuals
#'with one or more NA dosages are considered to have no marker data. Individual
#'names are the colnames, no duplicates allowed; the set of individuals
#'needs not be the same as those in ped; individual names common to mrkDosage
#'and ped must be exactly identical (upper/lower case and whitespace included)
#'@param haploblock a list of character vectors. The names are the names of the
#'haploblocks, the character vectors have the names of the markers in each
#'haploblock
#'@param hapresults a list as returned by inferHaplotypes, with one
#'item per haploblock, containing at least all those in the list
#'specified by haploblock
#'@return a list with two items:\cr
#'- ped_arr: a 3D logical array with dimensions individuals, diagnostics and
#'haploblocks. For each individual and each haploblock there are 4
#'diagnostics:\cr
#'  * mrk: does the individual have complete marker dosage data?\cr
#'  * imp: were the marker dosages for this individual imputed?\cr
#'  * hap: is there a haplotype combination assigned to the individual?\cr
#'  * noDR: does the haplotype genotype of the individual match that of its
#'  parents, assuming polysomic inheritance but no Double Reduction? NA if the
#'  individual or both its parents do not have a haplotype genotype assigned\cr
#'  * withDR: as noDR, but allowing Double Reduction\cr
#'- parents_arr: a 3D integer array with dimensions parents (all individuals
#'that occur as parents in the pedigree), diagnostics and haploblocks.
#'For each parent and each haploblock there are 7 diagnostics:\cr
#' * par_mrkdata: 0=FALSE, 1=TRUE, does this parent have complete marker data?\cr
#' * par_hapdata: 0=FALSE, 1=TRUE, does this parent have a haplotype
#' genotype assigned?\cr
#' * totprogeny: how many first-generation progeny (children) does this parent
#' have (combined over all its matings, both as mother and as father)\cr
#' * prog_mrkdata: how many progeny have complete marker dosage data?\cr
#' * prog_hapdata: how many progeny have a haplotype combination assigned?\cr
#' * nonDRmatch: how many progeny have a haplotype combination that is
#' compatible with their parent's haplotype combinations, assuming polysomic
#' inheritance but no Double Reduction\cr
#' * DRmatch: as nonDRmatch, but also allowing DR\cr
#'Both ped_arr and parents_arr contain all haploblocks in haploblock, also
#'those skipped because of too many markers and those without any haplotyped
#'individuals. These can be excluded by excluding them from the haploblock list.
#'@examples
#'data(PolyHaplotyper_small)
#'phchk <- pedigreeHapCheck(ped=phped, mrkDosage=phdos, haploblock=phblocks,
#'                          hapresults=phresults)
#'# show the top of the ped_arr for haploblock 1:
#'phchk$ped_arr[1:6,,1]
#'# show the top of the parents_arr for haploblock 1:
#'phchk$parents_arr[1:6,,1]

#'@export
pedigreeHapCheck <- function(ped, mrkDosage, haploblock,
                                hapresults) {
  hbnames <- names(haploblock)
  if (!all(hbnames %in% names(hapresults)))
    stop("not all haploblocks specified by haploblock occur in hapresults")
  hapresults <- hapresults[hbnames] # same length and same order as haploblock
  #which haploblocks were done (i.e. have a non-error result):
  hbislist <- sapply(hapresults, FUN=is.list)
  hbnomess <- sapply(hapresults, FUN=function(x) (is.null(x$message) || x$message==""))
  hbrows <- sapply(hapresults, FUN=function(x) (!is.null(x$hapdos) && nrow(x$hapdos) > 0)) # this last condition added by Giorgio
  hbdone <- hbislist & hbnomess & hbrows
  #check if all parents are also listed as individuals in pedigree:
  ped <- cbind(as.character(ped[,1]), as.character(ped[,2]),
                 as.character(ped[,3])) #works for data.frame and matrix
  parentnames <- sort(setdiff(c(ped[,2], ped[,3]), NA))
  missingpar <- !(parentnames %in% ped[,1])
  if (sum(missingpar) > 0) {
    message(paste(sum(missingpar),
                  "parents are not listed as individuals in ped;",
                  "added as founders"))
    addparents <- parentnames[missingpar]
    ped <- rbind(ped, cbind(addparents, NA, NA))
  }
  pedarr <- array(NA, dim=c(nrow(ped), 5, length(haploblock)),
                  dimnames=list(as.character(ped[,1]),
                                c("mrk", "imp", "hap", "noDR", "withDR"),
                                hbnames))
  pararr <- array(NA_integer_, dim=c(length(parentnames), 7, length(haploblock)),
                  dimnames=list(parentnames,
                                c("par_mrk", "par_hap", "prog_tot",
                                  "prog_mrk", "prog_hap",
                                  "nonDRmatch", "DRmatch"),
                                hbnames))
  for (h in seq_along(haploblock)) {
    hbname <- hbnames[h]
    cat(paste0("checking haploblock ", h, " of ", length(hbdone), ": ",
               hbname,"\n")); flush.console()
    if (hbdone[h]) {
      nhap <- 2 ^ length(hapresults[[hbname]]$markers)
      tmp <-
        hapresults[[hbname]]$hapdos[, !is.na(hapresults[[hbname]]$hapdos[1,]),
                                    drop=FALSE]
    } else {
      # not hbdone[h]
      tmp <- matrix(integer(0), ncol=0) # to signal that no individuals haplotyped
    }
    if (ncol(tmp) == 0) {
        cat(paste("haploblock ", h, ": no individuals without missing values\n"))
    } else {
        cs <- colSums(tmp)
        if (anyNA(cs)) stop(paste("haploblock ", h, ": inconsistent missing values"))
        # so tmp has 1 or more rows and cols and no missing values
        if (min(cs) != max(cs)) stop(paste("haploblock ", h, ": inconsistent haplotype dosages"))
        ploidy <- min(cs)
        tmp <- hapdos2hapcomb(hapdos=tmp, ploidy=ploidy)
    }
    if ("imputedGeno" %in% names(hapresults[[h]]))
      imputedGeno <- hapresults[[h]]$imputedGeno else imputedGeno <- NULL
    pedchk <- pedigreeHapCheck_OneHaploblock(ped=ped,
      parentnames=parentnames,
      mrkDosage=mrkDosage[rownames(mrkDosage) %in% haploblock[[hbname]],, drop=FALSE],
      hapcomb=tmp, nhap=nhap, nohapdata=!hbdone[h], imputedGeno=imputedGeno)
    pedarr[,, h] <- pedchk$ped
    pararr[,, h] <- pedchk$parents
  } # for h
  list(ped_arr=pedarr, parents_arr=pararr)
} #pedigreeHapCheck

#'@title generate an overview of the results by haploblock and by FS family
#'@description generate an overview of the results by haploblock and by
#'FS family
#'@usage overviewByFS(haploblock, parents, FS, hapresults)
#'@param haploblock a list of character vectors. The names are the names of the
#'haploblocks, the character vectors have the names of the markers in each
#'haploblock.
#'@param parents a matrix of two colums and one row per FS, containing the names
#'of the two parents of each FS
#'@param FS a list of character vectors with the names of the samples for
#'each FS
#'@param hapresults a list as returned by inferHaplotypes, with one
#'item per haploblock, containing at least all those in the list
#'specified by haploblock
#'@return a list with two items:\cr
#'- ovw: an integer matrix with one row per haploblock and the following
#'columns:\cr
#'  * nmrk: the number of markers in the haploblock\cr
#'  * nhap: the number of different haplotypes assigned over all individuals
#'    (NA if no solution was found for this haplotype; the reason for that is
#'    listed in the first column of item messages of the return value)\cr
#'  * for each FS family a set of 6 columns:\cr
#'  + parmrk (0, 1 or 2: the number of parents with complete marker data)\cr
#'  + fit (0=FALSE or 1=TRUE), indicating if a solution for the FS was found
#'    based on polysomic inheritance)\cr
#'  + mrk: the number of FS progeny with complete marker data\cr
#'  + imp: the number of FS progeny where complete marker data were imputed\cr
#'  + hap: the number of FS progeny with assigned haplotype combinations.
#'    hap will be less than the mrk value if the same FS marker genotype can be
#'    produced with different combinations of haplotypes that are all compatible
#'    with the parental haplotype combinations) or if some FS marker genotypes
#'    cannot be produced by the inferred parental haplotype combinations\cr
#'  + P: the chi-squared P-value of the best fitting solution, even if this is
#'    discarded because of lack of fit).\cr
#'  * For "rest" (all individuals that are not part of the FS's or their
#'  parents) and "all" (all individuals) there are also columns mrk and hap,
#'  and for "all" there is also a column imp, similar to those for the
#'  FS families. The numbers for "all" are the sums
#'  of those for the FS families, the FS parents (some FS may share a parent
#'  but shared parents are counted only one) and the "rest".\cr
#'- messages  : a character matrix with one row per haploblock and the following
#'columns:\cr
#'  * haploblock: the reason why there is no solution for the
#'  haploblock ("" if there is a solution)
#'  * one column for each FS family with a possible message or "". A message
#'  can indicate a failure to find a solution for the FS family but may also
#'  describe less significant problems, such as some progeny with unexpected
#'  marker dosages etc.
#'@examples
#'data(PolyHaplotyper_small)
#'overviewByFS(haploblock=phblocks, parents=phpar, FS=phFS,
#'             hapresults=phresults)
#'@export
overviewByFS <- function(haploblock, parents, FS, hapresults) {
  hbnames <- names(haploblock)
  if (!all(hbnames %in% names(hapresults)))
    stop("not all haploblocks specified by haploblock occur in hapresults")
  hapresults <- hapresults[hbnames] # same length and same order as haploblock
  nFS <- length(FS)
  if (nFS == 0) stop("no FS's")
  FS <- lapply(FS, as.character)
  parents <- matrix(as.character(parents), ncol=2)
  perFScols <- c("parmrk", "fit", "P", "mrk", "imp", "hap")
  lastFScol <- 2 + length(perFScols)*nFS
  FSsok <- TRUE
  ovw <- matrix(NA_integer_, nrow=length(hapresults),
                ncol=lastFScol + 5)
  rownames(ovw) <- names(hapresults)
  colnames(ovw) <- c("nmrk", "nhap",
                     paste0(perFScols,
                            rep(1:nFS, each=length(perFScols))),
                     "mrkrest", "haprest", "mrkall", "impall", "hapall")
  ovw[, 1] <- sapply(haploblock, length) # nmrk
  messages <- matrix("", nrow=length(hapresults), ncol=1 + nFS)
  rownames(messages) <- names(hapresults)
  colnames(messages) <- c("haploblock", paste0("FS", padded(1:nFS)))
  for (hb in seq_along(hapresults)) {
    if (hapresults[[hb]]$message != "") {
      messages[hb, 1] <- hapresults[[hb]]$message
    } else {
      FSsok <- FSsok && "FSfit" %in% names(hapresults[[hb]])
      if (FSsok)
        messages[hb, 2:(nFS+1)] <- hapresults[[hb]]$FSmessages
      ovw[hb, 2] <-
        sum(rowSums(hapresults[[hb]]$hapdos, na.rm=TRUE) > 0) #nhap
      totimputed <- 0
      for (fs in 1:nFS) {
        # parmrk: how many parents have complete marker data?
        basecol <- 3 + length(perFScols)*(fs-1)
        ovw[hb, basecol] <-
          sum(!is.na(hapresults[[hb]]$mrkdids[
            names(hapresults[[hb]]$mrkdids) %in% parents[fs,] ]))
        # imp: how many FS indiv have imputed mrk data?
        if ("imputedGeno" %in% names(hapresults[[hb]])) {
          ovw[hb, basecol+4] <-
            sum(colnames(hapresults[[hb]]$imputedGeno) %in% FS[[fs]])
          totimputed <- totimputed + ovw[hb, basecol+4]
        } else ovw[hb, basecol+4] <- 0
        # mrk: how many FS indiv have complete marker data?
        ovw[hb, basecol+3] <-
          sum(!is.na(hapresults[[hb]]$mrkdids[
            names(hapresults[[hb]]$mrkdids) %in% FS[[fs]] ])) -
          ovw[hb, basecol+4] # mrkdids present for original mrk and imputed
        # hap: how many FS indiv have inferred haplotype dosages?
        if (nrow(hapresults[[hb]]$hapdos) == 0) {
          ovw[hb, basecol+5] <- 0
        } else {
          hap <- which(!is.na(hapresults[[hb]]$hapdos[1,]))
          hap <- colnames(hapresults[[hb]]$hapdos)[hap]
          hap <- hap[hap %in% FS[[fs]]]
          ovw[hb, basecol+5] <-
            sum(!is.na(hapresults[[hb]]$hapdos[1,
              colnames(hapresults[[hb]]$hapdos) %in% FS[[fs]] ]))
        }
        if (FSsok) {
          # fit: was an FS solution found?
          ovw[hb, basecol+1] <- hapresults[[hb]]$FSfit[fs]
          # P: P-value for best segregation, even if rejected:
          ovw[hb, basecol+2] <- hapresults[[hb]]$FSpval[fs]
        }
      } # for fs
      #data for "rest": the non-FS, non-FS-parents indiv
      rest <- setdiff(names(hapresults[[hb]]$mrkdids),
                      c(unlist(FS), parents))
      # mrkrest: how many rest indiv with complete marker data
      ovw[hb, lastFScol + 1] <-
        sum(!is.na(hapresults[[hb]]$mrkdids[
          names(hapresults[[hb]]$mrkdids) %in% rest]))
      # haprest: how many rest indiv haplotyped
      if (nrow(hapresults[[hb]]$hapdos) == 0) {
        ovw[hb, lastFScol + 2] <- 0
      } else {
        ovw[hb, lastFScol + 2] <-
          sum(!is.na(hapresults[[hb]]$hapdos[1,
            colnames(hapresults[[hb]]$hapdos) %in% rest]))
      }
      #data for all indiv together:
      # mrkall: how many indiv with complete marker data
      ovw[hb, lastFScol + 3] <-
        sum(!is.na(hapresults[[hb]]$mrkdids)) - totimputed
      # impall: how many indiv with imputed mrk genotypes
      ovw[hb, lastFScol + 4] <- totimputed
      # hapall: how many indiv haplotyped
      if (nrow(hapresults[[hb]]$hapdos) == 0) {
        ovw[hb, lastFScol + 5] <- 0
      } else {
        ovw[hb, lastFScol + 5] <-
        sum(!is.na(hapresults[[hb]]$hapdos[1,]))
      }
    } # message == ""
  } # for hb
  if (!FSsok)
    warning("hapresults were calculated without specifying FS families")
  list(ovw=ovw, messages=messages)
} # overviewByFS

calcPedstats <- function(pedchk, indiv, haploblocks) {
  # function used only in calcStatistics; for input and output see there
  if (missing(indiv) || is.null(indiv))
    indiv <- dimnames(pedchk$ped_arr)[[1]]
  indix <- match(indiv, dimnames(pedchk$ped_arr)[[1]])
  if (length(indix) == 0 || anyNA(indix))
    stop("names in indiv not found in pedchk")
  if (missing(haploblocks) || is.null(haploblocks)) {
    haploblocks <- dimnames(pedchk$ped_arr)[[3]]
  } else if (is.numeric(haploblocks)) {
    haploblocks <- dimnames(pedchk$ped_arr)[[3]][haploblocks]
  } else if (is.list(haploblocks)) {
    haploblocks <- names(haploblocks)
  }
  if (!all(haploblocks %in% dimnames(pedchk$ped_arr)[[3]]))
    stop("not all haploblocks occur in pedchk")
  hbix <- match(haploblocks, dimnames(pedchk$ped_arr)[[3]])
  if (length(hbix) == 0 || anyNA(hbix))
    stop("no selected haploblocks present in pedchk")
  pedstats <- data.frame(
    haploblock = haploblocks,
    stringsAsFactors = FALSE
  )
  pedstats$mrk <- apply(pedchk$ped_arr[indix, "mrk", hbix, drop=FALSE],
                        MARGIN=3, FUN=sum, na.rm=TRUE)
  pedstats$imp <- apply(pedchk$ped_arr[indix, "imp", hbix, drop=FALSE],
                        MARGIN=3, FUN=sum, na.rm=TRUE)
  pedstats$hap <- apply(pedchk$ped_arr[indix, "hap", hbix, drop=FALSE],
                        MARGIN=3, FUN=sum, na.rm=TRUE)
  pedstats$match.NA <- apply(is.na(pedchk$ped_arr[indix, "noDR", hbix, drop=FALSE]),
                             MARGIN=3, FUN=sum, na.rm=TRUE)
  pedstats$noDR.TRUE <- apply(pedchk$ped_arr[indix, "noDR", hbix, drop=FALSE],
                              MARGIN=3, FUN=sum, na.rm=TRUE)
  pedstats$noDR.FALSE <- apply(!(pedchk$ped_arr[indix, "noDR", hbix, drop=FALSE]),
                               MARGIN=3, FUN=sum, na.rm=TRUE)
  pedstats$withDR.TRUE <- apply(pedchk$ped_arr[indix, "withDR", hbix, drop=FALSE],
                                MARGIN=3, FUN=sum, na.rm=TRUE)
  pedstats$withDR.FALSE <- apply(!(pedchk$ped_arr[indix, "withDR", hbix, drop=FALSE]),
                                 MARGIN=3, FUN=sum, na.rm=TRUE)
  pedstats
}

calcFSstats <- function(ovwFS, haploblocks) {
  # function used only in calcStatistics; for input and output see there
  if (missing(haploblocks) || is.null(haploblocks)) {
    haploblocks <- rownames(ovwFS$ovw)
  } else if (is.numeric(haploblocks)) {
    haploblocks <- rownames(ovwFS$ovw)[haploblocks]
  } else if (is.list(haploblocks)) {
    haploblocks <- names(haploblocks)
  }
  if (!all(haploblocks %in% rownames(ovwFS$ovw)))
    stop("not all haploblocks occur in ovwFS")
  hbix <- match(haploblocks, rownames(ovwFS$ovw))
  if (length(hbix) == 0 || anyNA(hbix))
    stop("no selected haploblocks present in ovwFS")
  startcols <- which(substr(colnames(ovwFS$ovw), 1, 6) == "parmrk")
  FSstats <- data.frame(FSfam = 1:length(startcols))
  FSstats$bothparmrk <- colSums(ovwFS$ovw[hbix, startcols, drop=FALSE] == 2) #na.rm should not be needed!
  FSstats$FSFit <- colSums(ovwFS$ovw[hbix, startcols+1, drop=FALSE] == 1)
  FSstats$mrk <- colSums(ovwFS$ovw[hbix, startcols+3, drop=FALSE]) / length(hbix)
  FSstats$imp <- colSums(ovwFS$ovw[hbix, startcols+4, drop=FALSE]) / length(hbix)
  FSstats$hap <- colSums(ovwFS$ovw[hbix, startcols+5, drop=FALSE]) / length(hbix)
  startcols <- which(colnames(ovwFS$ovw) %in% c("mrkrest", "mrkall"))
  reststats <-
    data.frame(FSfam=c("rest", "all"), bothparmrk=c(NA, NA),
               FSFit=c(NA, NA),
               mrk=colSums(ovwFS$ovw[hbix, startcols, drop=FALSE]) / length(hbix),
               imp=c(0, sum(FSstats$imp)), #only imputation in FS families
               hap=c(sum(ovwFS$ovw[hbix, startcols[1]+1]),
                     sum(ovwFS$ovw[hbix, startcols[2]+2])) / length(hbix)
    )
  FSstats <- rbind(FSstats, reststats)
  rownames(FSstats) <- NULL
  list(FSstats=FSstats, haploblocks=haploblocks)
}

#'@title calculate some statistics of the solutions of all haploblocks
#'@description calculate some statistics of the solutions of all haploblocks
#'@usage calcStatistics(pedchk, ovwFS, indiv, haploblocks)
#'@param pedchk a list as generated by pedigreeHapCheck
#'@param ovwFS a list as produced by overviewFS, or NA if no FS families
#'available. If ovwFS is specified, pedchk and ovwFS should have
#'been calculated from the same underlying data
#'@param indiv missing, NULL or a vector of names of individuals. If present,
#'the component pedstats is calculated only over the individuals in the vector,
#'else over all individuals. Component FSstats is not affected by indiv
#'@param haploblocks missing, NULL or a vector of names or numbers of
#'haploblocks, or a list of haploblocks as in the inferHaplotypes input.
#'If present, components pedstats and FSstats are calculated only over the
#'specified haploblocks, else over all haploblocks.
#'@return a list with one or two items:\cr
#'pedstats: a data.frame with one row per haploblock, showing numbers
#'of individuals in columns\cr
#'- mrk: individuals with complete marker data\cr
#'- hap: individuals with an assigned haplotype combination\cr
#'- match.NA: individuals where a match between the haplotypes of the individual
#'  and its parents could not be verified (either the individual itself or both
#'  parents have no haplotype combination assigned)\cr
#'- noDR.TRUE: individuals whose haplotype combination matches that of its
#'  parents, assuming polysomic inheritance without Double Reduction\cr
#'- noDR.FALSE: individuals whose haplotype combination doesn't match that of its
#'  parents, assuming polysomic inheritance without Double Reduction\cr
#'- withDR.TRUE: individuals whose haplotype combination matches that of its
#'  parents, assuming polysomic inheritance, allowing Double Reduction\cr
#'- withDR.FALSE: individuals whose haplotype combination doesn't match that of
#'  its parents, assuming polysomic inheritance, allowing Double Reduction\cr
#'The total number of individuals is \cr
#'match.NA + noDR.TRUE + noDR.FALSE == match.NA + withDR.TRUE + withDR.FALSE;\cr
#'the number of individuals with missing marker data or without assigned
#'haplotype dosages can be calculated by subtracting mrk or hap from that
#'total.\cr
#'FSstats (only if ovwFS specified): a data.frame with for each FS family,
#'"rest" and "all" one row, with columns\cr
#'- pop: the FS family numbers, "rest" (all individuals not belonging to the
#'  FSs or their parents) and "all" (all individuals)\cr
#'- bothparmrk: for how many haploblocks have both parents complete marker data?\cr
#'- FSdone: for how many haploblocks has a solution based on polysomic
#'  inheritance been found for this FS family?\cr
#'- mrk: the average number of (FS) individuals with complete marker data\cr
#'- imp: the average number of (FS) individuals with imputed marker data\cr
#'- hap: the average number of (FS) individuals with an assigned haplotype
#'  combination\cr
#'The numbers for "all" include also the FS parents (each parent counted only
#'once, even if some parents produced more than one FS family), which are not
#'included in any of the other rows; therefore "all" is not equal to the sum of
#'the rows above.
#'@examples
#'data(PolyHaplotyper_small)
#'phchk <- pedigreeHapCheck(ped=phped, mrkDosage=phdos, haploblock=phblocks,
#'                           hapresults=phresults)
#'phovw <- overviewByFS(haploblock=phblocks, parents=phpar, FS=phFS,
#'                      hapresults=phresults)
#'calcStatistics(pedchk=phchk, ovwFS=phovw)
#'@export
calcStatistics <- function(pedchk, ovwFS, indiv, haploblocks) {
  if (missing(indiv)) indiv <- NULL
  if (missing(haploblocks)) haploblocks <- NULL
  if (missing(pedchk)) pedchk <- NULL else
  if (!is.list(pedchk) || !identical(names(pedchk), c("ped_arr", "parents_arr")))
    stop ("invalid pedchk")
  if (missing(ovwFS)) ovwFS <- NULL else
    if (!is.list(ovwFS) || !identical(names(ovwFS), c("ovw", "messages")))
      stop("invalid ovwFS")
  if (is.null(pedchk) && is.null(ovwFS))
    stop("no data")

  result <- list()
  if (!is.null(pedchk))
    result$pedstats <- calcPedstats(pedchk, indiv, haploblocks)
  if (!is.null(ovwFS)) {
    FSstats <- calcFSstats(ovwFS, haploblocks)
    result$FSstats <- FSstats$FSstats
  }
  if (!is.null(pedchk) && !is.null(ovwFS) &&
      !setequal(result$pedstats$haploblock, FSstats$haploblocks)) {
    cat("The haploblocks in pedchk and ovwFS do not match\n"); flush.console()
  }
  result
} #calcStatistics

#'@title produce a table of nr of markers vs nr of haplotypes
#'@description produce a table of nr of markers vs nr of haplotypes
#'@usage calcMrkHaptable(ovwFS)
#'@param ovwFS a list as produced by overviewFS. Only the first 2 columns of
#'list item ovw are used
#'@return a frequency table with the numbers of haploblocks, with
#'all combinations of marker counts and inferred haplotype counts per haploblock.
#'The column with haplotype count NA (if any) shows the haploblocks for which
#'no haplotype solution was found (the reason for that would usually be found
#'in column 1 of ovwFS$messages)
#'@examples
#'data(PolyHaplotyper_small)
#'phovw <- overviewByFS(haploblock=phblocks, parents=phpar, FS=phFS,
#'                      hapresults=phresults)
#'# in this small dataset there are only 2 haploblocks, each with 4 markers:
#'calcMrkHaptable(ovwFS=phovw)
#'# in both haploblocks 5 haplotypes are inferred
#'@export
calcMrkHaptable <- function(ovwFS){
  table(ovwFS$ovw[ ,1], ovwFS$ovw[, 2], useNA="ifany",
                     dnn=c("mrk", "haplotypes"))
} # calcMrkHaptable

#'@title convert a haploblock-defining data frame to a list
#'@description convert a haploblock-defining data frame to a list as needed
#'by inferHaplotypes
#'@usage haploblock_df2list(df, mrkcol, hbcol, sorted=TRUE)
#'@param df a data.frame with at least the following two columns
#'@param mrkcol the name or number of the column with the markers
#'@param hbcol the name or number of the column with the haploblocks
#'@param sorted if TRUE (default) the haploblock list is sorted in alphabetical
#'order of haploblock name (the markers within haploblocks are not sorted);
#'if FALSE the haploblocks will be in order of first occurrence in df
#'@details function inferHaplotypes needs a list where each item is a
#'vector of all markers in one haploblock. This function produces such a list
#'from a data.frame where the markers are in one column and the haploblocks
#'in another column. The markers and haploblocks columns may be character,
#'factor or numeric, and the columns may be indicated by name or number.
#'@return the desired list
#'@examples
#'df1 <- data.frame(
#'  marker=paste0("mrk",1:9),
#'  block=LETTERS[c(1,2,3,2,3,1,1,2,2)],
#'  extracol=runif(9))
#'haploblock_df2list(df=df1, mrkcol="marker", hbcol=2)
#'@export
haploblock_df2list <- function(df, mrkcol, hbcol, sorted=TRUE) {
  uhb <- unique(df[, hbcol])
  if (is.factor(df[, mrkcol])) df[, mrkcol] <- as.character(df[, mrkcol])
  lst <- vector(length(uhb), mode="list")
  for (hb in seq_along(uhb)) {
    lst[[hb]] <- df[df[, hbcol]==uhb[hb], mrkcol]
  }
  names(lst) <- as.character(uhb)
  if (sorted) lst <- lst[order(names(lst))]
  lst
  # split(df[,mrkcol], df[,hbcol]) would also work but:
  # - the haploblocks might be reordered if they are not alphabetically sorted
  # - if df[,mrkcol] is a factor, the returned list will also consist of
  #   factors rather than character vectors
} # haploblock_df2list

# .........................................................
# Functions to compare with other haplotyping software ####
# .........................................................

#'@title convert PolyHaplotyper marker data to ShesisPlus format
#'@description convert PolyHaplotyper marker data (input) to ShesisPlus
#'format for a single haploblock.
#'@usage make.ShesisPlus.input(mrkDosage, indiv=NULL, markers,
#'ploidy, phenotype=0, fname="")
#'@param mrkDosage matrix or data.frame of allele dosages; same as input for
#'inferHaplotypes. Markers are in rows, individuals in columns, each cell has
#'a marker dosage. All marker dosages must be in 0:ploidy or NA
#'@param indiv the names of the individuals to include in the ShesisPlus
#'input data. Default NULL includes all individuals
#'@param markers character vector with the names of the markers in the
#'haploblock; all must occur in mrkDosage
#'@param ploidy single integer: the ploidy level
#'@param phenotype vector with the phenotypes of all individuals, in order
#'of the columns of dosmat; default 0
#'@param fname filename of the output file: this will contain the data
#'in the ShesisPlus format (the saved data.frame is also the return value).
#'If "" (default) no file is written
#'@details ShesisPlus needs the data formatted as: 1 column with names of
#'individuals, 1 column with the phenotypes (with missing values represented
#'as "NA"), and for each of the selected markers <ploidy> columns,
#'with the (unphased) marker alleles. Here we use only biallelic markers;
#'their alleles are indicated by 1 and 2 for the ref and alt allele,
#'and 0 for missing data. \cr
#'The contents of file fname can be pasted into the input data box of
#'the ShesisPlus web interface at http://shesisplus.bio-x.cn/SHEsis.html
#'@return a data.frame in the described ShesisPlus input format
#'@examples
#'data(PolyHaplotyper_small)
#'SSPin <- make.ShesisPlus.input(mrkDosage=phdos, markers=phblocks[[1]],
#'                               ploidy=6)
#'SSPin[1:6,1:8]
#'@export
make.ShesisPlus.input <- function (mrkDosage, indiv=NULL, markers,
                                   ploidy, phenotype=0, fname="") {
  dm <- checkmrkDosage(mrkDosage, ploidy=ploidy, markers=markers) # also converts to matrix
  if (length(phenotype) == 1) phenotype <- rep(phenotype, ncol(dm))
  if (length(phenotype) != ncol(dm)) stop("Incorrect number of phenotypes")
  if (anyNA(phenotype)) phenotype[is.na(phenotype)] <- "NA"
  result <- matrix(numeric(0), nrow=ncol(dm), ncol=ploidy*nrow(dm),
                   dimnames=list(colnames(dm), rep(rownames(dm), each=ploidy)))
  for (m in seq_along(markers)) {
    # in order to use the hapdos2hapcomb function of PolyHaplotyper
    # we need two rows: one for the 1 and one for the 2 allele:
    mdat <- rbind(ploidy-dm[m,], dm[m,])
    result[, (ploidy*(m-1)+1):(ploidy*m)] <- t(hapdos2hapcomb(mdat, ploidy))
  }
  result[is.na(result)] <- 0 # ShesisPlus code for missing allele
  result <- cbind(phenotype, result)
  if (fname != "") write.table(result, file=fname, row.names=TRUE, col.names=FALSE,
                               quote=FALSE, sep=" ")
  result
}

#'@title Read the haplotyping results from the ShesisPlus output
#'@description Read the haplotyping results from the ShesisPlus output
#'@usage read.ShesisPlus.output(SSPout, order.by=c("", "hapnr", "count")[2])
#'@param SSPout filename of a text file with ShesisPlus output (as copied from
#'the web page produced by running ShesisPlus web interface
#'at http://shesisplus.bio-x.cn/SHEsis.html); or a character vector with the
#'same output
#'@param order.by how to order the rows of the hapstat data.frame. "hapnr" means
#'ordering by haplotype number, "count" means ordering by decreasing Total.count,
#'anything else results in no reordering. By default order by hapnr.
#'@details If present, the markernames and the haplotype statistics are read from
#'the file. ShesisPlus does not provide haplotype combinations for individuals
#'@return a list with 2 items: $markernames has the marker names in the markers in
#'the haploblock (if present in the file); hapstat contains the haplotype statistics
#'as read from the file with an additional (first) column hapnr: the haplotype
#'numbers as defined in PolyHaplotyper. The other columns are Haplotype (as sequences
#'of marker alleles), Total.count (of the haplotype over the whole population) and,
#'according to an email from Zhiqiang Li of 13-04-2020) BETA: Regression coefficient,
#'SE: Standard error, R2: Regression r-squared, T: t-distribution statistics,
#'P: p-value
#'@examples
#'# we give a typical SSP output as character vector; instead we could also
#'# give the name of a text file
#'SSPout <- c(
#'  " Please cite:",
#'  "",
#'  "    Shen, J. et al. SHEsisPlus, a toolset for genetic studies ...",
#'  "    Shi, Y. et al. SHEsis, a powerful software platform ...",
#'  "    Li, Z. et al. A partition-ligation-combination-subdivision ...",
#'  "",
#'  "if you find this tool useful in your research. Thanks!",
#'  "",
#'  "Haplotype Analysis:",
#'  "",
#'  "Haplotypes with frequency <0.03 are ignored.",
#'  "Loci chosen for haplotype analysis: m1, m2, m3, m4",
#'  "Haplotype 	Total count 	Beta 	SE 	R2 	T 	p",
#'  "1122 	2232 	0.002 	0.01 	1.01e-04 	0.252 	0.8",
#'  "1222 	230 	-0.019 	0.017 	0.002 	-1.15 	0.25",
#'  "1221 	152 	-0.01 	0.024 	2.94e-04 	-0.43 	0.667",
#'  "2222 	288 	0.008 	0.02 	2.93e-04 	0.429 	0.667",
#'  "1121 	142 	-0.009 	0.022 	2.81e-04 	-0.42 	0.674"
#')
#'read.ShesisPlus.output(SSPout)
#'@export
read.ShesisPlus.output <- function(SSPout, order.by=c("", "hapnr", "count")[2]) {
  if (missing(order.by) || is.null(order.by) || is.na(order.by)) order.by <- ""
  order.by <- order.by[1]
  fromfile <- length(SSPout)==1 && file.exists(SSPout)
  if (!fromfile && !(is.character(SSPout) && length(SSPout)>1))
    stop("SSPout must be a text file name or a character vector")
  if (fromfile) con <- file(SSPout) else con=textConnection(SSPout)
  txt <- readLines(con)
  i <- grep("Loci chosen for haplotype analysis:", txt, value=FALSE)
  markernames <- character(0)
  if (length(i) == 1) {
    # read marker names from line, after ": " and separated by ", "
    colon <- regexpr(": ", txt[i])
    commas <- gregexpr(", ", txt[i])[[1]]
    commas <- c(colon, commas, nchar(txt)[i]+1)
    markernames <- substring(txt[i], commas[1:(length(commas)-1)]+2,
                             commas[2:length(commas)]-1)
  }
  # next, get the haplotype statistics:
  i <- grep("Haplotype \tTotal count \tBeta \tSE \tR2 \tT \tp", txt, value=FALSE)
  if (length(i) != 1) stop("header line of haplotype statistics not found")
  # find out if there is an LD section after the statistics:
  LD <- grep("Linkage Disequilibrium Analysis:", txt, value=FALSE)
  if (length(LD) > 0) nrows <- LD[1] - i - 1 else nrows <- -1
  if (fromfile) con <- file(SSPout) else con <- textConnection(SSPout)
  hapstat <- read.table(con, header=TRUE, sep="\t", skip=i-1, nrows=nrows,
                        na.strings=c("NA", "-NA", "nan", "-nan", ""))
  close(con)
  if (order.by == "count") hapstat <- hapstat[order(hapstat$Total.count, decreasing=TRUE),]
  # check length of haplotypes:
  nc <- nchar(hapstat$Haplotype)
  if (min(nc) != max(nc)) {
    warning("Haplotypes have different lengths")
    hapstat <- cbind(rep(NA, nrow(hapstat)), hapstat)
    } else {
      if (length(markernames) == 0) {
        markernames <- paste0("m", seq_len(min(nc)))
      } else if (length(markernames) != min(nc)) {
        warning("markernames don't match length of haplotypes")
      }
      # translate haplotypes to haplotype numbers
      allhap <- apply(allHaplotypes(character(min(nc))) + 1, # in SSP we use alleles 1&2, not 0&1
                      MARGIN=1, FUN=paste, collapse="")
      hapstat <- cbind(match(hapstat$Haplotype, allhap), hapstat)
      if (order.by == "hapnr") hapstat <- hapstat[order(hapstat[,1]),]
    }
  names(hapstat)[1] <- "hapnr"
  list(hapstat=hapstat, markernames=markernames)
} #read.ShesisPlus.output

#'@title convert PolyHaplotyper marker data to SATlotyper format
#'@description convert PolyHaplotyper marker data (input) to SATlotyper
#'format for a single haploblock.
#'@usage make.SATlotyper.input(mrkDosage, indiv=NULL, markers,
#'ploidy, phenotype=0, fname)
#'@param mrkDosage matrix or data.frame of allele dosages; same as input for
#'inferHaplotypes. Markers are in rows, individuals in columns, each cell has
#'a marker dosage. All marker dosages must be in 0:ploidy or NA
#'@param indiv the names of the individuals to include in the ShesisPlus
#'input data. Default NULL includes all individuals
#'@param markers character vector with the names of the markers in the
#'haploblock; all must occur in mrkDosage
#'@param ploidy single integer: the ploidy level
#'@param phenotype vector with the phenotypes of all individuals, in order
#'of the columns of dosmat; default 0
#'@param fname filename of a *.csv output file: this will contain the data
#'in the SATlotyper format (the saved data.frame is also the return value).
#'If "" no file is written
#'@return a data.frame in the SATlotyper input format: a header row with
#'"Genotype" and the marker names, and one row per individual with the
#'individual name plus for each marker the genotype as a sorted
#'string of <ploidy> A's and B's, or <ploidy> N's
#'@examples
#'data(PolyHaplotyper_small)
#'SATin <- make.SATlotyper.input(mrkDosage=phdos, markers=phblocks[[1]],
#'                               ploidy=6, fname="")
#'head(SATin)
#'@export
make.SATlotyper.input <- function (mrkDosage, indiv=NULL, markers,
                                   ploidy, phenotype=0, fname) {
  sep <- "\t" # separator of the columns in the output file
  dm <- checkmrkDosage(mrkDosage, ploidy=ploidy, markers=markers) # also converts to matrix
  result <- matrix(NA_character_, nrow=ncol(dm), ncol=nrow(dm),
                   dimnames=list(colnames(dm), rownames(dm)))
  dos2str <- function(dosage, ploidy) {
    # e.g. (ploidy=4): dosage 3 -> ABBB, dosage 0 -> AAAA, dosage NA -> NNNN
    alleles <- c("A", "B")
    miss <- "N"
    if (is.na(dosage)) paste(rep(miss, ploidy), collapse="") else {
      str0 <- paste(rep(alleles[1], ploidy-dosage), collapse="")
      str1 <- paste(rep(alleles[2], dosage), collapse="")
      paste0(str0, str1)
    }
  }
  for (i in seq_len(ncol(result))) {
    result[, i] <- sapply(dm[i,], FUN=dos2str, ploidy=ploidy)
  }
  if (length(fname)==1 && !is.na(fname) && fname != "") {
    con <- file(fname)
    writeLines(paste("Genotype", paste(colnames(result), collapse=sep), sep=sep),
               con=con, sep="\n")
    close(con)
    write.table(result, file=fname, col.names=FALSE, row.names=TRUE, quote=FALSE,
                sep=sep, append=TRUE)
  }
  result
} #make.SATlotyper.input

#'@title read the haplotyping results from the SATlotyper output
#'@description read the haplotyping results from the SATlotyper output
#'@usage read.SATlotyper.output(fname, output="", allelecodes=c("A", "B"),
#'sep, haploblockname="")
#'@param fname filename of an xml file produced by SATlotyper
#'@param output character vector, console output of SATlotyper;
#'this contains a table of calculated haplotypes with more info
#'than the xml file
#'@param allelecodes the codes used for the ref and alt SNP alleles in
#'make.SATlotyper.input, default "A" and "B"
#'@param sep the separator used in make.SATlotyper.input (SATlotyper cleverly
#'identifies this separator and uses it for its output), default tab
#'@param haploblockname if not "" (default) the haplotype IDs are given as
#'<haploblockname>_<hapnr>, else just as <hapnr>, left_padded with zeroes
#'@details The xml file is parsed using the package XML. The resulting list has
#'3 items named source, bootstrapping and haplotyping. In this function source is
#'ignored and the results are obtained from haplotypings[[1]] (i.e. even if multiple
#'haplotypings were done only the first is extracted), with the haplotype
#'score from the bootstrapping item added to the haplotype.info.
#'@return a list with 3 items: \cr
#'$hapdos is a matrix with individuals in columns and haplotypes in rows,
#'giving the dosages of the haplotypes in each individual (summing to ploidy).
#'This is the same format as the hapdos components of the inferHaplotypes
#'results of PolyHaplotyper \cr
#'$haplotype.info is a data.frame that is a combination of information from two
#'or three tables:\cr
#'columns Haplotype and necessity are from element
#'haplotypings[[1]]$haplotypes in the xml file, \cr
#'column hapnr gives the PolyHaplotyper equivalents of the Haplotypes, \cr
#'column score is from element bootstrapping in the xml file, \cr
#'columns id, number, frequency, homozygous, necessary, distance and
#'neighbours are from (the console) output if present. \cr
#'The haplotype.info data.frame is ordered by hapnr. \cr
#'$HaplotypingScore is the single number in $haplotyping[[1]]$score.
#'
#'@export
read.SATlotyper.output <- function(fname, output="",
                                   allelecodes=c("A", "B"), sep,
                                   haploblockname="") {
  if (missing(sep)) sep <- "\t" #separator in input file
  getAllhapStrings <- function(nmrk, allelecodes){
    result <- character(2^nmrk)
    allhap <- allHaplotypes(seq_len(nmrk))
    result <- allhap #dim correct
    result[allhap==0] <- allelecodes[1]
    result[allhap==1] <- allelecodes[2]
    apply(result, MARGIN=1, paste, collapse="")
  } #getAllhapStrings in read.SATlotyper.result

  xpl <- XML::xmlToList(XML::xmlParse(file=fname))
  markernames <- strsplit(xpl$source[[1]], sep)[[1]][-1]
  names(markernames) <- NULL # all names are "genotype"
  nmrk <- length(markernames)
  allHapstrings <- getAllhapStrings(nmrk, allelecodes)
  HaplotypingScore <- xpl$haplotypings$haplotyping$score
  if (!anyNA(as.numeric(HaplotypingScore)))
    HaplotypingScore <- as.numeric(HaplotypingScore)
  usedhap2 <- t(sapply(xpl$haplotypings[[1]]$haplotypes, FUN=function(x) unlist(x)))
  usedhaplo <- data.frame(Haplotype=usedhap2[,1])
  usedhaplo$hapnr <- match(usedhaplo$Haplotype, allHapstrings)
  usedhaplo$necessity=usedhap2[,2] != "false"
  usedhaplo <- usedhaplo[order(usedhaplo$hapnr),]
  bootstrapping <- t(sapply(xpl$bootstrapping, FUN=function(x) unlist(x)))
  if (anyNA(as.numeric(bootstrapping[, 2]))) score <- bootstrapping[, 2] else
    score <- as.numeric(bootstrapping[, 2])
  usedhaplo$score <- score[match(bootstrapping[,1], usedhaplo$Haplotype)]
  hapnames <- padded(usedhaplo$hapnr, max(usedhaplo$hapnr))
  if (haploblockname[1] != "")  hapnames <- paste(haploblockname[1], hapnames, sep="_")
  strgeno <- sapply(xpl$haplotypings[[1]]$genotypes, FUN=function(x) unlist(x))
  hapgeno <- matrix(match(strgeno[-1,], allHapstrings), ncol=ncol(strgeno))
  if (any(is.na(hapgeno) & !is.na(strgeno[-1,])))
    stop("some assigned haplotypes not among used haplotypes")
  if (length(setdiff(usedhaplo$hapnr, unique(hapgeno))) > 0)
    stop("some used haplotypes not among assigned haplotypes")
  hapdos <- matrix(0, nrow=nrow(usedhaplo), ncol=ncol(hapgeno),
                   dimnames=list(hapnames, strgeno[1,]))
  for (i in seq_len(nrow(usedhaplo))) {
    hn <- usedhaplo$hapnr[i]
    hapdos[i,] <- apply(hapgeno, MARGIN=2, FUN=function(x) sum(x==hn, na.rm=TRUE))
  }
  names(colnames(hapdos)) <- NULL
  # finally we check if the table with extended haplotype info is
  # available from the console output:
  header <- grep("id | sequence |", output, fixed=TRUE)
  end <- grep("calculated phased genotypes", output, fixed=TRUE)
  if (length(header) == 1 && length(end) == 1) {
    if (end - header != nrow(usedhaplo) + 3) {
      warning("table calculated.haplotypes in output doesn't match haplotypes in xml file")
    } else {
      headers <- trimws(strsplit(output[header], " | ", fixed=TRUE)[[1]])
      valu <- trimws(t(simplify2array(strsplit(output[(header+2):(end-2)], " | ", fixed=TRUE))))
      # we assume here that the columns are always the same:
      if (length(headers) != ncol(valu) || length(headers) != 8) {
        warning("in output table calculated.genotypes has unexpected format")
      } else {
        numconv <- function(x)
          if (anyNA(suppressWarnings(as.numeric(x)))) x else as.numeric(x)
        xhapinfo <- data.frame(
          id=numconv(valu[,1]),
          sequence=valu[,2],
          number=numconv(valu[,3]),
          frequency=numconv(valu[,4]),
          homozygous=valu[,5]=="true",
          necessary=valu[,6]=="true",
          distance=numconv(valu[,7]),
          neighbours=numconv(valu[,8])
        )
        names(xhapinfo) <- headers # just in case they are different
        o <- match(xhapinfo$sequence, usedhaplo$Haplotype)
        if (anyNA(o)) {
          warning("table calculated.haplotypes in output doesn't match haplotypes in xml file")
        } else {
          usedhaplo <- cbind(usedhaplo, xhapinfo[o, -2])
        }
      }

    }
  }
  list(hapdos=hapdos, haplotype.info=usedhaplo, HaplotypingScore=HaplotypingScore,
       markers=markernames)
} #read.SATlotyper.output

#'@title A simple interface to run SATlotyper
#'@description A simple interface to run SATlotyper
#'@usage run.SATlotyper(path_to_SATlotyper, infile, outfile,
#'SAT_solver="sat4j.conf")
#'@param path_to_SATlotyper path to the folder where SATlotyper.jar and
#'the *.conf files of the SAT-solvers are located
#'@param infile name (and path) to the SATlotyper input file
#'@param outfile name (and path) for the SATlotyper output file (an xml file)
#'@param SAT_solver name of the *.conf file for the SAT-solver to use.
#'Default "sat4j.conf" because this works under both Windows and Linux.
#'@details This function issues a system command to invoke SATlotyper.
#'Java and SATlotyper must be installed. This is just a simple interface for
#'convenience; for more control run SATlotyper directly from the command window.
#'@return The return is a list with elements:\cr
#'$cmd : the command passed to the system\cr
#'$result: screen output from SATlotyper; this
#'contains some extra info not present in the output xml file)\cr
#'The main result is the outfile
#'@export
run.SATlotyper <- function(path_to_SATlotyper, infile, outfile,
                           SAT_solver="sat4j.conf") {
  cmd <- paste0('java -jar "', file.path(path_to_SATlotyper, 'SATlotyper.jar"'),
                ' -all -i "', infile, '" -o "', outfile, '"',
                ' -sat "', file.path(path_to_SATlotyper, SAT_solver), '"')
  # in R.Version 4.X this will be improved using "raw strings" so that
  # no \" sequences appear which prevent copy-pasting the cmd into
  # a Windows system prompt window
  args <- c(paste0('-jar "', file.path(path_to_SATlotyper, 'SATlotyper.jar"'),
            ' -all',
            paste0(' -i "', infile, '"'),
            paste0(' -o "', outfile, '"'),
            paste0(' -sat "', file.path(path_to_SATlotyper, SAT_solver), '"')))
  result <- system2(command="java", args=args, stdout=TRUE, stderr=TRUE)
  invisible(c(cmd, result))
}

#'@title convert PolyHaplotyper input data to Happy-inf format
#'@description convert PolyHaplotyper input data to Happy-inf format
#'for a single haploblock
#'@usage make.Happyinf.input(mrkDosage, indiv=NULL, haploblock,
#'ploidy, fname)
#'@param mrkDosage matrix or data.frame of allele dosages; same as input for
#'inferHaplotypes. Markers are in rows, individuals in columns, each cell has
#'a marker dosage. All marker dosages must be in 0:ploidy or NA
#'@param indiv the names of the individuals to include in the Happy-inf
#'input data. Default NULL includes all individuals
#'@param haploblock a list of character vectors. The names are the names of the
#'haploblocks, the character vectors have the names of the markers in each
#'haploblock. Only the markers in haploblock will be included in the
#'Happy-inf input data
#'@param ploidy single integer: the ploidy level
#'@param fname filename of a tab-separated output file: this will contain the
#'data in Happy-inf format (the saved data.frame is also the return value).
#'If "" no file is written
#'@return a data.frame in the Happy-inf input format: a header row with
#'"SNPID", "block" and names of the individuals and one row per marker, with
#'only the individuals, markers and blocks as specified.
#'#'SNPID has the marker names, "Block" the haploblock names. All markers
#'of a haploblock are in contiguous rows. Missing dosages
#'are represented by "NA"
#'@examples
#'data(PolyHaplotyper_small)
#'HAPin <- make.Happyinf.input(mrkDosage=phdos, haploblock=phblocks,
#'                             ploidy=6, fname="")
#'HAPin[,1:8]
#'@export
make.Happyinf.input <- function (mrkDosage, indiv=NULL, haploblock,
                                   ploidy, fname) {
  result <- checkmrkDosage(mrkDosage, ploidy=ploidy, indiv=indiv,
                           markers=unlist(haploblock))
  result <- data.frame(SNPID=rownames(result),
                       Block=rep(names(haploblock), times=sapply(haploblock, length)),
                       result,
                       check.names=FALSE,
                       stringsAsFactors=FALSE)
  rownames(result) <- NULL
  if (length(fname)==1 && !is.na(fname) && fname != "") {
    write.table(result, file=fname,
                row.names=FALSE, col.names=TRUE,
                sep="\t", na="NA", quote=FALSE)
  }
  invisible(result)
} #make.Happyinf.input

#'@title read the haplotyping results from the Happy-inf output
#'@description read the haplotyping results from the Happy-inf output
#'@usage read.Happyinf.output(file_prefix, dropUnused=TRUE)
#'@param file_prefix the prefix for the output files generated by
#'Happy-inf (the value of the -o parameter)
#'@param dropUnused TRUE (default) if the returned matrix should only contain
#'rows for haplotypes that are present; if FALSE matrix contains rows for all
#'possible haplotypes
#'@details This function reads the <file_prefix>.stat.dat and
#'<file_prefix>.stats.dat files. The first contains one row per SNP marker,
#'grouped per haploblock and <ploidy> columns per individual with the SNP
#'haplotypes, where SNP alleles are represented as 0 or 1, or NA for unknown.
#'For one SNP, the <ploidy> alleles in an individual are all known or all
#'unknown It is possible that of the different SNPs in a haploblock
#'some are known and some are not; in the conversion to PolyHaplotyper format
#'all these partially known haplotypes are made unknown.
#'The latter file has a statistics "mismatch" and "ratio" for each individual /
#'haploblock combination; for their meaning see the Happyinf readme file.
#'@return a list with 2 items: \cr
#'$hapdos is a matrix with individuals in columns and haplotypes in rows,
#'giving the dosages of the haplotypes in each individual (summing to ploidy).
#'This is the same format as the hapdos components of the inferHaplotypes
#'results of PolyHaplotyper except that \cr
#'$stats is a matrix with individuals in rows (matching dosmat) and two rows
#'named mismatch and ratio.
#'
#'@export
read.Happyinf.output <- function(file_prefix, dropUnused=TRUE) {
  dat <- read.table(paste0(file_prefix, ".haplotypes.dat"), header=TRUE,
                    sep="\t", stringsAsFactors=FALSE, check.names=FALSE,
                    na.strings="NA")
  stat <- read.table(paste0(file_prefix, ".stat.dat"), header=TRUE,
                     sep="\t", stringsAsFactors=FALSE, check.names=FALSE,
                     na.strings=c("NA", "nan"))
  block <- unique(dat$Block)
  indiv <- names(stat)[-(1:2)]
  ploidy <- (ncol(dat)-2) / length(indiv)
  if (names(dat)[1]!="Block" || names(stat)[1]!="Block" ||
      !identical(block, unique(stat$Block)) ||
      !identical(indiv, names(stat)[-(1:2)]))
    stop("haplotypes and stat files don't match")
  results <- vector(length=length(block), mode="list")
  names(results) <- block
  for (b in seq_along(results)) {
    # first hapdos:
    brows <- dat$Block==block[b]
    markers <- dat$SNP_ID[brows]
    nhap <- 2^length(markers)
    datb <- as.matrix(dat[brows, -(1:2)])
    # convert haplotypes to haplotype numbers:
    NAcols <- is.na(colSums(datb))
    datb[, NAcols] <- NA
    print(paste(b, block[b])); flush.console()
    hapnrs <- mrkdos2mrkdid(mrkDosage=datb, indiv=NULL, ploidy=1, check=FALSE)
    #hapnrs is now a vector of hapnrs, of length ploidy * length(indiv)
    hapdos <- hapcomb2hapdos(hapcomb=matrix(hapnrs, nrow=ploidy, byrow=FALSE),
                             nhap=nhap)
    rownames(hapdos) <- paste0(block[b], "_", padded(1:nhap))
    colnames(hapdos) <- indiv
    if (dropUnused) {
      hapdos <-
        hapdos[rowSums(hapdos, na.rm=TRUE) > 0, , drop=FALSE]
    }
    results[[b]] <- list(hapdos=hapdos, markers=markers)
    # then stats:
    brows <- stat$Block==block[b]
    results[[b]]$stats <- as.matrix(stat[brows, -(1:2)])
    rownames(results[[b]]$stats) <- stat$type[brows] # mismatch and ratio
  }
  results
} # read.Happyinf.output

#'@title compare two haplotyping results
#'@description compare two haplotyping results, e.g. PolyHaplotyper and
#'SATlotyper
#'@usage compareHapresults(haploblock, hapresultsA, hapresultsB)
#'@param haploblock a list of character vectors. The names are the names of the
#'haploblocks, the character vectors have the names of the markers in each
#'haploblock.
#'@param hapresultsA and
#'@param hapresultsB two list as returned by inferHaplotypes,
#'with one item (itself a list) per haploblock with at least a matrix hapdos
#'and a character vector markers. All
#'haploblocks in param haploblock must occur in hapresultsA and in hapresultsB.
#'The individual names (colnames of the hapdos items for each haploblock)
#'must be identical and in the same order in hapresultsA and hapresultsB
#'@return a list with one element per haploblock in param haploblock.
#'Each element is itself a list with elements:
#'$identical: TRUE or FALSE\cr
#'$message: a single string, "" if the comparison is possible, else the reason
#'why not (if $message is not "", $identical is always FALSE). The next
#'elements are only present if $message is "":\cr
#'$compindiv: a matrix comparing the two hapdos, with one column per individual
#'and 5 rows: Both_NA, A_NA, B_NA, Equal, Uneq. The last 2 have NA values if A
#'and/or B is NA
#'$haplofreq: a matrix with one row per haplotype occurring in A and/or B,
#'and columns A and B, with the total frequency of each haplotype in
#'hapdos A or hapdos B
#'@export
compareHapresults <- function(haploblock, hapresultsA, hapresultsB) {
  if (!all(names(haploblock) %in% names(hapresultsA)) ||
      !all(names(haploblock) %in% names(hapresultsB)))
    stop("haploblocks in hapresultsA or hapresultsB don't match haploblock")
  if (ncol(hapresultsA[[1]]$hapdos) != ncol(hapresultsB[[1]]$hapdos) ||
      !all(colnames(hapresultsA[[1]]$hapdos) == colnames(hapresultsB[[1]]$hapdos)))
    stop("individuals in hapresultsA and hapresultsB not identical")
  compresults <- vector("list", length(haploblock))
  names(compresults) <- names(haploblock)
  for (hb in names(haploblock)) {
    compresults[[hb]]$identical <- FALSE
    if (!all(haploblock[[hb]] == hapresultsA$markers) ||
        !all(haploblock[[hb]] == hapresultsB$markers)) {
      compresults[[hb]]$message <-
        "markers in hapresultsA or hapresultsB different from haploblock"
      break
    }
    A <- hapresultsA[[hb]]$hapdos
    B <- hapresultsB[[hb]]$hapdos
    ploidy <- setdiff(unique(c(colSums(A, na.rm=TRUE), colSums(B, na.rm=TRUE))),
                      0)
    if (length(ploidy) != 1) {
      compresults[[hb]]$message <- "not all haplotype dosages sum to ploidy"
      break()
    }
    compresults[[hb]]$message <- ""
    A <- A[rowSums(A, na.rm=TRUE) > 0,, drop=FALSE]
    Ahap <- split_hapnames(rownames(A))$hapnrs
    B <- B[rowSums(B, na.rm=TRUE) > 0,, drop=FALSE]
    Bhap <- split_hapnames(rownames(B))$hapnrs
    # make Anw and Bnw as A and B, but containing all haplotypes in either
    x <- setdiff(Bhap, Ahap)
    xm <- matrix(rep(ploidy-colSums(A), length(x)), byrow=TRUE, ncol=ncol(A), nrow=length(x),
                 dimnames=list(rownames(B)[Bhap %in% x], colnames(A)))
    Anw <- rbind(A, xm)
    Anw <- Anw[order(c(Ahap, x)),,drop=FALSE]
    x <- setdiff(Ahap, Bhap)
    xm <- matrix(rep(ploidy-colSums(B), length(x)), byrow=TRUE, ncol=ncol(B), nrow=length(x),
                 dimnames=list(rownames(A)[Ahap %in% x], colnames(B)))
    Bnw <- rbind(B, xm)
    Bnw <- Bnw[order(c(Bhap, x)),,drop=FALSE]
    compresults[[hb]]$compIndiv <-
      matrix(NA, nrow=5, ncol=ncol(Anw),
             dimnames=list(c("Both_NA", "A_NA", "B_NA", "Equal", "Uneq"),
                           colnames(Anw)))
    compresults[[hb]]$compIndiv["Both_NA",] <- is.na(colSums(Anw)) & is.na(colSums(Bnw))
    compresults[[hb]]$compIndiv["A_NA",] <- is.na(colSums(Anw)) & !is.na(colSums(Bnw))
    compresults[[hb]]$compIndiv["B_NA",] <- !is.na(colSums(Anw)) & is.na(colSums(Bnw))
    compresults[[hb]]$compIndiv["Equal",] <- colSums(Anw != Bnw) == 0
    compresults[[hb]]$compIndiv["Uneq",] <- colSums(Anw != Bnw) > 0
    compresults[[hb]]$identical <-
      sum(compresults[[hb]]$compIndiv["Both_NA",]) +
      sum(compresults[[hb]]$compIndiv["Equal",], na.rm=TRUE) == ncol(Anw)
    compresults[[hb]]$haplofreq <-
      matrix(NA_integer_, nrow=nrow(Anw), ncol=2,
             dimnames=list(rownames(Anw), c("A", "B")))
    compresults[[hb]]$haplofreq[,"A"] <- rowSums(Anw, na.rm=TRUE)
    compresults[[hb]]$haplofreq[,"B"] <- rowSums(Bnw, na.rm=TRUE)
  } # for hb
  compresults
} # compareHapresults

#'@title compare haplotyping results with observed markers dosages
#'@description compare haplotyping results with observed markers dosages
#'@usage compareHapMrkDosages(mrkDosage, hapresults)
#'@param mrkDosage a data.frame or matrix with the marker dosages, in
#'the format of inferHaplotypes()
#'@param hapresults a list as returned by inferHaplotypes,
#'with one item (itself a list) per haploblock with at least a matrix hapdos
#'and a character vector markers. All markers in all haploblocks must also
#'occur in mrkDosages, and all individuals in the hapdos matrices must also
#'occur in mrkDosages
#'@return a 3-D array with dimensions haploblock, individual, and chkresult:
#'mrkNA (ALL markers in the haploblock have missing data for the individual),
#'hapNA (the haplotype dosages do not sum to ploidy and/or are missing),
#'match (TRUE if the non-missing marker dosages match the haplotype dosages,
#'FALSE is there is a conflict, NA if mrkNA and/or hapNA are TRUE)
#'Each element is itself a list with elements:
#'@examples
#'data(PolyHaplotyper_small)
#'chmd <- compareHapMrkDosages(mrkDosage=phdos, hapresults=phresults)
#'# show results for first haploblock, first 8 individuals:
#'chmd[1, 1:8,]
#'@export
compareHapMrkDosages <- function(mrkDosage, hapresults) {
  if (!all(sapply(hapresults,
                  FUN=function(x) all(c("hapdos", "markers") %in% names(x)))))
    stop("all elements of hapresults must at least have components hapdos and markers")
  hblist <- lapply(hapresults, FUN=function(x) x$markers)
  ploidy <- max(sapply(hapresults,
                       FUN=function(x) max(colSums(x$hapdos), na.rm=TRUE)))
  mrkDosages <- checkmrkDosage(mrkDosage=mrkDosage, ploidy=ploidy,
                               indiv=colnames(hapresults[[1]]$hapdos),
                               markers=unlist(hblist))
  # now with columns in same order as the hapdos columns
  result <- array(NA, dim=c(length(hapresults), ncol(mrkDosage), 3),
                  dimnames=list(names(hapresults), colnames(mrkDosage),
                                c("mrkNA", "hapNA", "match")))
  for (hb in names(hapresults)) {
    mrkdosobs <- mrkDosage[hblist[[hb]],, drop=FALSE]
    nomrk <- colSums(!is.na(mrkdosobs)) == 0
    mrkdoshap <- hapdos2mrkdos(hapdos=hapresults[[hb]]$hapdos,
                               allhap=allHaplotypes(hblist[[hb]]))
    nohap <- colSums(hapresults[[hb]]$hapdos, na.rm=TRUE) != ploidy
    conflict <- mrkdoshap != mrkdosobs
    result[hb,, "mrkNA"] <- nomrk
    result[hb,, "hapNA"] <- nohap
    m <- !nomrk & !nohap & colSums(!is.na(conflict)) > 0
    result[hb, m, "match"] <-
      colSums(conflict[, m, drop=FALSE], na.rm=TRUE) == 0
  }
  result
} #compareHapMrkDosages

#'@title convert the PedigreeSim true haplotypes to marker dosages and
#'haplotype dosages
#'@usage pedigreeSim2PH(ps_geno, haploblock, indiv=NULL, dropUnused=TRUE)
#'@param ps_geno the filename of a *_genotypes.dat file as produced by
#'PedigreeSim, or a data.frame read from such a file (with a column marker,
#'followed by <ploidy> columns per individual with the marker alleles for each
#'homolog; the names of these columns must be <indivname>_<homolog number>).
#'ps_geno should not contain missing data (this will be true unless the
#'PedigreeSim output is midified)
#'@param haploblock a list of character vectors. The names are the names of the
#'haploblocks, the character vectors have the names of the markers in each
#'haploblock. Haplotype names are constructed from the haploblock names, which
#'are used as prefixes to which the (zero-padded) haplotype numbers are are
#'appended with separator '_'.\cr
#'All markers in haploblock must be present in ps_geno. If haploblock is NULL
#'only the marker dosages are generated, not the haplotype dosages.
#'@param indiv the names of the individuals to be extracted; default NULL: all
#'individuals in ps_geno. If not NULL, all indiv must be present in ps_geno
#'@param dropUnused TRUE (default) if the returned matrix should only contain
#'rows for haplotypes that are present; if FALSE matrix contains rows for all
#'possible haplotypes
#'@details if all alleles are in 0/1, these are kept. If only 2 allele symbols
#'occur in ps_geno all are converted (alphabetically) to 0/1 in
#'the same way (e.g. if the alleles are A and B, A -> 0 and B -> 1, even in
#'markers with only B's). If different allele symbols are used between
#'markers, per marker the (alphabetically or numerically) lowest -> 0 and
#'the highest -> 1.
#'So if more than two different allele symbols occur in ps_geno, and
#'one marker has only A and another only B alleles, both are converted
#'to 0's.\cr
#'in mrkDosage, the dosage of the alleles (converted to) 1 is reported
#'@return a list with 2 items:\cr
#'$mrkDosage: a matrix of marker dosages in the input format of inferHaplotypes,
#'with all markers that occur in haploblock, sorted according to haploblock,
#'and for the selected indiv in the specified order\cr
#'$haplist: (only if haploblock is not NULL) a list with one element for each
#'haploblock, similar to the
#'inferHaplotypes output). Each element is itself a list, with two components:\cr
#'$hapdos is a matrix with the haplotype dosages for that haploblock for each
#'individual\cr
#'$markers:  a vector with the names of
#'the markers in the haploblock in the output of inferHaplotypes
#'@export
pedigreeSim2PH <- function(ps_geno, haploblock, indiv=NULL, dropUnused=TRUE) {
  if (is.character(ps_geno))
    ps_geno <- read.table(ps_geno[1], header=TRUE, sep="\t",
                               check.names=FALSE, stringsAsFactors=FALSE)
  if (is.data.frame(ps_geno)) {
    mrk <- ps_geno$marker
    ps_geno <- as.matrix(ps_geno[, -1])
    rownames(ps_geno) <- mrk
  }
  # now ps_geno is a matrix; check against non-biallelic markers
  # and get the allele symbols per marker:
  # (we do this over all indiv in ps_geno, not only for the selected indiv
  #  because we must know about all allele symbols per marker to be consistent
  #  between different selctions of individuals)
  alleles <- apply(ps_geno, MARGIN=1, FUN=function(x) sort(unique(x)))
  if (is.list(alleles) && any(sapply(alleles, length) > 2))
    stop("ps_geno contains markers with more than 2 alleles")
  if (is.matrix(alleles))
    alleles <- as.data.frame(alleles, stringsAsFactors=FALSE)
  all_alleles <- sort(unique(unlist(alleles)))
  # reduce ps_geno to only the requested markers and individuals:
  if (!(is.null(haploblock))) {
    if (!all(unlist(haploblock) %in% rownames(ps_geno)))
      stop("not all markers in haploblock occur in ps_geno")
    ps_geno <- ps_geno[unlist(haploblock),, drop=FALSE]
  }
  # get indiv names and ploidy and select specified indiv:
  us <- gregexpr("_", colnames(ps_geno))
  last_ <- sapply(us, FUN=function(x) x[length(x)]) # the last "_" in each indiv name
  PSindiv <- unique(substr(colnames(ps_geno), 1, last_-1))
  ploidy <- ncol(ps_geno) / length(PSindiv)
  if (is.null(indiv)) indiv <- PSindiv
  if (!all(indiv %in% PSindiv))
    stop("Not all indiv found in PedigreeSim genotypes file")
  # keep only the specified indiv in the specified order:
  indix <- match(indiv, PSindiv)
  colix <- (indix-1) * ploidy + 1 # the 1st column in ps_geno for each indiv
  colix <- rep(colix, each=ploidy) + (0:(ploidy-1))
  ps_geno <- ps_geno[, colix, drop=FALSE]
  # convert the allele symbols to 0/1:
  if (!all(all_alleles %in% 0:1)) {
    if (length(all_alleles) == 1) ps_geno[] <- 0 else {
      # all markers have the same 2 allele symbols, these are translated
      # to 0/1 identically for all markers, even in markers that have only one
      # of them
      if (length(all_alleles) ==2) {
        newalleles <- (0:1)[match(ps_geno, all_alleles)]
        newalleles <- matrix(newalleles, nrow=nrow(ps_geno),
                             dimnames=dimnames(ps_geno))
      } else {
        # allele symbols differ between markers but all are mono- or bi-allelic
        newalleles <- matrix(NA_integer_, nrow=nrow(ps_geno),
                             ncol=ncol(ps_geno),
                             dimnames=dimnames(ps_geno))
        for (m in seq_len(nrow(newalleles)))
          newalleles[m,] <- (0:1)[match(ps_geno[m,], alleles[[m]])]
      }
      ps_geno <- newalleles; rm(newalleles)
    }
  }
  rm(alleles, all_alleles)
  # get mrkDosage:
  mrkDosage <- ps_geno
  dim(mrkDosage) <- c(nrow(mrkDosage), ploidy, length(indiv))
  mrkDosage <- apply(mrkDosage, MARGIN=c(1,3), sum)
  dimnames(mrkDosage) <- list(rownames(ps_geno), indiv)
  result <- list(mrkDosage=mrkDosage)
  # get list of hapdos:
  if (!is.null(haploblock)) {
    haplist <- vector("list", length(haploblock))
    names(haplist) <- names(haploblock)
    for (hb in seq_along(haploblock)) {
      nhap <- 2^length(haploblock[[hb]])
      dat <- ps_geno[haploblock[[hb]],, drop=FALSE]
      hapnrs <- mrkdos2mrkdid(mrkDosage=dat, indiv=NULL, ploidy=1, check=FALSE)
      #hapnrs is now a vector of hapnrs, of length ploidy * length(indiv)
      hapdos <- hapcomb2hapdos(hapcomb=matrix(hapnrs, nrow=ploidy, byrow=FALSE),
                               nhap=nhap)
      rownames(hapdos) <- paste0(names(haploblock)[hb], "_", padded(1:nhap))
      colnames(hapdos) <- indiv
      if (dropUnused) {
        hapdos <-
          hapdos[rowSums(hapdos, na.rm=TRUE) > 0, , drop=FALSE]
      }
      haplist[[hb]] <- list(hapdos=hapdos, markers=haploblock[[hb]])
    }
    result$haplist <- haplist
  }
  result
}

Try the PolyHaplotyper package in your browser

Any scripts or data that you put into this service are public.

PolyHaplotyper documentation built on June 17, 2021, 5:12 p.m.