R/fitPolyTools.R
In fitPoly: Genotype Calling for Bi-Allelic Marker Assays

Documented in calcRstats calcSegtypeInfo checkF1 checkFilename combineFiles compareProbes concatbatch correctDosages drawXYplots expandUnknownParents F1Dosages2Matrix getBatchFiles get.genocol leftstr listSegtypes makeFitPolyFiles readAxiomSummary readDatfile readFullDataTable removeRedundant rightstr samplestats scores2wide segtypeInfoSummary selMarkers_byR selMarkers_qall selSegtypeInfo splitNrenameSamples writeDatfile writeDosagefile XY_plot

#'fitPolyTools: a package with functions related to package fitPoly
#'
#'Package fitPolyTools offers some functions for preparing and selecting
#'input data for package fitPoly, and for inspection and analysis
#'of polyploid dosage calls as produced by fitPoly. These analyses focus
#'on the segregation of dosages in an F1 progeny of two parents. The functions
#'can be grouped as follows:
#'
#'@section data preparation functions:
#'Functions readFullDataTable and readAxiomSummary convert files generated by
#'Illumina's GenomeStudio and Affymetrix Power Tools software to the input
#'format for fitPoly::fitPoly and fitPoly::fitMarkers.
#'
#'@section data visualization functions:
#'Functions XYplot and drawXYplots draw scatterplots of the samples for one
#'marker at a time, possibly with colors indicating the assigned dosages and/or
#'highlighting specific samples. Function combineFiles combines X and Y columns
#'from one file or data.frame with column geno from another.
#'
#'@section segregation type functions:
#'The functions calcSegtypeInfo, selSegtypeInfo and segtypeInfoSummary
#'produce information on the possible F1 segregation types. These functions
#'are primarily developed for use in other functions but are also available
#'for users.
#'
#'@section identification of the probable segregation types:
#'Function checkF1 gives for all markers the most likely (best fitting)
#'segregation type, based on the F1 and the parental samples. Function
#'correctDosages checks for possibly "shifted" markers (e.g. all dosages
#'scored one lower or higher than the true dosage).
#' Function checkF1 can apply shifts when specified.
#'
#'@section combining data from two probes for the same SNP:
#'On Affymetrix Axiom SNP arrays it is possible to have the same SNP
#'interrogated from both sides by two different probes. Function compareProbes
#'compares the results from both probes and produces a new, merged marker if
#'they are sufficiently similar. It also writes the results to a data file
#'suitable for linkage mapping functions (package polymapR).\cr
#'Function removeRedundant takes the result of compareProbes; if a merged
#'marker is present it removes the corresponding separate probe markers.
#'
#'@section producing a data file for mapping:
#'Function writeDosagefile produces a data file suitable for linkage mapping
#'functions (package polymapR). This data file has the same format as that
#'produced by compareProbes and removeRedundant (see previous section).\cr
#'Function F1Dosages2Matrix reads such a dosages file and returns a matrix
#'in the format required by the polymapR package.
#'
#'@section reformatting of fitPoly scores file:
#'Function scores2wide reformats the scores file produced by the
#'fitMarkers function of the fitPoly package to "wide" format.
#'
#'@section recovery functions:
#'Functions getBatchfiles and concatbatch allow to recover results from already
#'processed batches of markers, in case fitMarkers crashes.
#'
#'@section general tools:
#'These tools may be useful by themselves and are therefore made available too.
#'They include leftstr and rightstr: two functions that return substrings from
#'the left or right side, and function checkFilename which tests if a file
#'with the name can be created. Functions readDatfile and writeDatfile are
#'wrappers for read.table and write.table with different default options,
#'suitable for the file formats expected or produced by functions in this
#'package oand fitPoly.
#'
#'@docType package
#'@name fitPolyTools
NULL

# Roeland Voorrips, Wageningen UR - Plant Breeding

# fitPolyTools contains functions for
# - conversion of data files from array software formats to fitPoly input format
# - drawing XY-scatterplots with or without genotype dosage colors
# - checking dosage scores against expected F1 segregation patterns
# - detecting and correcting "shifted" markers
# - combining scores from two probes (assays) of the same SNP
# - writing genotype files for F1 populations
# - recovering data from temporary files of fitPoly::fitMarkers
#   in case it crashes

#'@importFrom grDevices col2rgb dev.off dev.list png rgb
#'@importFrom graphics layout lines plot points
#'@importFrom stats chisq.test pbinom quantile
#'@importFrom utils read.table write.table

#'@title A data set containing SNP array data
#'@description A data frame containing SNP array signal intensity data
#'@details XYdat contains the following columns:
#'\itemize{
#'  \item MarkerName: factor, names of the markers (SNPs)
#'  \item SampleName: factor names of the individuals / DNA samples
#'  \item X: numeric, the signal intensities of the X allele
#'  \item Y: numeric, the signal intensities of the Y allele
#'  \item R: numeric, the total signal intensities; R = X + Y
#'  \item ratio: numeric, the ratios of the Y signal to the total signal
#'  intensity; ratio = Y / R
#'}
#'@name XYdat
#'@usage data(XYdat)
#'@docType data
NULL

#'@title A data set with dosage scores generated by fitPoly
#'@description A data frame as saved by function fitMarkers in package
#'fitPoly
#'@details scores contains the following columns:
#'\itemize{
#'  \item marker: numeric, the sequential number of the markers in the
#'  fitMarkers input file
#'  \item MarkerName: factor, names of the markers (SNPs)
#'  \item SampleName: factor names of the individuals / DNA samples
#'  \item ratio: numeric, the ratios of the Y signal to the total signal
#'  intensity
#'  \item P0 ... P4: numeric, the probabilities that the observed signal ratio
#'  reflects a true Y allele dosage of 0 ... 4, according to the selected
#'  mixture model
#'  \item maxgeno: numeric, the dosage with the maximum probability
#'  \item maxP: numeric, the maximum probability, i.e. the probability of maxgeno
#'  \item geno: numeric, equal to maxgeno if maxP was above the specified threshold
#'  (p.threshold=0.99), else NA
#'}
#'@name scores
#'@usage data(scores)
#'@docType data
NULL


# Read data from SNP arrays **************************************************
#'@title convert a GenomeStudio FullDataTable file to the import format
#'for fitPoly
#'@description A GenomeStudio file in wide format (samples side-by-side) is
#'converted to a fitPoly input file in long format
#'@usage readFullDataTable(filename, rawXY=FALSE,
#'markergroups=list(), out, filetype=c("dat","RData")[2])
#'@param filename name of a FullDataTable tab-separated text file exported from
#'Illumina's GenomeStudio. The file must contain a column "Name" with the
#'marker names, and for each sample a pair of columns "sample.X" and "sample.Y"
#'if rawXY is FALSE, or "sample.X raw" and "sample.Y raw" (note the space)
#'if rawXY is TRUE. Further columns may be present but are not read.
#'@param rawXY if FALSE (default) the normalized .X and .Y columns are read;
#'if TRUE the "raw" columns (.X raw and .Y raw) are read instead.
#'@param markergroups a list with character vectors of marker names, or integer
#'vectors of marker numbers in file order. If the data set is large, the
#'conversion to long format may exceed memory limits. In these cases the data
#'can be split into marker groups that are converted separately and each saved
#'to a separate file. If the list is empty (default) all markers are converted
#'as one block.
#'@param out the name of an output file (without extension). If a list of
#'markergroups is given, out must be a valid file name (without extension);
#'in that case multiple output files are created with filenames in which
#'the list element numbers are appended to out. If no markergroups are specified
#'out may also be set to "" or NA; in that case no file is created and the
#'converted data are only returned as function result.\cr
#'If out is not "" or NA, then also a file <out>_meanR.dat is saved with for
#'all samples their mean R value and  number of missing data (over all markers)
#'@param filetype either "dat" or "RData" (default): the former produces
#'tab-separated text files, the latter saves RData files with the converted
#'data in a data frame with name "dat".
#'@details The wide-format input is converted to a long-format form with
#'columns MarkerName, SampleName, X, Y, R (= X + Y) and ratio (= Y / R).
#'The X and Y signal intensities are obtained from the <sample>.X and <sample>.Y
#'columns in the input data (or from the <sample>.X raw and <sample>.Y raw
#'columns if rawXY is TRUE). R and ratio are calculated from these values
#'and not read from the input data.
#'@return If no markergroups are specified, a data.frame is returned with
#'columns MarkerName, SampleName, X and Y (also if raw data are read, the
#'column names are X and Y), R (= X + Y), ratio (= Y / (X+Y) ).\cr
#'If a list of markergroups is specified the function result is NULL
#'and the converted data.frames are only saved as files.\cr
#'If the saved files are RData files, they all contain one data.frame named
#'"dat".
#'@export
readFullDataTable <- function(filename, rawXY=FALSE,
                              markergroups=list(),
                              out, filetype=c("dat","RData")[2]) {
  #Here, in contrast to readAxiomSummary, we don't allow to pass a data.frame
  #instead of the filename, because we need to decide which columns to read

  #test input:
  if (!is.list(markergroups)) {
    stop("markergroups is not list")
  }
  if (missing(out)) stop("error in readFullDataTable: out must be specified")
  if (length(out) != 1) stop("out must be one filename or NA")
  if (!is.na(out)) {
    if (!is.character(out)) {
      stop("out must be one filename")
    }
    if (is.na(out) || out == "") {
      out <- NA
      if (length(markergroups) > 0) {
        stop("when markergroups is not empty, out must be a valid filename")
      }
    } else {
      if (!check.filename(paste(out,"714x996y8.xxx",sep="_"))) {
        stop("out is not a valid filename")
      }
    }
  }

  filetype <- tolower(filetype)
  if (!is.na(out) && filetype != "dat" && filetype != "rdata") {
    stop("filetype must be 'dat' or 'RData'")
  }
  #first we read the caption line, to select the columns to read and to
  #get the unmodified sample names:
  captions <- scan(file=filename, what="character", allowEscapes=F, nlines=1,
                   quiet=TRUE, sep="\t")
  namecol <- which(captions == "Name")
  nch <- nchar(captions)
  if (rawXY) {
    right <- substr(captions, nch-5, nch)
    Xcols <- which(right == ".X raw")
    Ycols <- which(right == ".Y raw")
    samplenames <- substr(Xcols, 1, nchar(Xcols) - 6)
    if (length(Xcols) == 0 && length(Ycols) == 0) {
      #try again without space
      right <- substr(captions, nch - 4, nch)
      Xcols <- which(right == ".Xraw")
      Ycols <- which(right == ".Yraw")
      samplenames <- substr(Xcols, 1, nchar(Xcols) - 5)
    }
    if (length(Xcols) == 0 || length(Ycols) == 0 ||
        length(Xcols) != length(Ycols)) {
      stop("for each sample columns 'sample.X raw' and 'sample.Y raw' must be present")
    }
  } else {
    right <- substr(captions, nch-1, nch)
    Xcols <- which(right == ".X")
    Ycols <- which(right == ".Y")
    samplenames <- substr(captions[Xcols], 1, nch[Xcols] - 2)
    if (length(Xcols) == 0 || length(Ycols) == 0 ||
        length(Xcols) != length(Ycols)) {
      stop("for each sample columns 'sample.X' and 'sample.Y' must be present")
    }
  }
  minXY <- min(Ycols - Xcols)
  maxXY <- max(Ycols - Xcols)
  if (maxXY != minXY) {
    stop("columns must be in the same order for each sample")
  }
  if (minXY > 0) XY <- c("X", "Y") else XY <- c("Y", "X")
  #next we read the selected columns:
  colClasses <- rep("NULL", length(captions))
  colClasses[namecol] <- "character"
  colClasses[c(Xcols, Ycols)] <- "numeric"
  dat_wide <- readDatfile(filename, colClasses=colClasses)
  dat_wide <- dat_wide[dat_wide[1] != "",]
  #convert to long format:
  Rfile <- paste(out,"meanR.dat", sep="_")
  na.count <- function(x) { sum(is.na(x)) }
  if (length(markergroups) == 0) {
    #no groups, all markers in one go:
    result <- wide2longGS(GSdata=dat_wide, samplenames, XY,
                          out, rdata=(filetype=="rdata"))
    if (!is.na(out)) {
      sumsR <- tapply(result$R, INDEX=result$SampleName, FUN=sum, na.rm=TRUE)
      nasR <- tapply(result$R, INDEX=result$SampleName, FUN=na.count)
      meanR <- data.frame(SampleName=names(sumsR),
                         meanR=sumsR / (nrow(dat_wide) - nasR),
                         missing=nasR)
      writeDatfile(meanR, Rfile)
    }
    return(invisible(result))
  } else {
    #convert per marker group:
    sumsR <- NULL
    for (mg in seq_along(markergroups)) {
      if (is.numeric(markergroups[[mg]])) {
        #markergroups[[mg]] has the row numbers of this group
        subdat <- dat_wide[markergroups[[mg]],]
      } else {
        #markergroups[[mg]] has the marker names of this group
        subdat <- dat_wide[dat_wide[[1]] %in% markergroups[[mg]],]
      }
      fname <- paste(out, padded(mg, length(markergroups)), sep="_")
      dat <- wide2longGS(GSdata=subdat, samplenames, XY,
                         fname, rdata=(filetype=="rdata"))
      if (is.null(sumsR)) {
        sumsR <- tapply(dat$R, INDEX=dat$SampleName, FUN=sum, na.rm=TRUE)
        nasR <- tapply(dat$R, INDEX=dat$SampleName, FUN=na.count)
      } else {
        sumsR <- rowSums(cbind(sumsR,
                               tapply(dat$R, INDEX=dat$SampleName, FUN=sum, na.rm=TRUE)),
                         na.rm=TRUE)
        nasR <- nasR + tapply(dat$R, INDEX=dat$SampleName, FUN=na.count)
      }
    }
    meanR <- data.frame(SampleName=names(sumsR),
                        meanR=sumsR/(length(unlist(markergroups))-nasR),
                        missing=nasR)
    writeDatfile(meanR, Rfile)
    if (length(result) > 0) {
      if (!is.null(names(markergroups))) {
        names(result) <- names(markergroups)
      }
      return(invisible(result))
    } else return(invisible(NULL))
  }
} #readFullDataTable

wide2longGS <- function(GSdata, samplenames, XY,
                        outputfilename, rdata) {
  #helper function for readFullDataTable, converts one wide data frame to
  #one long data frame.
  #GSdata: a wide data frame from a GenomeStudio FullDataTable, containing
  #only the markers in this group with X and Y signal intensities (marker names
  #in column 1, 2 columns with X and Y signals per sample)
  #samplenames: all sample names in order of columns of GSdata
  #XY: c("X", "Y") or c("Y", "X"), depending on order per sample
  #outputfilename: the complete output file name, or NA
  #rdata: TRUE or FALSE
  #Function result: the same data as a long data frame, with columns R and ratio

  #we want:
  #MarkerName SampleName   X   Y
  #M1         S1         111 112
  #M2         S1         211 212
  #M3         S1         311 312
  #M1         S2         121 122
  #M2         S2         221 222
  #M3         S2         321 322

  #from GenomeStudio we have:
  #marker S1.X S1.Y S2.X S2.Y
  #M1      111  112  121  122
  #M2      211  212  221  222
  #M3      311  312  321  322

  #so we take 2 columns at a time and put them under each other
  #(Note that the result of wide2longGS is sorted first by SampleName
  # and within sample by MarkerName, in the result of wide2longAxiom
  # it is the other way round)

  markernames <- GSdata[,1]
  GSdata <- as.matrix(GSdata[,-1])
  nmrk <- nrow(GSdata)
  nsamp <- ncol(GSdata) / 2
  if (length(samplenames) != nsamp) stop("wide2longGS: samplenames don't match GSdata")
  dat <- matrix(0.0, nrow=nmrk * nsamp, ncol=2)
  if (nmrk > 0 && nsamp > 0) {
    for (s in 1:nsamp) {
      dat[(1 + (s-1) * nmrk) : (s * nmrk),] <-
        GSdata[, (1 + (s-1) * 2) : (s * 2)]
    }
  }
  dat <- data.frame(
    MarkerName=rep(markernames, times=nsamp),
    SampleName=rep(samplenames, each=nmrk),
    dat)
  #if Y appears before X, exchange:
  names(dat)[3:4] <- XY
  if (XY[2] == "X") dat <- dat[,c(1, 2, 4, 3)]
  dat$R <- dat$X + dat$Y
  dat$ratio <- dat$Y / dat$R
  #write file:
  if (!is.na(outputfilename)) {
    if (rdata) {
      save(dat, file=paste(outputfilename, ".RData", sep=""))
    } else {
      writeDatfile(dat, file=paste(outputfilename, ".dat", sep=""))
    }
  }
  dat
} #wide2longGS

#'@title convert an Affymetrix AxiomCT1.summary file to the import format
#'for fitPoly
#'@description An Affymetrix AxiomCT1.summary.txt file in wide format (samples
#'side-by-side) is converted to a fitPoly input file in long format
#'@usage readAxiomSummary(AXdata, markergroups=list(),
#'out, filetype=c("dat", "RData")[2])
#'@param AXdata name of an AxiomCT1.summary.txt tab-separated text file
#'exported from Affymetrix Power Tools (APT); or a data.frame in the same
#'format:\cr
#'the file may start with a number of comment lines (starting with "#")
#'followed by a header line of which the first caption is "probeset_id" and all
#'other captions are sample names, all ending with ".CEL". For each marker there
#'are two lines, with probeset_id ending in "-A" and "-B"; the two lines for
#'each marker are consecutive. In addition there may be any number of control
#'markers, for each of which there is only one line, with probeset_id ending
#'in "-A".
#'@param markergroups a list with character vectors of marker names, or integer
#'vectors of marker numbers in file order. If the data set is large, the
#'conversion to long format may exceed memory limits. In these cases the data
#'can be split into marker groups that are converted separately and each saved
#'to a file (or returned as list items, but if that is possible splitting into
#'groups would not be needed)
#'@param out the name of an output file (without extension). If a list of
#'markergroups is given, out must be a valid file name (without extension);
#'in that case multiple output files are created with filenames in which
#'the list element numbers are appended to out. If no markergroups are specified
#'out may also be set to "" or NA; in that case no file is created and the
#'converted data are only returned as function result.\cr
#'If out is not "" or NA, then also a file <out>_meanR.dat is saved with for
#'all samples their mean R value and  number of missing data (over all markers)
#'@param filetype either "dat" or "RData" (default): the former produces
#'tab-separated text files, the latter saves RData files with the converted
#'data in a data frame with name "dat"
#'@details The wide-format input is converted to a long-format form with
#'columns MarkerName, SampleName, X, Y, R (= X + Y) and ratio (= Y / R).
#'The X and Y signal intensities are obtained from the <marker>-A and <marker>-B
#'rows in the input data, R and ratio are calculated from these values.
#'@return If no markergroups are specified, a data.frame is returned with
#'columns MarkerName, SampleName, X, Y, R (= X + Y), ratio (= Y / (X+Y) ).\cr
#'If a list of markergroups is specified the function result is NULL
#'and the converted data.frames are only saved as files.\cr
#'If the saved files are RData files, they all contain one data.frame named
#'"dat".
#'@export
readAxiomSummary <- function(AXdata,
                             markergroups=list(),
                             out,
                             filetype=c("dat", "RData")[2]) {
  #test input:
  if (!is.list(markergroups)) {
    stop("markergroups is not list")
  }
  if (missing(out)) stop("out must be specified")
  if (length(out) != 1) stop("out must be one filename or NA")
  if (!is.na(out)) {
    if (!is.character(out)) {
      stop("out must be one filename")
    }
    if (is.na(out) || out == "") {
      out <- NA
    } else {
      if (!check.filename(paste(out,"714x996y8.xxx",sep="_"))) {
        stop("out is not valid")
      }
    }
  }
  filetype <- tolower(filetype)
  if (!is.na(out) && filetype != "dat" && filetype != "rdata") {
    stop("error in readAxiomSummary: filetype must be 'dat' or 'RData'")
  }
  if (is.character(AXdata)) {
    #filename: read file, trim ".CEL" from sample names
    #find the caption line and read sample names
    #(dashes and other unallowed characters in names are converted to dots in
    #read.table)
    con <- file(AXdata, "r")
    repeat {
      s <- readLines(con, n=1)
      if (is.null(s) || length(s)==0) {
        stop("could not read AXdata file, or end reached without finding caption line")
      }
      s <- strsplit(s, split="\t")
      if (length(s) > 0) {
        s <- gsub("(^ +)|( +$)", "", s[[1]], fixed=TRUE)
        #    a.o. remove leading & trailing whitespace
        if (nchar(s[1]) > 0 && substr(s[1], 1, 1) != "#") break;
      }
    }
    close(con)
    if (s[1] != "probeset_id") {
      stop("caption line doesn't start with probeset_id")
    }
    #remove the .CEL suffixes from the sample names if they are there:
    samplenames <- s[2:length(s)]
    nc <- nchar(samplenames)
    cel <- substr(samplenames,nc-3, nc) == ".CEL"
    samplenames[cel] <- substr(samplenames[cel], 1, nc-4)

    #read the total file, skipping the initial comment lines:
    #knowing markercount does not help because of the presence of very many
    #control markers (DQC) after the last real marker
    dat_wide <- read.table(AXdata, comment.char="#", sep="\t", header=TRUE,
                           nrows=-1)
    dat_wide[,1] <- as.character(dat_wide[,1])
    dat_wide <- dat_wide[dat_wide[,1]!="",]
    nc <- nchar(dat_wide$probeset_id)
    if (anyNA(nc) || any(nc < 3))
      stop("some data lines have a probeset_id shorter than 3 characters")
    lastchar <- substring(dat_wide$probeset_id, nc)
    # old version:
    #  # exclude the DQC lines (the block of lines with probeset_id's all ending
    #  # with "-A"):
    #  dif <- lastchar[1:(length(lastchar)-1)] != lastchar[2:length(lastchar)]
    #  # at the bottom: if ending in -A it is a DQC, else the B-line of a real mrk:
    #  dif[length(lastchar)] <- lastchar[length(lastchar)] == "B"
    #  # when a block of DQC follows a block of real markers: ok
    #  # when a block of real markers follows a block of DQC: also ok
    #  dat_wide <- subset(dat_wide, substr(probeset_id, 1, 4) != "DQC-")
    #  dat_wide <- dat_wide[dif,]; lastchar <- lastchar[dif]#
    # new version 20190221: don't assume that the DQC lines are all at the
    # bottom;
    # instead, look for all probeset_id with -B that have the same probeset_id
    # with -A on the preceding line:
    Blines <- setdiff(which(lastchar == "B"), 1)
    Alines <- Blines - 1
    Alines <- Alines[lastchar[Alines] == "A"]
    Blines <- Alines + 1
    probnames <- substr(dat_wide$probeset_id, 1, nc-2) # without -A or -B suffix
    if (!all(probnames[Alines] == probnames[Blines]))
      stop("some A/B line pairs have a different probeset name")
    dat_wide <- dat_wide[sort(c(Alines, Blines)),]
    dat_wide <- dat_wide[substr(dat_wide$probeset_id, 1, 4) != "DQC-",]
    rm(lastchar, probnames, Alines, Blines)
  } else if (is.data.frame(AXdata)) {
    if (names(AXdata)[1] != "probeset_id")
      stop("invalid AXdata data frame")
    #have (only) the correct rows been read?
    ok <- FALSE
    if (nrow(AXdata) %% 2 == 0) {
      nm <- as.character(AXdata[,1]); nc <- nchar(nm)
      arows <- seq(1, nrow(AXdata), 2)
      ok <- all(substr(nm[arows],nc[arows], nc[arows]) == "A") &&
        all(substr(nm[arows+1], nc[arows+1], nc[arows+1]) == "B")
      rm(nm, arows)
    }
    if (!ok) stop("AXdata should only contain alternating A and B rows")
    dat_wide <- AXdata
    samplenames <- names(dat_wide)[-1]
  } else stop("readAxiomSummary: AXdata must be file name or data frame")

  #test the markernames:
  if (nrow(dat_wide) %% 2 != 0) {
    stop("readAxiomSummary: odd number of marker lines")
  }
  nmrk <- nrow(dat_wide) / 2
  anames <- as.character(dat_wide[,1][rep(c(TRUE, FALSE), times=nmrk)])
  nch <- nchar(anames)
  err <- which(substr(anames, nch-1, nch) != "-A")
  if (length(err) != 0) stop("readAxiomSummary: not all odd probeset-id end with '-A'")
  anames <- substr(anames, 1, nch-2) #strip off the "-A"
  bnames <- as.character(dat_wide[,1][rep(c(FALSE, TRUE), times=nmrk)])
  nch <- nchar(bnames)
  err <- which(substr(bnames, nch-1, nch) != "-B")
  if (length(err) != 0) stop("readAxiomSummary: not all even probeset-id end with '-B'")
  bnames <- substr(anames, 1, nch-2) #strip off the "-B"
  err <- which (anames != bnames)
  if (length(err) != 0) stop("readAxiomSummary: not all even probeset-id's are paired")
  markernames <- anames; rm(nch, err, anames, bnames)

  Rfile <- paste(out,"meanR.dat", sep="_")
  na.count <- function(x) { sum(is.na(x)) }

  #convert to long format:
  #samplenames <- names(dat_wide[2:length(dat_wide)])
  if (length(markergroups) == 0) {
    #no groups, all markers in one go:
    result <- wide2longAxiom(AXdata=dat_wide, markernames, samplenames,
                             out, rdata=(filetype=="rdata"))
    if (!is.na(out)) {
      sumsR <- tapply(result$R, INDEX=result$SampleName, FUN=sum, na.rm=TRUE)
      nasR <- tapply(result$R, INDEX=result$SampleName, FUN=na.count)
      meanR <- data.frame(SampleName=names(sumsR),
                          meanR=sumsR/(length(markernames)-nasR),
                          missing=nasR)
      writeDatfile(meanR, Rfile)
    }
    return(invisible(result))
  } else {
    #convert per marker group:
    sumsR <- NULL
    result <- list()
    for (mg in seq_along(markergroups)) {
      if (is.numeric(markergroups[[mg]])) {
        #markergroups[[mg]] has the row numbers of this group
        mrknrs <- sort(c(2 * markergroups[[mg]] - 1,
                         2 * markergroups[[mg]]))
        subdat <- dat_wide[mrknrs,]
        submarkernames <- markernames[sort(markergroups[[mg]])]
      } else {
        #markergroups[[mg]] has the marker names of this group
        mrknrs <- which(markernames %in% markergroups[[mg]])
        submarkernames <- markernames[mrknrs] #now in order of markernames
        mrknrs <- sort(c(2 * mrknrs - 1, 2 * mrknrs))
        subdat <- dat_wide[mrknrs,]
      }
      fname <- paste(out, padded(mg, length(markergroups)), sep="_")
      dat <- wide2longAxiom(AXdata=subdat, submarkernames, samplenames,
                            fname, rdata=(filetype=="rdata"))
      if (is.null(sumsR)) {
        sumsR <- tapply(dat$R, INDEX=dat$SampleName, FUN=sum, na.rm=TRUE)
        nasR <- tapply(dat$R, INDEX=dat$SampleName, FUN=na.count)
      } else {
        sumsR <- rowSums(cbind(sumsR,
                               tapply(dat$R, INDEX=dat$SampleName, FUN=sum, na.rm=TRUE)),
                         na.rm=TRUE)
        nasR <- nasR + tapply(dat$R, INDEX=dat$SampleName, FUN=na.count)
      }
    }
    meanR <- data.frame(SampleName=names(sumsR),
                        meanR=sumsR/(length(unlist(markergroups))-nasR),
                        missing=nasR)
    writeDatfile(meanR, Rfile)
    return(invisible(NULL))
  }
} #readAxiomSummary


wide2longAxiom <- function(AXdata, markernames, samplenames,
                           outputfilename, rdata) {
  #helper function for readAxiomSummary, converts one wide data frame to
  #one long data frame.
  #AXdata: data frame or matrix in wide format. If a data frame, then in format
  #      of an AxiomGT1.summary file, containing probeset_id in first column
  #      and samples in the other columns; for each probe only the markers
  #      in this group with X and Y signal intensities (2 rows per
  #      SNP, probeset_id-A or -B in column 1 (not used, use markernames instead)
  #      If matrix: same format but without the probeset_id column
  #samplenames: all sample names in order of columns of AXdata
  #outputfilename: the complete output file name, or NA
  #rdata: TRUE or FALSE
  #Function result: the same data as a long data frame, with columns R and ratio

  #we want:
  #MarkerName SampleName   X   Y
  #M1         S1         111 112
  #M1         S2         121 122
  #M2         S1         211 212
  #M2         S2         221 222
  #M3         S1         311 312
  #M3         S2         321 322

  #from Axiom we have:
  #probeset_id  S1   S2
  #M1-A        111  121
  #M1-B        112  122
  #M2-A        211  221
  #M2-B        212  222
  #M3-A        311  321
  #M3-B        312  322

  #so we take 2 rows at a time, t(ranspose) them and put them under each other
  #(Note that the result of wide2longAxiom is sorted first by MarkerName
  # and within marker by SampleName, in the result of wide2longGS
  # it is the other way round)

  if (is.data.frame(AXdata))
    AXdata <- as.matrix(AXdata[,-1])
  nmrk <- nrow(AXdata) / 2
  nsamp <- ncol(AXdata)
  if (length(markernames) != nmrk) stop("wide2longAxiom: markernames don't match AXdata")
  if (length(samplenames) != nsamp) stop("wide2longAxiom: samplenames don't match AXdata")
  dat <- matrix(0.0, nrow=nmrk * nsamp, ncol=2)
  colnames(dat) <- c("X", "Y")
  if (nmrk > 0 && nsamp > 0) {
    for (r in 1:nmrk) {
      dat[(1 + (r-1) * nsamp) : (r * nsamp),] <-
        t(AXdata[(1 + (r-1) * 2) : (r * 2), ])
    }
  }
  dat <- data.frame(
    MarkerName=rep(markernames, each=nsamp),
    SampleName=rep(samplenames, times=nmrk),
    dat)
  dat$R <- dat$X + dat$Y
  dat$ratio <- dat$Y / dat$R
  #write file:
  if (!is.na(outputfilename)) {
    if (rdata) {
      save(dat, file=paste(outputfilename, ".RData", sep=""))
    } else {
      writeDatfile(dat, file=paste(outputfilename, ".dat", sep=""))
    }
  }
  dat
} #wide2longAxiom

removeNArows <- function(x, column, blanks=TRUE) {
  #x is a data.frame, column is one column number
  #return value: the same data.frame with the rows removed where the
  #column has NA or (if blanks is TRUE) only whitespace.
  #If this column is a factor or character, it will be returned as
  #a character with no starting or trailing whitespace
  if (blanks) {
    if (is.factor(x[,column])) x[,column] <- as.character(x[,column])
    if (is.character(x[,column])) {
      x[,column] <- gsub("(^ +)|( +$)", "", x[,column], fixed=TRUE)
      x[x[,column] == "", column] <- NA
    }
  }
  x[!is.na(x[,column]),]
}

check.sampletable <- function(sampletable,
                              SampleID, CustomerID, Ploidy,
                              SampleID_levels) {
  #used by splitNrenameSamples
  #sampletable: a data frame with at least the columns for SampleID and
  #             CustomerID, and for Ploidy if this is not NULL
  #SampleID, CustomerID, Ploidy: the actual names of these columns in
  #          sampletable, or NULL (only allowed for Ploidy)
  #SampleID_levels: the levels of SampleID that occur in the marker data;
  #                 all of these must also occur in the SampleID column of
  #                 sampletable
  #return value: a string (an error message) if there are problems,
  #              else a vector with the column numbers of SampleID, CustomerID,
  #              Ploidy in sampletable
  sampidcol <- which(names(sampletable) == SampleID)
  if (length(sampidcol) != 1)
    return("sampletable should have one SampleID column")
  custidcol <- which(names(sampletable) == CustomerID)
  if (length(custidcol) != 1)
    return("sampletable should have one CustomerID column")
  #do all sample names in marker data also occur in SampleID column of sampletable?
  SampleID_levels <- as.character(SampleID_levels)
  sampletable[,sampidcol] <- as.character(sampletable[,sampidcol])
  if (anyNA(match(SampleID_levels, sampletable[, sampidcol])))
    return("not all SampleNames in dat occur in sampletable column SampleID")
  if (is.null(Ploidy)) {
    ploidycol <- NA
  } else {
    ploidycol <- which(names(sampletable) == Ploidy)
    if (length(ploidycol) != 1)
      return("sampletable should have one Ploidy column")
    sampletable <- removeNArows(sampletable, ploidycol)
  }
  if (sum(is.na(sampletable[, c(sampidcol, custidcol)])) > 0)
    return("missing data in SampleID or CustomerID columns")
  #are all SampleID and CustomerID for sampletable unique?
  sampletable[, custidcol] <- as.character(sampletable[, custidcol])
  if (length(unique(sampletable[, sampidcol])) != nrow(sampletable))
    return("duplicate names in SampleID column")
  if (length(unique(sampletable[, custidcol])) != nrow(sampletable))
    return("duplicate names in CustomerID column")
  result <- c(sampidcol, custidcol, ploidycol)
  names(result) <- c("sampidcol", "custidcol", "ploidycol")
  result
} #check.sampletable

#'@title Rename samples from array codes to user codes, and split diploid
#'from polyploid samples
#'@description After conversion of array data to fitPoly format, this function
#'replaces the array codes for the samples to user codes, and it splits the
#'data into a part for diploid and one for polyploid samples if both are present.
#'@usage splitNrenameSamples(dat, sampletable, SampleID, CustomerID,
#'Ploidy=NULL, out, filetype=c("dat","RData")[2])
#'@param dat a data frame in "long" format as returned by readFullDataTable or
#'readAxiomSummary
#'@param sampletable a data frame with at least the columns for SampleID,
#'CustomerID and Ploidy (these columns may have any name, specified by the
#'next parameters)
#'@param SampleID the title of the column in sampletable containing the codes
#'for the samples in the array file - there must be a one-to-one relation
#'between the array codes and the user codes!
#'@param CustomerID the title of the column in sampletable containing the
#'user codes for the samples - there must be a one-to-one relation between the
#'array codes and the user codes!
#'@param Ploidy the title of the column in sampletable containing the ploidy
#'of the samples; default NULL means no split based on ploidy must be done. If not
#'NULL, for each different ploidy a separate data.frame will be created and
#'samples with ploidy missing or "" will be omitted from all files.
#'@param out the base filename of the output files; to this will be appended the
#'the ploidy level(s) if Ploidy is not NULL, and an extension .dat or .RData,
#'depending on filetype). Setting out to NA or "" results in no output file(s)
#'being created
#'@param filetype either "dat" or "RData" (default): the former produces tab-
#'separated text files (one for each element of the return value),
#'the latter saves one RData file containing a list named dat, identical
#'to the return value
#'@return if Ploidy not specified, the original dat data.frame with substituted
#'SampleNames; if Ploidy specified a list with one
#'element for each different value in the Ploidy column; the names of the
#'elements are then these ploidy values. Each element is a subset of the
#'original dat data.frame containing only the rows with samples of that ploidy,
#'and with substituted SampleNames
#'@export
splitNrenameSamples <- function(dat, sampletable,
                                SampleID, CustomerID, Ploidy=NULL,
                                out, filetype=c("dat","RData")[2]) {
  if (!(leftstr(tolower(filetype), 3) %in% c("dat", "rda")))
    stop("invalid filetype")
  samplecols <-
    check.sampletable(sampletable, SampleID, CustomerID, Ploidy,
                      SampleID_levels=as.character(unique(dat$SampleName)))
  if (is.character(samplecols))
    stop(samplecols)
  sampcol <- samplecols[1]; customercol <- samplecols[2]; ploidycol <- samplecols[3]
  if (missing(out)) stop("out must be specified")
  if (length(out) != 1) stop("out must be one filename or NA")
  if (!is.na(out)) {
    if (!is.character(out)) {
      stop("out must be one filename")
    }
    if (is.na(out) || out == "") {
      out <- NA
    } else {
      if (!check.filename(paste(out,"714x996y8.xxx",sep="_"))) {
        stop("out is not valid")
      }
    }
  }

  if (!is.na(ploidycol)) {
    sampletable <- removeNArows(sampletable, ploidycol)
    ploi <- sort(unique(sampletable[, ploidycol]))
    nas <- anyNA(ploi)
    ploi <- ploi[!is.na(ploi)]
    cat(paste("sampletable contains ploidy levels", paste(ploi, collapse=" ")))
    if (nas) cat(" and also contains missing ploidy values")
    cat("\n")
    cat("the samples will be split per ploidy value")
    if (nas) cat(" and samples without ploidy will be omitted")
    cat("\n")
    # reply <- readline(prompt="Proceed (Y/N) ?")
    # reply <- tolower((reply))
    # if (!(reply %in% c("y", "yes"))) stop("splitNrenameSamples: user abort")
  }
  #checks against NA or duplicates in SampleID and CustomerID are already done in
  #check.sampletable, and also that all dat$SampleName values are in SampleID

  #rename the dat$SampleName from SampleID to CustomerID:
  dat$SampleName <- factor(dat$SampleName)
  sampnames <- levels(dat$SampleName)
  levels(dat$SampleName) <-
    sampletable[match(sampnames, sampletable[, sampcol]), customercol]

  #split different ploidy levels:
  if (is.null(Ploidy)) {
    #no split
    result <- dat
  } else {
    result <- list()
    for (p in seq_along(ploi)) {
      samp <- sampletable[sampletable[,samplecols[3]] == ploi[p], samplecols[2]]
      result[[p]] <- dat[dat$SampleName %in% samp,]
      result[[p]]$SampleName <- factor(result[[p]]$SampleName) #drop unused levels
      names(result)[p] <- ploi[p]
    }
  }

  dat <- result
  #save file(s):
  if (!is.na(out)) {
    if (leftstr(tolower(filetype), 3) == "rda") {
      save(dat, file=paste(out, "RData", sep="."))
    } else {
      for (p in seq_along(dat)) {
        fname <- paste(out, "_", names(result)[p], ".dat", sep="")
        writeDatfile(dat[[p]], fname)
      }
    }
  }
  invisible(dat)
} #splitNrenameSamples

# check distribution of R per SNP *********************************************

#'@title Calculate statistics of R per marker
#'@description Calculate the min, max, mean and 50% and 95% quantiles of R
#'for each marker
#'@usage calcRstats(datpoly, out)
#'@param datpoly a data frame in long format (input format for fitPoly)
#'with at least columns MarkerName, SampleName, R
#'@param out name of output file; if NA no file is written. File contains
#'the same data as the return value
#'@details The data.frame returned by this function is used as input for
#'function selMarkers_byR
#'@return a data frame with columns MarkerName, mean, min, q50, q95, max
#'which are all statistics of the R values per marker
#'@export
calcRstats <- function(datpoly, out) {
  stats <- function(x) { #function stats nested within calcRstats
    y <- x[!is.na(x)]
    result <- c(mean(y), quantile(y, probs=c(0, 0.50, 0.95, 1), names=FALSE))
  }

  Rdist <- tapply(datpoly$R, INDEX=datpoly$MarkerName, FUN=stats)
  Rdist <- do.call(rbind, Rdist)
  colnames(Rdist) <- c("mean", "min", "median", "q95", "max")
  Rdist <- data.frame(MarkerName=rownames(Rdist), Rdist)
  rownames(Rdist) <- NULL
  if (!is.na(out) && out!="")
    writeDatfile(Rdist, out)
  invisible(Rdist)
} #calcRstats

#'@title Select markers at specified R levels
#'@description Select markers based on their R statistics, for studying
#'the relation between R level and marker quality
#'@usage selMarkers_byR(Rstats, Rlevels, mrkperlevel=1, stat="q95")
#'@param Rstats a data frame as returned by calcRstats
#'@param Rlevels a vector of R levels in increasing order (preferably from
#'the minimum to the maximum of stat over all markers in Rstats)
#'@param mrkperlevel number, default 1: the number of markers to select at
#'each of the levels in Rlevels
#'@param stat the name of one of the statistics columns in Rstats
#'@details The return value of this function is intended to be used for
#'studying the relation between the marker quality and the value of the
#'chosen R statistic, e.g. by drawing XY-plots of each of the selected
#'markers. By finding a suitable threshold for the R statistic bas markers
#'could be excluded from evaluation by fitPoly, saving time because bad
#'markers take the longest to be scored and are then often rejected anyway.
#'@return a selection from Rstats, in ascending order of column stat.
#'The first <mrkperlevel> markers with stat above each Rlevel are returned;
#'duplicated selections are removed (so the number of returned markers may be
#'less than length(Rlevels) * mrkperlevel)
#'@export
selMarkers_byR <- function(Rstats, Rlevels, mrkperlevel=1, stat="q95") {
  #Rstats: a data frame as returned by calcRstats
  #Rlevels: a numeric vector with the levels of R at which markers should be
  #selected
  #mrkperlevel: the number of markers to select per level
  #stat: the column name of the R statistic to use
  #return value: a selection from Rstats, in ascending order of column stat.
  #              The first <mrkperlevel> markers with stat above each Rlevel
  #              are returned; duplicated selections are removed (so the
  #              number of returned markers may be less than
  #              length(Rlevels) * mrkperlevel)
  statcol <- which(names(Rstats) == stat)
  if (length(statcol) != 1)
    stop(paste("column ", stat, "missing"))
  Rstats <- Rstats[order(Rstats[,statcol]),]
  Rlevels <- sort(Rlevels)
  firstsel <- integer(0)
  for (i in 1:length(Rlevels)) {
    seli <- which(Rstats[,statcol] >= Rlevels[i])
    if (length(seli) == 0) firstsel[i] <- NA else
    firstsel[i] <- min(seli)
  }
  sel <- integer(0)
  for (i in 1:mrkperlevel) sel <- c(sel, firstsel + i - 1)
  sel <- sort(sel[!is.na(sel)])
  sel <- sel[sel <= nrow(Rstats)]
  sel <- sel[!duplicated(sel)]
  Rstats[sel,]
} #selMarkers_byR

# XY-plotting *****************************************************************

get.genocol.old <- function() {
  #return value: a set of 6 pastel colors to represent dosages,
  #from light red (nulliplex) to light blue (quadruplex) and grey for missing
  c(rgb(red=255,green=170,blue=170,maxColorValue=255), #1: light red
    rgb(red=255,green=211,blue=155,maxColorValue=255), #2: "burlywood1", lightred-orange
    rgb(red=250,green=187,blue=255,maxColorValue=255), #3: light bluish magenta
    rgb(red=152,green=245,blue=255,maxColorValue=255), #4: "cadetblue1", light bluish cyan
    rgb(red=180,green=180,blue=255,maxColorValue=255), #5: light blue
    "grey")                                            #6: NA
} #get.genocol.old

#'@title Generate a set of genotype (dosage) colors for XY-plots
#'@description For each of the dosages 0 to ploidy a color is generated,
#'and an additional one for samples with no dosage assigned.
#'@usage get.genocol(ploidy, pastel=TRUE)
#'@param ploidy any ploidy level >= 2
#'@param pastel if TRUE (default) light (pastel) colors are generated, else more
#'intense colors
#'@details the colors range from red (first item, dosage 0) via blue
#'(dosage ploidy/2) to green (dosage ploidy); the color for missing dosage is
#'grey
#'@return A vector of <ploidy + 2> color values; items 1 .. ploidy+1 for
#'dosage 0 .. ploidy and item ploidy+2 for missing dosages
#'@export
get.genocol <- function(ploidy, pastel=TRUE) {
  #return value: a set of (ploidy+1) colors to represent dosages,
  #from red (item 1, nulliplex) via blue (dosage ploidy/2) to green (item ploidy+1, dosage=ploidy), and grey (item ploidy+1) for
  #missing values
  #This order of colors was chosen because the transition colors between
  #red and green, which are not very clear, are avoided
  minint <- ifelse(pastel, 160, 0) #minimum intensity per color for pastels
  oddploidy <- ploidy %% 2 == 1
  if (oddploidy) ploidy <- 2 * ploidy
  RGB <- matrix(0, nrow=ploidy+2, ncol=3)
  colnames(RGB) <- c("red", "green", "blue")
  #RGB[1,] <-          c(   255  , minint,   minint) #red, dosage=0
  #RGB[ploidy/2+1,] <- c(minint/2,    255, minint/2) #green, dosage=ploidy/2
  #RGB[ploidy+1,] <-   c(minint  , minint,      255) #blue, dosage=ploidy
  RGB[1,] <-          c(255,      minint,        minint)              #red, dosage=0
  RGB[ploidy/2+1,] <- c(minint  , minint,        255)                 #blue, dosage=ploidy/2
  RGB[ploidy+1,] <-   c(minint/2, ifelse(pastel, 255, 200), minint/2) #green, dosage=ploidy
  RGB[ploidy+2,] <- col2rgb("grey")                                   #grey, dosage NA
  if (ploidy > 2) for (p in 2:(ploidy/2)) {
    #between nulliplex (red) and ploidy/2 (green): orange-brownish
    fact <- (p-1)/(ploidy/2)
    RGB[p,] <- RGB[1,] + round(fact * (RGB[ploidy/2+1,] - RGB[1,]))
  }
  if (ploidy > 2) for (p in (ploidy/2+2):(ploidy)) {
    #between ploidy/2 (green) and ploidy (blue): cyan-ish
    fact <- (p-(ploidy/2+1))/(ploidy/2)
    RGB[p,] <- RGB[ploidy/2+1,] + round(fact * (RGB[ploidy+1,] - RGB[ploidy/2+1,]))
  }
  if (oddploidy) RGB <- RGB[c(seq(1, ploidy+1, by=2), nrow(RGB)),]
  result <- apply(RGB, 1, function(x) rgb(red=x[1], green=x[2], blue=x[3],
                                          maxColorValue=255))
} #get.genocol

calc.R.threshold <- function(Rquantile, Rfactor=0.5, lowR=0) {
  #Rquantile: a vector of R quantile values per marker (e.g. 95% quantiles)
  #R threshold is Rfactor * Rquantile, but at least lowR
  #This function is used by makeFitPolyFiles and drawXYplots
  result <- Rfactor * Rquantile
  result[result<lowR] <- lowR
  result
} #calc.R.threshold

#'@title Draws an XY-plot showing allele signals and assigned dosages
#'@description Draws an XY-plot for one markers showing the X and Y signals
#'of each sample and their assigned dosages
#'@usage XY_plot(title="", XYdat, shift=0, ploidy=NULL,
#'genocol="grey", pch=1, cex=1,
#'sel.samples=as.character(unique(XYdat$SampleName)), omit.pch=".",
#'omit.col=NULL, sample.groups=list(), groups.col="black", groups.pch=1,
#'groups.cex=1, groups.rnd=FALSE, R.thresholds=NA, R.col="black", R.lty=1)
#'@param title the main title above the plot
#'@param XYdat a data frame with at least columns SampleName, X and Y;
#'column geno (if present) is also used. Contains data for all samples, one SNP.
#'geno is the dosage (0 .. <ploidy>), with NA or <ploidy+1> for missing dosage
#'info
#'@param shift a single integer, default 0: by how much should the geno
#'be shifted?
#'@param ploidy a single integer specifying the ploidy; default NA, only
#'needed if there is a column geno in XYdat
#'@param genocol a vector of color values to be used for plotting
#'the sel.samples according on their geno (dosage) value; if only one value is
#'given (default) all samples are plotted in that color
#'@param pch the plot character to plot the sel.samples; default 1 is an
#'open circle
#'@param cex the relative size of the sample symbols
#'@param sel.samples character vector (not a factor) with the names of the
#'important samples: those that must be plotted in colors genocol and symbol pch
#'@param omit.pch the plot character to use for the other samples, default a dot
#'@param omit.col vector of two colors to use for the other samples; the first
#'color is used for sample with a geno (dosage) value, the second color for
#'unscored samples; recycled if needed, with NA (default) the non-selected
#'samples are not plotted
#'@param sample.groups a list specifying samples to be highlighted in a
#'different color and/or symbol and/or size. For each group of samples the list
#'has one vector of sample names; the list may also be empty
#'@param groups.col a vector or color values, one for each item (vector of
#'sample names) in sample.groups; recycled if shorter than sample.groups
#'@param groups.pch a vector of plot symbols, one for each item in
#'sample.groups; recycled if shorter than sample.groups
#'@param groups.cex a vector of relative symbol sizes, one for each item in
#'sample.groups; recycled if shorter than sample.groups
#'@param groups.rnd FALSE (default) or TRUE. If FALSE, all samples in
#'sample.groups are drawn in the order in which they appear in XYdat; if TRUE
#'they are drawn in a random order. Note that the samples are never drawns in
#'group order (except is the samples are already in group order in XYdat and
#'groups.rnd=FALSE).
#'@param R.thresholds a vector of thresholds for R to plot; is NA (default)
#'no R thresholds are plotted
#'@param R.col a vector of color values for drawing R thresholds, one for
#'each value of R.thresholds; recycled if needed
#'@param R.lty a vector of line types for drawing the R thresholds, one for
#'each value of R.thresholds; recycled if needed
#'@return The function produces an XY-plot and returns NULL
#'@export
XY_plot <- function(title="", #the main title above the plot
                    XYdat, #the data, with or without geno column, for 1 SNP,
                    shift=0,
                    ploidy=NULL,
                    genocol="grey", #if a vector with colors then it should have
                    #               colors for geno 0 .. ploidy and one color for
                    #               missing geno values; i.e. its length should
                    #               be either 1 or ploidy+2
                    pch=1, #point character, 1 is open circle
                    cex=1, #relative size, 1 is default
                    sel.samples=as.character(unique(XYdat$SampleName)), #the important samples
                    omit.pch=".", #the pch for the other samples
                    omit.col=NULL, #the colors for the other samples
                    sample.groups=list(), #groups of samples to highlight
                    groups.col="black", #a color for each group
                    groups.pch=1, #a symbol for each group
                    groups.cex=1, #relative size of symbols for each group
                    groups.rnd=FALSE, #randomize order in which group samples are drawn
                    R.thresholds=NA, #thresholds for R (default must ne NA, not NULL!)
                    R.col="black",   #colors of the R thresholds
                    R.lty=1) {  # line type of R thresholds(1=solid, 2=dashed, 3=dotted)
  #XYdat: a data frame with at least columns SampleName, X and Y; column geno
  #       (if present) is also used. Contains data for all samples, one SNP.
  #       geno is the dosage (0 is nulliplex .. <ploidy> is <ploidy>plex),
  #       with NA or <ploidy+1> for missing dosage info
  #Produces a single XY plot of signal intensities, for one marker.
  #First all samples are plotted either all in one color (grey by default) or
  #with different colors for each dosage (if a column geno is present in XYdat)
  #On top of this, several groups of samples can be plotted (again), with one
  #color per group; e.g. to highlight parents, subpopulations etc)
  #Also the symbol and the size of the symbol to print for the backgound
  #samples and the samples in each group can be specified.
  #Finally different R threshold values can be indicated by lines, of which the
  #color and line type can be specified.
  max.xy <- max(c(XYdat$X, XYdat$Y), na.rm=TRUE)
  if (is.na(max.xy) || max.xy <= 0) {
    plot(0~0, col="white", main=title, xlim=c(0,1), ylim=c(0,1))
  } else {
    if ("geno" %in% names(XYdat)) {
      if (is.null(ploidy)) stop("XY_plot: ploidy must be specified")
      mvgeno <- ploidy+1
      if (!all(XYdat$geno %in% c(0:mvgeno, NA)))
        stop("XY_plot: geno contains invalid values")
      XYdat$geno[XYdat$geno == mvgeno] <- NA
      XYdat$geno <- shiftdosages(XYdat$geno, shift, ploidy)
      XYdat$geno[is.na(XYdat$geno)] <- mvgeno
      genocol <- rep(genocol, length.out=ploidy+2)
    } else {
      #no geno column
      mvgeno <- length(genocol) - 1 #the last value in genocol represents missing genotype
      XYdat$geno <- rep(mvgeno, nrow(XYdat)) #all unknown
    }
    #at this point there are no samples with geno NA left,
    #all those that were NA have been assigned mvgeno
    if (is.null(omit.col)) {
      #then we make all genotyped omitted samples black and the
      #non-genotyped omitted samples grey:
      omit.col <- c(rep("black", length(genocol)-1), "grey")
    } else omit.col <- rep(omit.col, length.out=length(genocol))
    if (is.factor(XYdat$SampleName)) {
      omit.samples <- setdiff(levels(XYdat$SampleName), sel.samples)
    } else omit.samples <- setdiff(unique(XYdat$SampleName), sel.samples)
    # first make the plot with all omitted samples in omit.col:
    # (those with missing geno first, then those with nonmissing geno)
    XY.nogeno <- XYdat[XYdat$geno == mvgeno,, drop=FALSE]
    XY.geno <- XYdat[XYdat$geno < mvgeno,, drop=FALSE]
    XYsub <- XY.nogeno[XY.nogeno$SampleName %in% omit.samples,, drop=FALSE]
    plot(XYsub$Y ~ XYsub$X,
         col=omit.col[XYsub$geno + 1], pch=omit.pch, cex=cex,
         main=title,
         xlim=c(0, max.xy), ylim=c(0, max.xy), xlab="X", ylab="Y")
    XYsub <- XY.geno[XY.geno$SampleName %in% omit.samples,, drop=FALSE]
    points(XYsub$Y ~ XYsub$X,
           col=omit.col[XYsub$geno + 1], pch=omit.pch, cex=cex)
    # next make the plot with all non-omitted samples in genocol:
    # (those with missing geno first, then those with nonmissing geno)
    XYsub <- XY.nogeno[XY.nogeno$SampleName %in% sel.samples,, drop=FALSE]
    points(XYsub$Y ~ XYsub$X,
           col=genocol[XYsub$geno + 1], pch=pch, cex=cex)
    XYsub <- XY.geno[XY.geno$SampleName %in% sel.samples,, drop=FALSE]
    points(XYsub$Y ~ XYsub$X,
           col=genocol[XYsub$geno + 1], pch=pch, cex=cex)
    #next, draw the groups of samples
    if (is.list(sample.groups) &&
        length(sample.groups) > 0) {
      groups.col <- rep(groups.col, length.out=length(sample.groups))
      groups.pch <- rep(groups.pch, length.out=length(sample.groups))
      groups.cex <- rep(groups.cex, length.out=length(sample.groups))
      # old version: draw groups successively, last group on top
      #for (grp in seq_along(sample.groups)) {
        #XYsub <- XYdat[XYdat$SampleName %in% sample.groups[[grp]],]
        #points(XYsub$Y ~ XYsub$X,
        #       col=groups.col[grp], pch=groups.pch[grp], cex=groups.cex[grp])
      #}
      # new version 20180219: draw all at the same time
      # (order of points is order in the XYdat data frame,
      # there is no group order with one obscuring the other)
      XYsub <- XYdat[XYdat$SampleName %in% unlist(sample.groups),, drop=FALSE]
      samp.col <- samp.pch <- samp.cex <- numeric(nrow(XYsub))
      for (grp in seq_along(sample.groups)) {
        grpsamp <- XYsub$SampleName %in% sample.groups[[grp]]
        samp.col[grpsamp] <- groups.col[grp]
        samp.pch[grpsamp] <- groups.pch[grp]
        samp.cex[grpsamp] <- groups.cex[grp]
      }
      samporder <- seq_len(nrow(XYsub))
      if (groups.rnd) {
        samporder <- sample(samporder)
      }
      points(XYsub$Y[samporder] ~ XYsub$X[samporder], col=samp.col[samporder],
             pch=samp.pch[samporder], cex=samp.cex[samporder])
    }
    #next, draw the R.threshold lines at X + Y = R.threshold:
    R.col <- rep(R.col, length.out=length(R.thresholds))
    R.lty <- rep(R.lty, length.out=length(R.thresholds))
    for (i in 1:length(R.thresholds)) {
      lines(c(0, R.thresholds[i]) ~ c(R.thresholds[i], 0),
            col=R.col[i], lty=R.lty[i]) #no line if R.threshold[i] is NA
    }
  }
  invisible(NULL)
} #XY_plot

#'@title Draws a series of pages, each with 6 XY-plots showing allele signals
#'and assigned dosages
#'@description Draws 6 XY-plots per page for a series of markers; each XY-plot is
#'drawn by function XY_plot
#'@usage drawXYplots(dat, markers=NA, out, genocol="grey", pch=1, cex=1,
#'sel.samples=as.character(unique(XYdat$SampleName)), omit.pch=".",
#'omit.col=c(rep("black", length(genocol)-1), "grey"), sample.groups=list(),
#'groups.col="black", groups.pch=1, groups.cex=1, groups.rnd=FALSE,
#'R.col="black", R.lty=1, drawRthresholds=FALSE,
#'Rthreshold.param=c(0.95, 0.5, 0), ploidy)
#'@param dat a data.frame with at least columns MarkerName, SampleName, X and Y;
#'column geno (if present) is also used
#'@param markers either a vector with names of markers to plot or a
#'data.frame with at least column MarkerName containing the names of the markers
#'to plot; default NA means all markers in dat. If markers is a data.frame that
#'also has a column shift, the geno values in dat will be shifted accordingly.
#'This allows to use (a selection from) the output of checkF1 (with or without
#'parameter shiftmarkers) as input for drawXYplots.
#'@param out base for filenames of output, will be extended with
#'_pagenumber.png; may include a path, but the directory where the plot files
#'are to be saved must already exist
#'@param genocol a vector of color values to be used for plotting
#'the sel.samples according on their geno (dosage) value; if only one value is
#'given (default) all samples are plotted in that color
#'@param pch the plot character to plot the sel.samples; default 1 is an
#'open circle
#'@param cex the relative size of the sample symbols
#'@param sel.samples character vector (not a factor) with the names of the
#'important samples: those that must be plotted in colors genocol and symbol pch
#'@param omit.pch the plot character to use for the other samples, default a dot
#'@param omit.col vector of two colors to use for the other samples; the first
#'color is used for sample with a geno (dosage) value, the second color for
#'unscored samples
#'@param sample.groups a list specifying samples to be highlighted in a
#'different color and/or symbol and/or size. For each group of sample the list
#'has one vector of sample names; the list may also be empty
#'@param groups.col a vector or color values, one for each item (vector of
#'sample names) in sample.groups; recycled if shorter than sample.groups
#'@param groups.pch a vector of plot symbols, one for each item in
#'sample.groups; recycled if shorter than sample.groups
#'@param groups.cex a vector of relative symbol sizes, one for each item in
#'sample.groups; recycled if shorter than sample.groups
#'@param groups.rnd FALSE (default) or TRUE. If FALSE, all samples in
#'sample.groups are drawn in the order in which they appear in XYdat; if TRUE
#'they are drawn in a random order. Note that the samples are never drawns in
#'group order (except if the samples are already in group order in XYdat and
#'groups.rnd=FALSE).
#'@param R.col a vector of color values for drawing R thresholds, one for
#'each value of Rthreshold.param; recycled if needed
#'@param R.lty a vector of line types for drawing the R thresholds, one for
#'each value of Rthreshold.param; recycled if needed
#'@param drawRthresholds FALSE (default) or TRUE: whether R thresholds should be
#'drawn
#'@param Rthreshold.param either a list of vectors each of length 3 (for
#'multiple R thresholds) or one vector of length 3 (for one R threshold). Each
#'vector defines one R threshold and is based on a specified quantile of the
#'distribution of R values for the current marker. The first number in each
#'vector is the R quantile, the second is a number to multiply that R quantile
#'with, and the third is the minimum value of the result. The default of
#'c(0.95, 0.5, 0) means that 0.5 * the 95% quantile of R is used (which is
#'always higher than 0, the minimum result). This is often a good cut-off
#'value to discard samples, or to signal markers with many samples below that
#'value.
#'@param ploidy a single integer specifying the ploidy, only needed if
#'dat contains a column geno
#'@return The function produces a series of pages with plots and returns NULL
#'@export
drawXYplots <- function(dat, markers=NA, out,
                        genocol="grey", #colors for dosage 0-ploidy, NA=ploidy+1
                        pch=1, #point character, 1 is open circle
                        cex=1, #relative size, 1 is default
                        sel.samples=as.character(unique(XYdat$SampleName)),
                        omit.pch=".", #point character for omitted samples
                        omit.col=c(rep("black",length(genocol)-1),"grey"),
                        sample.groups=list(), #groups of samples to highlight
                        groups.col="black", #a color for each group
                        groups.pch=1, #a symbol for each group
                        groups.cex=1, #relative size of symbols for each group
                        groups.rnd=FALSE, #randomize order in which group samples are drawn
                        R.col="black",   #colors of the R thresholds
                        R.lty=1,
                        drawRthresholds=FALSE,
                        Rthreshold.param=c(0.95, 0.5, 0),
                        ploidy) {
  #Draws a series of pages with 6 XYplots each, one XYplot per marker in
  #markers. The individual plots are drawn by function XY_plot.
  #dat: a data frame with at least columns MarkerName, SampleName, X and Y;
  #     column geno (if present) is also used.
  #markers: a vector with names of markers to plot, or marker numbers;
  #         default NA means all
  #out: base for filenames of output, will be extended with _pagenumber.png
  #drawRthresholds: if TRUE, R threshold lines are drawn according to:
  #Rthreshold.param: either a list of vectors each of length 3 (for multiple
  #                  R thresholds) or one vector of length 3 (for one R threshold).
  #                  the first element of the vector is the R quantile P-value,
  #                  the 2nd and 3rd are Rfactor and lowR as in
  #                  calc.R.threshold.
  #The other parameters are as in XY_plot

  #close devices - extra strong:
  allDevOff()
  allmarkers <- as.character(unique(dat$MarkerName))
  if (length(allmarkers) == 0) stop("no markers in dat")
  if ("geno" %in% names(dat) && missing(ploidy))
    stop("ploidy must be specified")
  if (is.data.frame(markers)) {
    #markernames must be in a column MarkerName, and if there is a column
    #shift, this shift is retained
    #This new (20170113) alternative allows to supply a selection of rows from
    #a checkF1 output to specify markers and possibly shifts
    if (!("MarkerName" %in% names(markers)))
      stop("no column 'MarkerName' in markers data.frame")
    if (!is.character(markers$MarkerName) && !is.factor(markers$MarkerName))
      stop("column Markername of markers must contain marker names")
    notNA <- !is.na(markers$MarkerName)
    if (sum(notNA) == 0) {
      #no markers specified, use all:
      mrkdf <- data.frame(MarkerName=allmarkers, shift=0, stringsAsFactors=FALSE)
    } else {
      if ("shift" %in% names(markers)) {
        if (!("geno" %in% names(dat))) stop("markers has a column 'shift' but dat has no column 'geno'")
        if (anyNA(markers$shift[notNA])) stop("no missing values allowed in column 'shift' of markers data.frame")
        if (!is.numeric(markers$shift) || any(as.integer(markers$shift) != markers$shift))
          stop("column 'shift' of markers data.frame must contain integer shift values")
        if (!is.null(ploidy) && !all(markers$shift %in% (-ploidy):ploidy))
          stop("invalid values in column 'shift' of markers data.frame")
        mrkdf <- data.frame(MarkerName=trim_shf(markers$MarkerName[notNA]),
                            shift=markers$shift[notNA], stringsAsFactors=FALSE)
      } else mrkdf <- data.frame(MarkerName=trim_shf(markers$MarkerName[notNA]),
                                 shift=0, stringsAsFactors=FALSE)
    }
  } else {
    #markers not a data.frame
    if (is.list(markers) || !is.vector(markers))
      stop("markers must be a character vector of marker names, or a data.frame with a column MarkerName")
    #now markers is a vector
    notNA <- !is.na(markers)
    if (sum(notNA) == 0) {
      #no markers specified, use all:
      mrkdf <- data.frame(MarkerName=allmarkers, shift=0, stringsAsFactors=FALSE)
    } else {
      if (!is.character(markers) && !is.factor(markers))
        stop("markers must contain marker names")
      mrkdf <- data.frame(MarkerName=trim_shf(markers[notNA]), shift=0, stringsAsFactors=FALSE)
    }
  }
  #now the mrkdf data.frame has been created; exclude markers not in allmarkers (but retain duplicated):
  pres <- mrkdf$MarkerName %in% allmarkers
  if (sum(pres) < nrow(mrkdf)) {
    if (sum(pres) == 0) stop("none of the specified markers occurs in dat")
    mrkdf <- mrkdf[pres,]
  }

  #create small subset of dat to select markers from:
  if (nrow(mrkdf) < 0.5 * length(allmarkers)) {
    dat <- dat[dat$MarkerName %in% mrkdf$MarkerName,]
  }
  maxpage <- ceiling(nrow(mrkdf) / 6)
  m.start <- 1
  m.end <- nrow(mrkdf)
  nr_on_page <- 0
  for (m in m.start:m.end) {
    mrknr <- padded(m, m.end)
    nr_on_page <- nr_on_page+1
    if (nr_on_page == 7) nr_on_page <- 1
    if (nr_on_page == 1) {
      pagenr <- padded((m-1) %/% 6 + 1, maxpage)
      png(paste(out, "_", pagenr, ".png", sep=""),
          width=7.2, height=11.7, units="in", res=96)
      layout(matrix(1:6, byrow=TRUE, ncol=2)) # 3*2 plots on page
    }
    XYdat <- dat[dat$MarkerName == mrkdf$MarkerName[m],]
    if ("R" %in% names(XYdat)) XYR <- XYdat$R else XYR <- XYdat$X + XYdat$Y
    if (drawRthresholds) {
      if (!is.list(Rthreshold.param)) Rthreshold.param <- list(Rthreshold.param)
      R.thresholds <- rep(NA, length(Rthreshold.param))
      for (i in seq_along(Rthreshold.param)) {
        Rquantile <- quantile(XYR, Rthreshold.param[[i]][1], na.rm=TRUE)
        R.thresholds[i] <- calc.R.threshold(Rquantile,
                                            Rthreshold.param[[i]][2],
                                            Rthreshold.param[[i]][3])
      }
    }
    else R.thresholds <- NA
    XY_plot(title=paste(mrknr, mrkdf$MarkerName[m]), XYdat=XYdat,
            shift=mrkdf$shift[m], ploidy=ploidy,
            genocol=genocol, pch=pch, cex=cex,
            sel.samples=sel.samples, omit.pch=omit.pch, omit.col=omit.col,
            sample.groups=sample.groups, groups.col=groups.col,
            groups.pch=groups.pch, groups.cex=groups.cex, groups.rnd=groups.rnd,
            R.thresholds=R.thresholds, R.col=R.col, R.lty=R.lty)

    #if last plot on page, close device:
    if (nr_on_page == 6 || m >= m.end) {
      #close devices - extra strong:
      allDevOff()
    }
  } # for m
  invisible(NULL)
} #drawXYplots

allDevOff <- function() {
  #close all graphics devices - extra strong
  #dvof <- 2
  #while (!is.null(dvof) && dvof > 1) dvof <- dev.off()
  while (!is.null(dev.list())) dev.off()
}

# fitPoly input and output files **********************************************

#'@title Make input files for fitPoly containing only selected rows and columns
#'@description Remove entire markers and ratios for samples within markers
#'from fitPoly inputfiles if their R levels are below some thresholds;
#'also keep only the MarkerName, SampleName, X, Y, R and ratio columns
#'@usage makeFitPolyFiles(datpoly, datdiplo=NA, out,
#'filetype=c("dat", "RData")[2], Rquantile=0.95, marker.threshold,
#'Rthreshold.param=c(0.95, 0.5, 0))
#'@param datpoly data.frame as produced by readFullDataTable, readAxiomSummary
#'or splitNrenameSamples, with at least columns MarkerName, SampleName, R and
#'ratio
#'@param datdiplo data.frame as produced by splitNrenameSamples, with at least
#'columns MarkerName, SampleName, R and ratio, containing the data for the
#'diploid samples; or NA (default)
#'@param out output files will be named out + _poly.RData and _diplo.RData
#'(or .dat if filetype is set to "dat"). If NA no files are written
#'@param filetype "RData" (default) or "dat": the format to save the data in
#'@param Rquantile a value between 0 and 1: the R (= X + Y) quantile on which
#'selection of markers is based
#'@param marker.threshold the minimum value of the Rquantile for a marker to be
#'selected
#'@param Rthreshold.param a vector of length 3, defining an R threshold for
#'selecting samples within each marker.
#'The first number is the R quantile, the second is a number to multiply that
#'R quantile with, and the third is the minimum value of the result.
#'The default of c(0.95, 0.5, 0) means that 0.5 * the 95% quantile of R is used
#'(which is always higher than 0, the minimum result). This is often a good
#'cut-off value to discard samples within a marker.
#'@details All rows for markers not meeting the marker.threshold for the
#'R quantile are removed from the original data.frames. For samples within
#'the remaining markers where the R value is below the marker-specific threshold
#'the ratio is set to NA, the row for that sample is not removed.
#'The same thresholds are applied to datpoly and datdiplo.
#'@return a list with two elements: datpoly is a filtered version of
#'parameter datpoly, and if param datdiplo was specified the second element
#'of the returned list is datdiplo: a filtered version of parameter datdiplo.
#'Either or both may have 0 rows left after filtering. If no datdiplo was
#'specified the second element of the list is NA.\cr
#'If parameter out is specified the element(s) of this list are also saved
#'as files.
#'@export
makeFitPolyFiles <- function(datpoly, datdiplo=NA,
                             out, filetype=c("dat", "RData")[2],
                             Rquantile=0.95, marker.threshold,
                             Rthreshold.param=c(0.95, 0.5, 0)) {

  cols <- match(c("MarkerName", "SampleName", "R", "ratio"),
                names(datpoly))
  if (anyNA(cols))
    stop("required columns missing from datpoly")
  savediplo <- is.data.frame(datdiplo)
  if (savediplo) {
    cols <- match(c("MarkerName", "SampleName", "R", "ratio"),
                  names(datdiplo))
    if (anyNA(cols))
      stop("required columns missing from datdiplo")
  }
  out <- out[1]
  if (!is.na(out)) {
    testfilename <- ifelse(filetype=="dat",
                           paste(out,"poly.dat", sep="_"),
                           paste(out,"poly.RData", sep="_"))
    if (!check.filename(testfilename))
      stop("out is not valid")
  }

  quant <- function(x, Rquantile) { #function nested within make.fitPoly.files
    quantile(x, probs=Rquantile, na.rm=TRUE, names=FALSE)
  }

  Rq <- tapply(datpoly$R, INDEX=datpoly$MarkerName, FUN=quant, Rquantile)

  #remove all markers where Rq is below threshold,
  #and retain only relevant columns:
  #(works well even if marker names are not valid identifiers)
  selmarkers <- names(Rq)[!is.na(Rq) & Rq >= marker.threshold]
  #remove these markers from datpoly:
  datpoly <- datpoly[datpoly$MarkerName %in% selmarkers,]
  datpoly$MarkerName <- factor(datpoly$MarkerName) #drop unused levels
  #remove same markers from Rq:
  Rq <- Rq[!is.na(Rq) & Rq >= marker.threshold]
  #calculate R threshold for samples within each remaining marker:
  if (Rthreshold.param[1] == Rquantile) Sq <- Rq else {
    Sq <- tapply(datpoly$R, INDEX=datpoly$MarkerName, FUN=quant,
                 Rquantile=Rthreshold.param[1])
  }
  Sq <- calc.R.threshold(Sq, Rthreshold.param[2], Rthreshold.param[3])
  Tq <- Sq[match(datpoly$MarkerName, names(Sq))]
  datpoly$ratio[datpoly$R < Tq] <- NA
  result <- list(datpoly=datpoly)
  if (savediplo) {
    datdiplo <- datdiplo[datdiplo$MarkerName %in% selmarkers,]
    datdiplo$MarkerName <- factor(datdiplo$MarkerName) #drop unused levels
    Tq <- Sq[match(datdiplo$MarkerName, names(Sq))]
    datdiplo$ratio[datdiplo$R < Tq] <- NA
    result$datdiplo <- datdiplo
  } else result$datdiplo <- NA

  if (!is.na(out)) {
    if (filetype=="dat") {
      writeDatfile(datpoly, paste(out,"poly.dat", sep="_"))
      if (savediplo) {
        writeDatfile(datdiplo, paste(out,"diplo.dat", sep="_"))
      }
    } else {
      #write RData files:
      save(datpoly, file=paste(out,"poly.RData", sep="_"))
      if (savediplo) {
        save(datdiplo, file=paste(out,"diplo.RData", sep="_"))
      }
    }
  }
  invisible(result)
} #makeFitPolyFiles

# some general functions ******************************************************

check.filename <- function(filename, overwrite=TRUE) {
  #Check if a filename is valid for writing by trying to create the file
  #If the filename contains a path, result will only be TRUE if the whole
  #path already exists
  #overwrite: if TRUE (default) an existing file of that name will be deleted
  #           (if it is not protected by being locked, read-only, owned by
  #           another user etc)
  filename <- filename[1] # we test only the first filename
  if (!overwrite && file.exists(filename)) return(FALSE)
  tryCatch(
    { suppressWarnings(
      if (file.create(filename)) {
        file.remove(filename)
        TRUE
      } else FALSE)
    }, error = function(e) { FALSE }
  )
} #check.filename

#'@title User-friendly wrapper for read.table
#'@description  A wrapper for read.table that has default parameter values for
#'reading tab-separated files as used in packages fitPoly and fitPolyTools
#'@usage readDatfile(file, header=TRUE, sep="\t", check.names=FALSE, ...)
#'@param file the name of the file which the data are to be read from
#'@param header a logical value indicating whether the file contains the names
#'of the data.frame columns as its first line
#'@param sep the field separator character
#'@param check.names logical. If FALSE (default), column names are not checked.
#'This is important if column names are the names of samples, markers etc that
#'may not be syntactically valid variable names.\cr
#'If TRUE then the names of the variables in the data frame are checked to
#'ensure that they are syntactically valid variable names. If necessary they
#'are adjusted (by make.names) so that they are, and also to ensure that there
#'are no duplicates
#'@param ... Further arguments to be passed to read.table
#'@return A data.frame containing a representation of the data in the file
#'@export
readDatfile <- function(file, header=TRUE, sep="\t", check.names=FALSE, ...) {
  #calls read.table with changed defaults, for reading a tab-separated
  #text file, and without modifying column names
  #The only compulsory argument is the file name
  return(read.table(file=file, header=header, sep=sep,
                    check.names=check.names, ...))
} #readDatfile

#'@title User-friendly wrapper for write.table
#'@description A wrapper for write.table that has default parameter values for
#'writing tab-separated files as used in packages fitPoly and fitPolyTools
#'@usage writeDatfile(x, file, quote=FALSE, sep="\t", na="",
#'row.names=FALSE, col.names=TRUE, logical_01=FALSE, ...)
#'@param x the object to be written, preferably a matrix or data frame.
#'If not, it is attempted to coerce x to a data frame.
#'@param file either a character string naming a file or a connection open for
#'writing. "" indicates output to the console.
#'@param quote a logical value (TRUE or FALSE) or a numeric vector. If TRUE,
#'any character or factor columns will be surrounded by double quotes.
#'If a numeric vector, its elements are taken as the indices of columns to
#'quote. In both cases, row and column names are quoted if they are written.
#'If FALSE (default), nothing is quoted.
#'@param sep the field separator string. Values within each row of x are
#'separated by this string.
#'@param na the string to use for missing values in the data
#'@param row.names either a logical value indicating whether the row names of x
#'are to be written along with x, or a character vector of row names to be
#'written.
#'@param col.names either a logical value indicating whether the column names
#'of x are to be written along with x, or a character vector of column names
#'to be written.
#'@param logical_01 FALSE (default) or TRUE; if TRUE, logical columns of x
#'are converted to 0/1/NA numeric values before writing the file
#'@param ... Further arguments to be passed to write.table
#'@return NULL
#'@export
writeDatfile <- function(x, file, quote=FALSE, sep="\t", na="",
                         row.names=FALSE, col.names=TRUE,
                         logical_01=FALSE, ...) {
  #calls write.table with changed defaults, for writing a tab-separated
  #text file
  #The only compulsory arguments are x and the file name
  if (logical_01) {
    #convert logicals to 0/1/NA for writing text file
    if (is.logical(x)) x[x] <- 1 else if (is.data.frame(x)) {
      for (col in seq_len(ncol(x))) {
        if ("logical" %in% class(x[, col])) {
          xc <- x[,col]
          xc[xc] <- 1 #with NAs present we cannot do x[x[,col],col] <- 1
          x[,col] <- xc
        }
      }
    }
  }
  write.table(x=x, file=file, quote=quote, sep=sep, na=na,
              row.names=row.names, col.names=col.names, ...)
} #writeDatfile

#padded returns a string representing positive integer x left-padded with 0's
#to the length of integer maxx or of max(x)
padded <- function(x, maxx=0) {
  formatC(x, width=nchar(max(x, maxx, na.rm=TRUE)), flag="0")
}



#*************************************************************************

gcd <- function(x,y) {
  #return the greatest common divisor of two integers
  #(vectorized: x and y may be vectors)
  r <- x %% y;
  return(ifelse(r, gcd(y, r), y))
} #gcd

gcd_all <- function(x) {
  #return the greatest common divisor of all elements of vector x
  if (length(x) == 1) x else {
    if (length(x) == 2) gcd(x[1], x[2]) else
      gcd_all(c(gcd(x[1], x[2]), x[3:length(x)]))
  }
} #gcd_all

polygamfrq <- function(ploidy, dosage) {
  #get the integer ratios of gametes with dosage 0:(ploidy/2)
  #from an autopolyploid parent (polysomic inheritance)
  #with given ploidy and dosage
  #the result can be converted to fractions by y <- x/(sum(x))
  gp <- ploidy/2 #gamete ploidy

  #using hypergeometric distribution:
  #dhyper(0:gp, dosage, ploidy-dosage, gp)

  #using n-over-k: function choose, in order to return integer ratios:
  A <- choose(dosage, 0:gp)
  B <- choose(ploidy-dosage, gp-(0:gp))
  #C <- choose(ploidy, gp)
  #A * B / C this would give the answer as fractions,
  # like dhyper(0:gp, dosage, ploidy-dosage, gp)
  D <- round(A*B)
  d <- gcd_all(D[D>0])
  round(D/d)
}

digamfrq <- function(ploidy, dosage) {
  #get the integer ratios of gametes with dosage 0:(ploidy/2)
  #from an allopolyploid parent (disomic inheritance)
  #with given ploidy and dosage
  gp <- ploidy/2 #gamete ploidy
  #first step: calculate all possible compositions of diploid parental genomes
  #i.e. the numbers of subgenomes with nulliplex, simplex and duplex dosages
  maxdup <- dosage %/% 2 #max nr of subgenomes with duplex dosage
  mindup <- max(dosage-gp, 0) #min nr of subgenomes with duplex dosage
  genomes <- matrix(integer((maxdup-mindup+1)*3), ncol=3)
  colnames(genomes) <- c("nulli", "sim", "du")
  for (i in 1:nrow(genomes)) {
    genomes[i,3] <- mindup + i - 1
    genomes[i,2] <- dosage - 2*genomes[i,3]
    genomes[i,1] <- gp - sum(genomes[i, 2:3])
  }
  #next step: per genomic composition calculate the gamete distributions:
  gam <- matrix(numeric(nrow(genomes)*(gp+1)), nrow=nrow(genomes)) #all 0
  colnames(gam) <- 0:gp
  for (i in 1:nrow(genomes)) {
    mingamdos <- genomes[i,3] #each duplex genome contributes 1
    maxgamdos <- gp-genomes[i,1] #each nulliplex genome contributes 0
    bincoeff <- choose(genomes[i,2], 0:genomes[i,2])

    # to return fractions:
    #gam[i, (mingamdos+1):(maxgamdos+1)] <- bincoeff / sum(bincoeff)

    #to return integer ratios:
    gam[i, (mingamdos+1):(maxgamdos+1)] <- bincoeff
    #the smallest bincoeff is always 1 so division by the gcd is not needed
  }
  #list(genomes, gam)
  gam
} #digamfrq

makeProgeny <- function(gametes1, gametes2) {
  #gametes1, gametes2: vectors of the integer ratios of the gametes
  #                    produced by the two parents
  #returns a vector with the integer ratios of the F1 progeny
  gp1 <- length(gametes1)-1 #gametes1 ploidy
  gp2 <- length(gametes2)-1 #gametes2 ploidy
  m <- matrix(0, nrow=gp1+gp2+1, ncol=gp2+1)
  for (i in 1:(gp2+1))
    m[i:(i+gp1), i] <- gametes1 * gametes2[i]
  D <- round(rowSums(m)) #the integer ratios of the F1 generation
  names(D) <- 0:(length(D)-1)
  d <- gcd_all(D[D>0])
  round(D/d) #the integer ratios of the F1 generation simplified
} #makeProgeny

# calcSegtypeInfo ***********************************************************
#'@title Build a list of segregation types
#'@description For each possible segregation type in an F1 progeny with given
#'parental ploidy (and ploidy2, if parent2 has a different ploidy than parent1)
#'information is given on the segregation ratios, parental dosages and whether
#'the segregation is expected under polysomic, disomic and/or mixed inheritance.
#'
#'@usage calcSegtypeInfo(ploidy, ploidy2=NULL)
#'
#'@param ploidy The ploidy of parent 1 (must be even, 2 (diploid) or larger).
#'@param ploidy2 The ploidy of parent 2. If omitted (default=NULL) it is
#'assumed to be equal to ploidy.
#'@details The names of the segregation types consist of a short sequence of
#'digits (and sometimes letters), an underscore and a final number. This is
#'interpreted as follows, for example segtype 121_0: 121 means that there
#'are three consecutive dosages in the F1 population with frequency ratios 1:2:1,
#'and the 0 after the underscore means that the lowest of these dosages is
#'nulliplex. So 121_0 means a segregation of 1 nulliplex : 2 simplex : 1 duplex.
#'A monomorphic F1 (one single dosage) is indicated as e.g. 1_4 (only one
#'dosage, the 4 after the underscore means that this is monomorphic quadruplex).
#'If UPPERCASE letters occur in the first part of the name these are interpreted
#'as additional digits with values of A=10 to Z=35, e.g. 18I81_0 means a
#'segregation of 1:8:18:8:1 (using the I as 18), with the lowest dosage being
#'nulliplex.\cr
#'With higher ploidy levels higher numbers (above 35) may be required.
#'In that case each unique ratio number above 35 is assigned a lowercase letter.
#'E.g. one segregation type in octaploids is 9bcb9_2: a 9:48:82:48:9
#'segregation where the lowest dosage is duplex.\cr
#'Segregation types with more than 5 dosage classes are considered "complex"
#'and get codes like c7e_1 (again in octoploids): this means a complex type
#'(the first c) with 7 dosage classes; the e means that this is the fifth
#'type with 7 classes. Again the _1 means that the lowest dosage is simplex.
#'It is always possible (and for all segtype names with lowercase letters it is
#'necessary) to look up the actual segregation ratios in the intratio item
#'of the segtype. For octoploid segtype c7e_1 this shows 0:1:18:69:104:69:18:1:0
#'(the two 0's mean that nulli- and octoplexes do not occur).
#'
#'@return A list with for each different segregation type (segtype) one item.
#'The names of the items are the names of the segtypes.
#'Each item is itself a list with components:
# (alternatives for itemize: enumerate and describe,
# see https://cran.r-project.org/web/packages/roxygen2/vignettes/formatting.html)
#'\itemize{
#'\item{freq: a vector of the ploidy+1 fractions of the dosages in the F1}
#'\item{intratios: an integer vector with the ratios as the simplest integers}
#'\item{expgeno: a vector with the dosages present in this segtype}
#'\item{allfrq: the allele frequency of the dosage allele in the F1}
#'\item{polysomic: boolean: does this segtype occur with polysomic inheritance?}
#'\item{disomic: boolean: does this segtype occur with disomic inheritance?}
#'\item{mixed: boolean: does this segtype occur with mixed inheritance (i.e. with
#'polysomic inheritance in one parent and disomic inheritance in the other)?}
#'\item{pardosage: integer matrix with 2 columns and as many rows as there
#'are parental dosage combinations for this segtype;
#'each row has one possible combination of dosages for
#'parent 1 (1st column) and parent 2 (2nd column)}
#'\item{parmode: logical matrix with 3 columns and the same number of rows as
#'pardosage. The 3 columns are named polysomic, disomic and mixed and
#'tell if this parental dosage combination will generate this
#'segtype under polysomic, disomic and mixed inheritance}
#'}
#'@examples
#'si4 <- calcSegtypeInfo(ploidy=4) # two 4x parents: a 4x F1 progeny
#'print(si4[["11_0"]])
#'
#'si3 <- calcSegtypeInfo(ploidy=4, ploidy2=2) # a 4x and a diplo parent: a 3x progeny
#'print(si3[["11_0"]])
#'@export
calcSegtypeInfo <- function(ploidy=4, ploidy2=NULL) {
  if (ploidy %% 2 != 0) stop("calcSegtypeInfo: odd ploidy not allowed")
  if (is.null(ploidy2)) ploidy2 <- ploidy else
    if (ploidy2 %% 2 != 0) stop("calcSegtypeInfo: odd ploidy2 not allowed")
  ploidyF1 <- (ploidy + ploidy2) / 2

  result <- list()
  #we check all parental combinations and enumerate the segtypes:
  for (p1 in 0:ploidy) {
    if (p1 %in% c(0, 1, ploidy-1, ploidy)) {
      p1ir <- matrix(polygamfrq(ploidy, p1), nrow=1)
      p1mode <- 0 #0=don't care
    } else {
      p1ir <- rbind(polygamfrq(ploidy, p1), digamfrq(ploidy, p1))
      p1mode <- c(1, rep(2, nrow(p1ir)-1)) #1=polysomic, 2=disomic
    }
    if (ploidy2 == ploidy) maxp2 <- p1 else maxp2 <- ploidy2
    for (p2 in 0:maxp2) {
      if (p2 %in% c(0, 1, ploidy2-1, ploidy2)) {
        p2ir <- matrix(polygamfrq(ploidy2, p2), nrow=1)
        p2mode <- 0 #0=don't care
      } else {
        p2ir <- rbind(polygamfrq(ploidy2, p2), digamfrq(ploidy2, p2))
        p2mode <- c(1, rep(2, nrow(p2ir)-1)) #1=polysomic, 2=disomic
      }
      for (i1 in 1:nrow(p1ir)) for (i2 in 1:nrow(p2ir)) {
        ir <- makeProgeny(p1ir[i1,], p2ir[i2,])
        #check if this segtype already exists in result:
        r <- length(result)
        while (r > 0 && sum(ir != result[[r]]$intratios) > 0) r <- r-1
        if (r == 0) {
          #new segtype:
          r <- length(result) + 1
          result[[r]] <- list()
          result[[r]]$freq <- ir/sum(ir)
          result[[r]]$intratios <- ir
          result[[r]]$expgeno <- which(ir > 0) - 1; names(result[[r]]$expgeno) <- NULL
          result[[r]]$allfrq <- sum(result[[r]]$freq * 0:ploidyF1) / ploidyF1
          result[[r]]$polysomic <- FALSE
          result[[r]]$disomic <- FALSE
          result[[r]]$mixed <- FALSE
          result[[r]]$pardosage <- matrix(c(p1, p2), nrow=1)
          colnames(result[[r]]$pardosage) <- c("P1", "P2")
          result[[r]]$parmode <- matrix(FALSE, nrow=1, ncol=3)
          colnames(result[[r]]$parmode) <- c("polysomic", "disomic", "mixed")
          p <- 1
        } else {
          #segtype already existed, see if parental combination
          #already existed:
          p <- nrow(result[[r]]$pardosage)
          while (p > 0 &&
                 sum(c(p1,p2) != result[[r]]$pardosage[p,]) > 0) p <- p-1
          if (p == 0) {
            #this parental dosage combination is new
            p <- nrow(result[[r]]$pardosage) + 1
            result[[r]]$pardosage <- rbind(result[[r]]$pardosage, c(p1, p2))
            result[[r]]$parmode <- rbind(result[[r]]$parmode,
                                         rep(FALSE, 3))
          }
        }
        result[[r]]$parmode[p, 1] <- result[[r]]$parmode[p, 1] ||
          (p1mode[i1] != 2 && p2mode[i2] != 2) #is polysomic?
        result[[r]]$parmode[p, 2] <- result[[r]]$parmode[p, 2] ||
          (p1mode[i1] != 1 && p2mode[i2] != 1) #is disomic?
        result[[r]]$parmode[p, 3] <- result[[r]]$parmode[p, 3] ||
          (p1mode[i1] == 0 || p2mode[i2] == 0 || p1mode[i1] != p2mode[i2]) #mixed?
      } #for i1 for i2
    } #for p2
  } #for p1
  #next we check if each segtype occurs with poly- or disomic or
  #mixed inheritance,
  #we add all reciprocal parental dosages,
  #we order all pardosage and parmode in order of increasing p1,
  #we check which intratios occur in segtypes with less than complex classes,
  #and we compile the numdosages and firstdosage statistics:
  numdosages <- firstdosage <- maxratio <- integer(length(result))
  complex <- 6 #the number of dosage classes from which we consider the segtype
  #             complex, i.e. we don't show the full segregation in the name
  simpleratios <- integer(0)
  for (r in seq_along(result)) {
    #calculate poly- and disomic and mixed:
    result[[r]]$polysomic <- sum(result[[r]]$parmode[, 1]) > 0
    result[[r]]$disomic <- sum(result[[r]]$parmode[, 2]) > 0
    result[[r]]$mixed <- sum(result[[r]]$parmode[, 3]) > 0
    #add the reverse of parents with unequal dosage:
    if (ploidy2 == ploidy) {
      #if ploidy2 not null, all parental combinations already done
      uneq <- result[[r]]$pardosage[,1] != result[[r]]$pardosage[,2]
      if (sum(uneq) > 0) {
        #there are rows with unequal parental dosage
        result[[r]]$pardosage <- rbind(result[[r]]$pardosage,
                                       result[[r]]$pardosage[uneq, 2:1])
        result[[r]]$parmode <- rbind(result[[r]]$parmode,
                                     result[[r]]$parmode[uneq,])
      }
    }
    #sort pardosage and parmode in order of increasing P1 dosage:
    or <- order(result[[r]]$pardosage[,1])
    result[[r]]$pardosage <- result[[r]]$pardosage[or,, drop=FALSE]
    result[[r]]$parmode <- result[[r]]$parmode[or,, drop=FALSE]
    #add the intratios in not complex:
    if (length(result[[r]]$expgeno) < complex)
      simpleratios <- union(simpleratios, result[[r]]$intratios)
    #calculate statistics for later ordering of result list:
    numdosages[r] <- length(result[[r]]$expgeno)
    firstdosage[r] <- result[[r]]$expgeno[1]
    maxratio[r] <- max(result[[r]]$intratios)
  }
  #next we reorder result from simple to complex segtypes
  or <- order(numdosages, maxratio, firstdosage)
  result <- result[or]
  #finally we assign names according to numdosages and firstdosage:
  numdosages <- numdosages[or]
  firstdosage <- firstdosage[or]
  maxratio <- maxratio[or]
  simple <- max(which(numdosages < complex))
  alldigits <- c("0","1","2","3","4","5","6","7","8","9", LETTERS)
  #we translate the intratios of non-complex segtypes to a sequence of digits
  #and uppercase letters (for ratios 10:35); if any ratios > 35 occur (as in
  #octo- and dodecaploids) we assign one of the lowercase letters for each of
  #these dosages (and assume there are less than 27 unique ratios above 35).
  #Just using the lowercase letter for ratios 36:61 does not work as already
  #with octoploids some intratios of non-complex segtypes are above 61.
  simpleratios <- sort(simpleratios[simpleratios > 35])
  for (r in 1:simple) {
    chr <- character(complex-1)
    name <- ""
    ir <- result[[r]]$intratios
    for (i in 1:(ploidyF1+1)) {
      if (ir[i] > 0) {
        if (ir[i] <= 35) {
          name <- paste(name, alldigits[ir[i] + 1], sep="")
        } else name <- paste(name, letters[which(simpleratios == ir[i])], sep="")
      }
    }
    names(result)[r] <- paste(name, firstdosage[r], sep="_")
  }
  if (complex <= max(numdosages)) for (n in complex:max(numdosages)) {
    #we want to assign codes like c6a_1, where:
    #c just means complex
    #6 is the number of different dosages in the segtype
    #c6a, c6b etc are segtypes with different ratio combinations, all with 6 dosages
    #_1 means that the first dosage is simplex
    selnum <- numdosages == n
    maxrat <- sort(unique(maxratio[selnum]))
    for (m in 1:length(maxrat)) {
      seltype <- selnum & maxratio == maxrat[m]
      if (length(maxrat) == 1) name <- paste("c", n, sep="") else
        name <- paste("c", n, letters[m], sep="")
      firstrange <- sort(unique(firstdosage[selnum]))
      for (f in firstrange) {
        sel <- seltype & firstdosage == f
        names(result)[sel] <- paste(name, f, sep="_")
      }
    }
  }
  result
} #calcSegtypeInfo

#'@title conversion of segtype code to F1 segregation ratios
#'@description Produce a matrix with the F1 segregation ratios (as integers)
#'for all segregation types for the given ploidy
#'@usage listSegtypes(ploidy, ploidy2=NULL)
#'@param ploidy the ploidy of parent 1 of the F1, or of the F1 itself if
#'ploidy2 is NULL
#'@param ploidy2 the ploidy of parent 2 of the F1
#'@details If ploidy2 is not NULL, ploidy and ploidy2 are the ploidy levels of
#'the two parents, and both must be even. If ploidy2 is NULL, ploidy is
#'the ploidy of the F1; if even, both parents are assumed to have the same
#'ploidy; if odd, parent 1 and parent 2 are asumed to have ploidy-1 and ploidy+1.
#'This function calls calcSegtypeInfo and is a convenience function; it is not
#'used by any other functions. For more information, including parental
#'dosages for each segregation, use calcSegtypeInfo and segtypeInfoSummary.
#'@return a matrix with one row for each segregation type and one column for
#'each possible F1 dosage, with the integer ratios of the dosages for each
#'segregation type.
#'@export
listSegtypes <- function(ploidy, ploidy2=NULL) {
  if (is.null(ploidy2)) {
    if (ploidy %% 2 != 0) {
      #odd ploidy
      ploidy2 <- ploidy +1
      ploidy <- ploidy - 1
    } else ploidy2 <- ploidy
  } else {
    if (ploidy %% 2 != 0 || ploidy2 %%2 != 0)
      stop("listSegtypes: odd parental ploidy not allowed")
  }
  ploidyF1 <- (ploidy+ploidy2)/2
  seginfo <- calcSegtypeInfo(ploidy=ploidy, ploidy2=ploidy2)
  result <- matrix(nrow=length(seginfo), ncol=ploidyF1 + 1,
                   dimnames=list(names(seginfo), 0:ploidyF1))
  for (i in 1:length(seginfo))
    result[i,] <- seginfo[[i]]$intratios
  result
} #listSegtypes

# selSegtypeInfo ***********************************************************
#'@title Restrict a list of segregation types to specified inheritance modes
#'@description From a list of segregation types as produced by calcSegtypeInfo,
#'this function selects only those segtypes that occur with polysomic,
#'disomic and/or mixed inheritance if the corresponding parameters are set to
#'TRUE, and from those segtypes only the parental dosages with the same
#'restrictions are retained.
#'@usage selSegtypeInfo(segtypeInfo, polysomic, disomic, mixed,
#'selfing=FALSE)
#'@param segtypeInfo a list as returned by calcSegtypeInfo
#'@param polysomic If TRUE all segtypes with poly TRUE are retained, and from
#'those segtypes all parental dosage combinations with parmode[,1] TRUE
#'@param disomic If TRUE all segtypes with di TRUE are retained, and from
#'those segtypes all parental dosage combinations with parmode[,2] TRUE
#'@param mixed If TRUE all segtypes with mixed TRUE are retained, and from
#'those segtypes all parental dosage combinations with parmode[,3] TRUE
#'@param selfing if TRUE only segtypes are retained that can be the result of
#'a selfing, i.e. both parental dosages equal (although it allows disomic
#'segtypes where the parental distribution of alleles over the subgenomes is
#'different, e.g. aa/bb x ab/ab -> 121_1); default FALSE
#'@return A list like segtypeInfo, modified as specified by parameters polysomic,
#'disomic and mixed
#'@export
selSegtypeInfo <- function(segtypeInfo, polysomic, disomic, mixed,
                           selfing=FALSE) {
  result <- list()
  selected <- integer(0)
  for (s in seq_along(segtypeInfo)) {
    if ((polysomic && segtypeInfo[[s]]$polysomic) ||
        (disomic && segtypeInfo[[s]]$disomic) ||
        (mixed && segtypeInfo[[s]]$mixed)) {
      stype <- segtypeInfo[[s]]
      selpar <- (polysomic & stype$parmode[,1]) |
        (disomic & stype$parmode[,2]) |
        (mixed & stype$parmode[,3])
      stype$pardosage <- stype$pardosage[selpar,, drop=FALSE]
      stype$parmode <- stype$parmode[selpar,, drop=FALSE]
      if (selfing) {
        selpar <- stype$pardosage[,1] == stype$pardosage[,2]
        stype$pardosage <- stype$pardosage[selpar,, drop=FALSE]
        stype$parmode <- stype$parmode[selpar,, drop=FALSE]
      }
      if (nrow(stype$pardosage) > 0) {
        selected <- c(selected, s)
        r <- length(result) + 1
        result[[r]] <- stype
      }
    }
  }
  names(result) <- names(segtypeInfo)[selected]
  result
} #selSegtypeInfo

# segtypeInfoSummary *********************************************************
#'@title Summarize the segtypeInfo list
#'@description From a list of segregation types as produced by calcSegtypeInfo
#'or selSegtypeInfo, produce a data frame that only lists the parental
#'dosage combinations for each segtype and whether these produce the
#'segtype under polysomic, disomic and/or mixed inheritance.
#'Useful to quickly look up which segtypes match a given parental dosage
#'combination.
#'@usage segtypeInfoSummary(segtypeInfo)
#'
#'@param segtypeInfo a list as returned by calcSegtypeInfo or selSegtypeInfo
#'@return A data frame summarizing the segtypeInfo list, with columns:
#'\itemize{
#'\item{segtype: the name of the segregation type (see details of
#'calcSegtypeInfo)}
#'\item{segtypenr: the sequential number of the segtype in parameter segtypeInfo}
#'\item{parent1, parent2: dosages of the two parents}
#'\item{par.poly, par.di, par.mixed: whether these parental dosages produce this
#'segtype under polysomic, disomic and/or mixed inheritance}
#\item{segtype.poly, segtype.di, segtype.mixed: whether this segtype does occur
#under polysomic, disomic and/or mixed inheritance}
#'}
#'@export
segtypeInfoSummary <- function(segtypeInfo) {
  totrows <- 0
  for (i in 1:length(segtypeInfo))
    totrows <- totrows + nrow(segtypeInfo[[i]]$pardosage)
  result <- data.frame (segtype=character(totrows),
                        segtypenr=integer(totrows),
                        parent1=integer(totrows),
                        parent2=integer(totrows),
                        par.poly=integer(totrows),
                        par.di=integer(totrows),
                        par.mixed=integer(totrows),
                        #segtype.poly=integer(totrows),
                        #segtype.di=integer(totrows),
                        #segtype.mixed=integer(totrows),
                        stringsAsFactors=FALSE)
  r <- 1
  for (i in 1:length(segtypeInfo)) {
    for (p in 1:nrow(segtypeInfo[[i]]$pardosage)) {
      result$segtype[r] = names(segtypeInfo)[i]
      result$segtypenr[r] = i
      result$parent1[r] = segtypeInfo[[i]]$pardosage[p,1]
      result$parent2[r] = segtypeInfo[[i]]$pardosage[p,2]
      result$par.poly[r] = segtypeInfo[[i]]$parmode[p,1]
      result$par.di[r] = segtypeInfo[[i]]$parmode[p,2]
      result$par.mixed[r] = segtypeInfo[[i]]$parmode[p,3]
      #result$segtype.poly[r] = 0 + segtypeInfo[[i]]$polysomic
      #result$segtype.di[r] = 0 + segtypeInfo[[i]]$disomic
      #result$segtype.mixed[r] = 0 + segtypeInfo[[i]]$mixed
      r <- r + 1
    }
  }
  result
} #segtypeInfoSummary

getMatchParents <- function(parGeno, seginfoItem) {
  #parGeno: integer vector of length 2 with the two parental (consensus)
  #         dosages (can each be NA)
  #seginfoItem: one of the items (segtypes) of a list as returned by
  #          calcSegtypeInfo
  #does the specified segtype match the parental genotypes?
  p <- which(!is.na(parGeno))
  if (length(p) == 0) {
    #both parental genotypes unknown
    return("Unknown")
  } else if (length(p) == 1) {
    #only one of the parental genotypes known
    if (sum(seginfoItem$pardosage[, p] == parGeno[p]) > 0)
      return("OneOK") else return("No")
  } else {
    #both parental genotypes known
    i <- which(seginfoItem$pardosage[, 1] == parGeno[1])
    if (length(i) == 1 && parGeno[2] == seginfoItem$pardosage[i, 2])
      return("Yes") else return("No")
  }
} #getMatchParents

getConsensusGeno <- function(geno, maxNAfrac=0.499,
                             lowconf.NAfrac=0.751) {
  #geno: a vector with scores (0..ploidy or NA) for all samples of one parent.
  #maxNAfrac: the maximum fraction of missing scores allowed to assign a
  #        high-confidence consensus geno score; if more are
  #        missing a missing geno score is returned (the default requires
  #        MORE than half of the samples to be scored, so one missing out of
  #        2 samples already causes a missing geno)
  #lowconf.NAfrac: if the fraction missing scores is more than maxNAfrac but
  #        less than lowconf.NAfrac, or if there is only one sample,
  #        a lowconf.geno score is assigned
  #returns a list with 3 components:
  # $geno: the consensus of the dosages in vector geno. NA if geno is empty,
  #        if all elements of geno are NA, if there are different non-NA
  #        values in geno, or if the fraction of NA values in geno larger than
  #        maxNAfrac
  # $lowconf.geno: if there is no conflict and the fraction of missing scores
  #        is between maxNAfrac and lowconf.NAfrac, the consensus is assigned
  #        as a "low-confidence" geno - this needs confirmation from the
  #        F1 segregation to be accepted
  # $conflict: TRUE if there are different non-NA values in geno, else FALSE
  # $NAfrac: the fraction NA in geno (0.5 if length(geno)==0)
  nwgeno <- NA; nwlogeno <- NA; conflict <- FALSE; NAfrac <- 0
  if (length(geno) == 0) NAfrac <- 0.5 else {   #we need a NAfrac value for q2
    NAfrac <- sum(is.na(geno) / length(geno))
    if (NAfrac < 1.0) {
      if (length(geno) == 1) nwlogeno <- geno else {
        if (max(geno, na.rm=TRUE) != min(geno, na.rm=TRUE)) {
          conflict <- TRUE
        } else  if (NAfrac <= maxNAfrac) {
          nwgeno <- min(geno, na.rm=TRUE)
        } else if (NAfrac <= lowconf.NAfrac) {
          nwlogeno <- min(geno, na.rm=TRUE)
        }
      }
    }
  }
  list(geno=nwgeno, lowconf.geno=nwlogeno, conflict=conflict, NAfrac=NAfrac)
} #getConsensusGeno

shiftdosages <- function(dosages, shift, ploidy) {
  #dosages: vector of integer dosages, shift: one shift value
  dosages <- dosages + shift
  below <- !is.na(dosages) & dosages < 0
  dosages[below] <- 0
  above <- !is.na(dosages) & dosages > ploidy
  dosages[above] <- ploidy
  dosages
} #shiftdosages

checkSelfing <- function(parent1, parent2) {
  parent1 <- sort(as.character(parent1[!is.na(parent1)]))
  parent2 <- sort(as.character(parent2[!is.na(parent2)]))
  length(parent1) == length(parent2) && length(parent1) > 0 &&
    sum(parent1 != parent2) == 0
} #checkSelfing

linterpol <- function(x, pnt1, pnt2) {
  #linear interpolation: returns the y values matching the x values (vector)
  #on the line through points pnt1 and pnt2 (both are vectors of length 2)
  #possible error if both X-coordinates equal not checked or caught
  a <- (pnt2[2] - pnt1[2]) / (pnt2[1] - pnt1[1])
  b <- pnt1[2] - a * pnt1[1]
  a * x + b
} #linterpol

calc_q1 <- function(Pvalue, fracInvalid, bestfit, bestParentfit,
                    Pvalue_threshold, fracInvalid_threshold) {
  #q1 is a measure of the fit of selfit. It is based on Pvalue and the fraction
  #of valid F1 scores. If either is below their threshold the q1 value is 0.
  #Also, if bestParentfit is different from bestfit this indicates a segregation
  #distortion.
  #x compares bestfit and bestParentfit:
  if (bestfit == bestParentfit) {
    x <- 1
  } else {
    x <- 0.5
  }
  #Version 20150220: nonzero but low y values also below P=0.001 depending
  #on threshold:
  if (Pvalue < Pvalue_threshold) y <- 0 else {
    if (Pvalue < 0.01) {
      if (Pvalue < 0.001) y <- linterpol(Pvalue, c(0,0), c(0.001, 0.05)) else
        y <- linterpol(Pvalue, c(0.001, 0.05), c(0.01, 0.6))
    } else if (Pvalue < 0.15) {
      if (Pvalue < 0.05) y <- linterpol(Pvalue, c(0.01, 0.6), c(0.05, 0.8)) else
        y <- linterpol(Pvalue, c(0.05, 0.8), c(0.15, 1))
    } else y <- 1
  }
  #z is determined by fraction F1 scores in valid categories for the selfit
  #segtype. z is equal to 1-fracInvalid, but below 1-frqinvalid.threshold z=0.
  z <- ifelse(fracInvalid > fracInvalid_threshold, 0, 1 - fracInvalid)
  x * y * z
} #calc_q1

calc_q2 <- function(par.conflicts, par.NAfrac, matchParents,
                    parentsScoredWithF1) {
  #q2 is about how well the parents are scored and fit with the selfit segtype
  if (parentsScoredWithF1) {
    #q2 is determined not by matchParents but by the amount
    #of conflicts and missing values in the parents:
    q2 <- max(0, 1 - 0.5 * sum(par.conflicts) - 0.5 * sum(par.NAfrac))
    #In the version of 20150220 matchParents cannot be "No": selfit is always
    #bestParentfit (sometimes after promoting lowconf parental scores).
    #Therefore the remaining matchParents codes Unknown, OneOK and Yes are
    #correlated with the q2 score, no point to use the matchParents
  } else {
    # !parentsScoredWithF1
    # In this case par.conflicts and par.NAfrac are not used and the information
    # on the match between parents and F1 segtype has to come from matchParents.
    # Note that matchParents=="No" cannot occur in the versions since 20150220.
    q2 <- match(matchParents,
                c("No", "Unknown", "OneOK", "Yes")) #1:4 or NA
    #q2 <-  c(0, 0.5, 0.9, 1)[q2] #translate into 0..1 values
    q2 <-  c(0, 0.65, 0.9, 1)[q2] #translate into 0..1 values
    #version of 20160324: the penalty for Unknown is less if the parents
    #are not scored with the F1: there is still the lack of information, but
    #the lack of scores does not mean that the (F1) genotyping is less reliable.
  }
  q2
} #calc_q2

calc_q3 <- function(F1.NAfrac, fracNA_threshold) {
  #q3 is about the fraction missing F1 scores.
  #If that is larger than NA.threshold, q3=0;
  #else q3 depends on the fraction scored F1's
  if (F1.NAfrac > fracNA_threshold) {
    q3 <- 0
  } else if (fracNA_threshold > 0.2) {
    if (F1.NAfrac >= 0.2) {
      # 0.2<=F1.NAfrac<threshold
      q3 <- linterpol(F1.NAfrac, c(0.2, 0.5), c(fracNA_threshold, 0))
    } else {
      # 0<=F1.NAfrac<0.2
      q3 <- linterpol(F1.NAfrac, c(0, 1), c(0.2, 0.5))
    }
  } else {
    # fracNA_threshold <= 0.2, 0<=F1.NAfrac<threshold
    #: let q3 go from 1 at frac=0 to 0.5 at frac=threshold:
    q3 <- linterpol(F1.NAfrac, c(0, 1), c(fracNA_threshold, 0.5))
  }
  q3
} #calc_q3

calc_qall <- function(Pvalue_threshold, fracInvalid_threshold,fracNA_threshold,
                      Pvalue, fracInvalid, F1.NAfrac,
                      matchParents,
                      bestfit, bestParentfit,
                      par.conflicts, par.NAfrac,
                      critweight,
                      parentsScoredWithF1) {
  # 3 thresholds: can be disabled by setting to 0, 1 and 1
  # Pvalue, fracInvalid and matchParents are properties of the selfit segtype
  # F1.NAfrac is the fraction missing scores among F1 samples,
  # bestfit and bestParentfit are two segtype, selfit is one of these
  # par.conflicts is a logical vector: is there a conflict for P1, P2?
  # par.NA is the number of NA scores for the samples of P1, P2
  # critweight is a numeric vector of length 3 (3 quality criteria)
  #            with their weights, or NA (in that case qall = q1*q2*q3, not a
  #            weighted average)
  # return value: a vector of length 4 or 5. The last (two) components of the
  #               vector are qall_mult (obtained by multiplication of q1..q3),
  #               and if critweights is supplied, qall_weights (a weighted
  #               average of q1..q3)
  q <- numeric(ifelse (length(critweight) == 3, 5, 4)) #last (two) are qall
  q[1] <- calc_q1(Pvalue, fracInvalid, bestfit, bestParentfit,
                  Pvalue_threshold, fracInvalid_threshold)
  q[2] <- calc_q2(par.conflicts, par.NAfrac, matchParents,
                  parentsScoredWithF1)
  q[3] <- calc_q3(F1.NAfrac, fracNA_threshold)
  q[4] <- prod(q[1:3]) #q[4] is qall_mult
  if (length(critweight) == 3) {
    q[5] <- sum(q[1:3] * critweight) / sum(critweight) #q[5] is qall_weights
  }
  q
} #calc_qall

# checkF1 *******************************************************************
#'@title Identify the best-fitting F1 segregation types
#'@description For a given set of F1 and parental samples, this function
#'finds the best-fitting segregation type. It can perform a dosage shift prior
#'to selecting the segregation type.
#'
#'@usage checkF1(scores, parent1, parent2, F1, ancestors=character(0),
#'polysomic, disomic, mixed, ploidy, ploidy2, outfile,
#'critweight=c(1.0, 0.4, 0.4),
#'scores_long=TRUE, Pvalue_threshold=0.0001, fracInvalid_threshold=0.05,
#'fracNA_threshold=0.25, shiftmarkers, parentsScoredWithF1,
#'shiftParents=parentsScoredWithF1, showAll=FALSE, append_shf=FALSE)
#'
#'@param scores A data frame as read from the scores file produced by function
#'fitMarkers of package fitPoly (or a subset with at least columns
#'MarkerName, SampleName and geno), or a data frame as returned by function
#'scores2wide. In the first case (default) parameter scores_long must be TRUE,
#'in the second case it must be FALSE.
#'@param parent1 character vector with the sample names of parent 1
#'@param parent2 character vector with the sample names of parent 2
#'@param F1 character vector with the sample names of the F1 individuals
#'@param ancestors character vector with the sample names of any other
#'ancestors or other samples of interest. The dosages of these samples will
#'be shown in the output (shifted if shiftParents TRUE) but they are not used
#'in the selection of the segregation type.
#'@param polysomic if TRUE at least all polysomic segtypes are considered;
#'if FALSE these are not specifically selected (but if e.g. disomic is TRUE,
#'any polysomic segtypes that are also disomic will still be considered)
#'@param disomic if TRUE at least all disomic segtypes are considered (see
#'param polysomic)
#'@param mixed if TRUE at least all mixed segtypes are considered (see
#'param polysomic). A mixed segtype occurs when inheritance in one parent is
#'polysomic (random chromosome pairing) and in the other parent disomic (fully
#'preferential chromosome pairing)
#'@param ploidy The ploidy of parent 1 (must be even, 2 (diploid) or larger).
#'@param ploidy2 The ploidy of parent 2. If omitted it is
#'assumed to be equal to ploidy.
#'@param outfile the tab-separated text file to write the output to; if NA a
#'temporary file checkF1.tmp is created in the current working directory
#'and deleted at end
#'@param critweight NA or a numeric vector containing the weights of three quality
#'criteria; do not need to sum to 1. If NA, the output will not contain a
#'column qall_weights. Else the weights specify how qall_weights will be
#'calculated from quality parameters q1, q2 and q3.
#'@param scores_long TRUE if scores is in "long format", FALSE if it is in
#'"wide format" (see parameter scores)
#'@param Pvalue_threshold a minimum threshold value for the Pvalue of the
#'bestParentfit segtype (with a smaller Pvalue the q1 quality parameter will
#'be set to 0)
#'@param fracInvalid_threshold a maximum threshold for the fracInvalid of the
#'bestParentfit segtype (with a larger fraction of invalid dosages in the F1
#'the q1 quality parameter will be set to 0)
#'@param fracNA_threshold a maximum threshold for the fraction of unscored F1
#'samples (with a larger fraction of unscored samples in the F1
#'the q3 quality parameter will be set to 0)
#'@param shiftmarkers if specified, shiftmarkers must be a data frame with
#'columns MarkerName and shift; for the markernames that match exactly
#'(upper/lowercase etc) those in scores, the dosages are increased by the
#'amount specified in column shift,
#'e.g. if shift is -1, dosages 2..ploidy are converted to 1..(ploidy-1)
#'and dosage 0 is a combination of old dosages 0 and 1, for all samples.
#'The segregation check
#'is then performed with the shifted dosages.
#'A shift=NA is allowed, these markers will not be shifted.
#'The sets of markers in scores and shiftmarkers
#'may be different, but markers may occur only once in shiftmarkers.
#'A column shift is added at the end of the returned data frame.\cr
#'If parameter shiftParents is TRUE, the parental and ancestor scores are
#'shifted as the F1 scores, if FALSE they are not shifted.
#'@param parentsScoredWithF1 TRUE means parents are scored in the same experiment
#'and the same fitPoly run as the F1, else FALSE.
#'If TRUE, the fraction missing scores and conflicts in the parents
#'tell something about the quality of the scoring. If FALSE
#'(e.g. when the F1 is triploid and the parents are diploid and tetraploid) the
#'quality of the F1 scores can be independent of that of the parents.\cr
#'If not specified, TRUE is assumed if ploidy2 == ploidy and FALSE if
#'ploidy2 != ploidy
#'@param shiftParents only used if parameter shiftmarkers is specified. If TRUE,
#'apply the shifts also to the parental and ancestor scores.
#'By default TRUE if parentsScoredWithF1 is TRUE, else FALSE
#'@param showAll (default FALSE) if TRUE, for each segtype 3 columns
#'are added to the returned data frame with the frqInvalid, Pvalue and
#'matchParents values for these segtype (see the description of the return value)
#'@param append_shf if TRUE and parameter shiftmarkers is specified, _shf is
#'appended to all marker names where shift is not 0. This is not required for
#'any of the functions in this package but may prevent duplicated marker names
#'when using other software.

#'@details For each marker is tested how well the different segregation types
#'fit with the observed parental and F1 dosages. The results are summarized
#'by columns bestParentfit (which is the best fitting segregation type,
#'taking into account the F1 and parental dosages) and columns qall_mult
#'and/or qall_weights (how good is the fit of the bestParentfit segtype: 0=bad,
#'1=good).\cr
#'Column bestfit in the results gives the segtype best fitting the F1
#'segregation without taking account of the parents. This bestfit segtype is
#'used by function correctDosages, which tests for possible "shifts" in
#'the marker models. Both bestfit and bestParentfit are restricted by the
#'parameters polysomic, disomic and mixed. Further they are restricted to
#'segtypes that can only occur when the parental dosages are equal, if
#'parent1 and parent2 list the same samples (but not if both are empty).\cr
#'In case the parents are not scored together with the F1 (e.g. if the F1 is
#'triploid and the parents are diploid and tetraploid) the scores data frame
#'should be edited to contain the parental as well as the F1 scores.
#'In case the diploid and tetraploid parent are scored in the same run of
#'function fitMarkers (package fitPoly)
#'the diploid is initially scored as nulliplex-duplex-quadruplex (dosage 0, 2
#'or 4); that must be converted to the true diploid dosage scores (0, 1 or 2).
#'Similar corrections are needed with other combinations, such as a diploid
#'parent scored together with a hexaploid population etc.
#'
#'@return A data frame with one row per markers, with the following columns:
#'\itemize{
#'\item{m: the sequential number of the marker (as assigned by fitPoly)}
#'\item{MarkerName: the name of the marker, with _shf appended if the marker
#'is shifted and append_shf is TRUE}
#'\item{parent1: consensus dosage score of the samples of parent 1}
#'\item{parent2: consensus dosage score of the samples of parent 2}
#'\item{F1_0 ...	F1_<ploidy>: the number of F1 samples with dosage scores
#'0 ... <ploidy>}
#'\item{F1_NA: the number of F1 samples with a missing dosage score}
#'\item{sample names of parents and ancestors: the dosage scores for those
#'samples}
#'\item{bestfit: the best fitting segtype, considering only the F1 samples}
#'\item{frqInvalid_bestfit: for the bestfit segtype, the frequency of F1 samples
#'with a dosage score that is invalid (that should not occur). The frequency is
#'calculated as the number of invalid samples divided by the number of non-NA
#'samples}
#'\item{Pvalue_bestfit: the chisquare test P-value for the observed
#'distribution of dosage scores vs the expected fractions. For segtypes
#'where only one dosage is expected (1_0, 1_1 etc) the binomial probability of
#'the number of invalid scores is given, assuming an error
#'rate of seg_invalidrate (hard-coded as 0.03)}
#'\item{matchParent_bestfit: indication how the bestfit segtype matches the
#'consensus dosages of parent 1 and 2: "Unknown"=both parental
#'dosages unknown; "No"=one or both parental dosages known
#'and conflicting with the segtype; "OneOK"= only one parental
#'dosage known, not conflicting with the segtype; "Yes"=both
#'parental dosages known and combination matching with
#'the segtype. This score is initially assigned based on
#'only high-confidence parental consensus scores; if
#'low-confidence dosages are confirmed by the F1, the
#'matchParent for (only) the selected segtype is
#'updated, as are the parental consensus scores.}
#'\item{bestParentfit: the best fitting segtype that does not conflict with
#'the parental consensus scores}
#'\item{frqInvalid_bestParentfit, Pvalue_bestParentfit,
#'matchParent_bestParentfit: same as the corresponding columns for bestfit.
#'Note that matchParent_bestParentfit cannot be "No".}
#'\item{q1_segtypefit: a value from 0 (bad) to 1 (good), a measure of the fit of
#'the bestParentfit segtype based on Pvalue, invalidP and whether bestfit is
#'equal to bestParentfit}
#'\item{q2_parents: a value from 0 (bad) to 1 (good), based either on the
#'quality of the parental scores (the number of missing scores and of
#'conflicting scores, if parentsScoredWithF1 is TRUE) or on matchParents
#'(No=0, Unknown=0.65, OneOK=0.9, Yes=1, if parentsScoredWithF1 is FALSE)}
#'\item{q3_fracscored: a value from 0 (bad) to 1 (good), based on the fraction
#'of F1 samples that have a non-missing dosage score}
#'\item{qall_mult: a value from 0 (bad) to 1 (good), a summary quality score
#'equal to the product q1*q2*q3. Equal to 0 if any of these is 0, hence
#'sensitive to thresholds; a natural selection criterion would be to accept
#'all markers with qall_mult > 0}
#'\item{qall_weights: a value from 0 (bad) to 1 (good), a weighted average of
#'q1, q2 and q3, with weights as specified in parameter critweight. This column is
#'present only if critweight is specified. In this case there is no "natural"
#'threshold; a threshold for selection of markers must be obtained by inspecting
#'XY-plots of markers over a range of qall_weights values}
#'\item{shift: if shiftmarkers is specified a column shift is added with
#'for all markers the applied shift (for the unshifted markers the shift value
#'is 0)}
#'}
#'qall_mult and/or qall_weights can be used to compare the quality
#'of the SNPs within one analysis and one F1 population but not between analyses
#'or between different F1 populations.\cr
#'If parameter showAll is TRUE there are 3 additional columns for each
#'segtype with names frqInvalid_<segtype>, Pvalue_<segtype> and
#'matchParent_<segtype>; see the corresponding columns for bestfit for an
#'explanation. These extra columns are inserted directly before the bestfit
#'column.
#'@export
checkF1 <- function(scores,
                    parent1,
                    parent2,
                    F1,
                    ancestors=character(0),
                    polysomic, disomic, mixed,
                    ploidy, ploidy2,
                    outfile,
                    critweight=c(1.0, 0.4, 0.4),
                    scores_long=TRUE,
                    Pvalue_threshold=0.0001,
                    fracInvalid_threshold=0.05,
                    fracNA_threshold=0.25,
                    shiftmarkers,
                    parentsScoredWithF1,
                    shiftParents=parentsScoredWithF1,
                    showAll=FALSE,
                    append_shf=FALSE) {
  if (ploidy %% 2 != 0) stop("checkF1: odd ploidy not allowed")
  if (missing(ploidy2) || is.na(ploidy2)) ploidy2 <- ploidy else
    if (ploidy2 %% 2 != 0) stop("checkF1: odd ploidy2 not allowed")
  ploidyF1 <- (ploidy + ploidy2) / 2

  if (missing(parentsScoredWithF1))
    parentsScoredWithF1 <- ploidy2 == ploidy

  #check critweight
  if (is.null(critweight) || is.na(critweight[1])) critweight <- NA else {
    if (!is.numeric(critweight) || length(critweight) != 3 ||
        sum(is.na(critweight)) > 0 || sum(critweight) == 0) {
      stop("invalid critweight")
    }
  }
  #check shiftmarkers:
  if (missing(shiftmarkers)) shiftmarkers <- NA
  if (is.data.frame(shiftmarkers)) {
    if (sum(is.na(match(c("MarkerName", "shift"), names(shiftmarkers)))) > 0)
      stop("checkF1: shiftmarkers must have columns MarkerName and shift")
    shiftmarkers$shift[is.na(shiftmarkers$shift)] <- 0
    if (nrow(shiftmarkers) == 0) shiftmarkers <- NA else {
      if (sum(is.na(shiftmarkers$shift)) > 0 ||
          sum(!(shiftmarkers$shift %in% ((-ploidyF1):ploidyF1))) > 0)
        stop("checkF1: shiftmarkers contains invalid shift values")
      shiftmarkers$MarkerName <- as.character(shiftmarkers$MarkerName)
      if (length(unique(shiftmarkers$MarkerName)) < nrow(shiftmarkers))
        stop("checkF1: some markernames occur more than once in shiftmarkers")
    }
  }
  seg_invalidrate <- 0.03 #assumed percentage of scoring errors leading to invalid dosages
  file.del <- is.na(outfile) || outfile == ""
  if (file.del) outfile <- "checkF1.tmp"
  if (!checkFilename(outfile))
    stop(paste("checkF1: cannot write file", outfile))
  #fill in missing vectors and/or clean up:
  if (missing(parent1) || is.logical(parent1)) parent1 <- character(0)
  if (missing(parent2) || is.logical(parent2)) parent2 <- character(0)
  if (missing(ancestors) || is.logical(ancestors)) ancestors <- character(0)
  parent1 <- sort(as.character(parent1[!is.na(parent1)]))
  parent2 <- sort(as.character(parent2[!is.na(parent2)]))
  ancestors <- as.character(ancestors[!is.na(ancestors)])
  #select the relevant segtypes:
  if (!polysomic && !disomic && !mixed)
    stop("checkF1: at least one of polysomic, disomic and mixed must be TRUE")
  seginfo <- calcSegtypeInfo(ploidy, ploidy2)
  allsegtypenames <- names(seginfo)
  seginfo <- selSegtypeInfo(seginfo, polysomic, disomic, mixed,
                            selfing=checkSelfing(parent1, parent2))
  seginfoSummary <- segtypeInfoSummary(seginfo)

  # names of quality parameters:
  qnames <- c("q1_segtypefit"," q2_parents","q3_fracscored","qall_mult")
  if (length(critweight) == 3) qnames[5] <- "qall_weights"

  # check scores:
  if (scores_long) {
    if (anyNA(match(c("marker", "MarkerName", "SampleName", "geno"),
                    names(scores))))
      stop("checkF1: some required columns missing from scores")
    if (anyNA(scores$marker) || anyNA(scores$MarkerName) ||
        anyNA(scores$SampleName))
      stop("checkF1: no missing values allowed in columns marker, MarkerName or SampleName of scores")
    mrknames <- sort(as.character(unique(scores$MarkerName)))
  } else {
    if (length(names(scores)) < 2 ||
        !identical(c("marker", "MarkerName"), names(scores)[1:2]))
      stop("checkF1: first two columns of scores should be marker and MarkerName")
    if (anyNA(scores$marker) || anyNA(scores$MarkerName))
      stop("checkF1: no missing values allowed in columns marker and MarkerName of scores")
    mrknames <- as.character(scores$MarkerName)
  }

  createResultsdf <- function(mrkcount) {
    #function within checkF1
    saf <- getOption("stringsAsFactors")
    options(stringsAsFactors = FALSE)
    #create the results data frame for this batch:
    mat <- matrix(
      integer((2+ploidyF1+2+length(parent1)+length(parent2)+length(ancestors)) *
                mrkcount), nrow=mrkcount)
    colnames(mat) <- c("parent1", "parent2", paste("F1", 0:ploidyF1, sep="_"),
                       "F1_NA", parent1, parent2, ancestors)
    #segdf is the template for all individual segtype data frames
    #(sets of 3 columns, and one row for each marker):
    segdf <- data.frame(frqInvalid=character(mrkcount), #printed as formatted strings
                        Pvalue=character(mrkcount),
                        matchParent=factor(rep(NA,mrkcount),
                                           levels=c("No","OneOK","Unknown","Yes")))
    #bres is the data frame that will hold the results (one row per marker)
    bres <- data.frame(
      m=integer(mrkcount),
      MarkerName=character(mrkcount),
      mat
    )
    if (showAll) for (sg in 1:length(seginfo)) {
      df <- segdf
      names(df) <- paste(names(df), names(seginfo)[sg], sep="_")
      bres <- data.frame(bres, df)
    }
    segdf <- data.frame(
      fit = factor(rep(NA, mrkcount), levels=names(seginfo)),
      segdf
    )
    for (fi in c("bestfit", "bestParentfit")) {
      df <- segdf
      names(df) <- c(fi,
                     paste(names(df)[2:4], fi, sep="_"))
      bres <- data.frame(bres, df)
    }
    bres <- data.frame (
      bres,
      q1_segtypefit=character(mrkcount),
      q2_parents=character(mrkcount),
      q3_fracscored=character(mrkcount),
      qall_mult=character(mrkcount)
    )
    if (length(critweight) == 3)
      bres <- data.frame(
        bres,
        qall_weights=character(mrkcount)
      )
    if (is.data.frame(shiftmarkers))
      bres <- data.frame(
        bres,
        shift=integer(mrkcount))
    options(stringsAsFactors = saf)
    bres
  } #createResultsdf within checkF1

  segtypeBestSelcrit <- function(candidates) {
    #function within checkF1
    #candidates is a vector indexing the segtypes in results from which
    #           the one with the best selcrit must be selected
    #return value: index to that segtype, or 0 if none
    if (length(candidates) == 0) return(0)
    candSelcrit <- selcrit[candidates]
    candidates[which.max(candSelcrit)]
  } # segtypeBestSelcrit within checkF1

  compareFit <- function(newsegtype, oldsegtype) {
    #function within checkF1
    #newsegtype, oldsegtype: two indices into results
    #return value: TRUE if newsegtype is selected,
    #              FALSE if oldsegtype is selected
    (results$fracInvalid[newsegtype] <=
       max(0.05, 1.5 * results$fracInvalid[oldsegtype])) &&
      (results$Pvalue[newsegtype] >=
         min(0.01, 0.1 * results$Pvalue[oldsegtype]))
  } #compareFit within checkF1

  #subsetting from a big scores data.frame takes a lot of time
  #(approx 2 sec for each marker, in a file of ~26000 markers and 480 samples)
  #Therefore we subdivide it in batches of 100 markers
  #if not scores_long this probably is less relevant but doesn't do any harm
  batchsize <- 100
  batchnr <- 1
  while (batchsize * (batchnr-1) < length(mrknames)) {
    minmrk <- batchsize*(batchnr-1) + 1
    maxmrk <- min(length(mrknames), batchsize*batchnr)
    if (scores_long) {
      #batchscores <- subset(scores, subset=markername %in% mrknames[minmrk:maxmrk]) #only 260 * 2 sec
      batchscores <- scores[scores$MarkerName %in% mrknames[minmrk:maxmrk],] #only 260 * 2 sec
    } else {
      batchscores <- scores[scores$MarkerName %in% mrknames[minmrk:maxmrk],,
                            drop=FALSE]
    }
    bres <- createResultsdf(maxmrk - minmrk + 1) #bres: batchresults
    for (mrk in minmrk:maxmrk) {
      if (scores_long) {
        sc <- batchscores[batchscores$MarkerName == mrknames[mrk],] #very fast now
        if ("marker" %in% names(sc)) {
          mrknr <- sc$marker[1] #marker number in fitPoly input
        } else mrknr <- mrk
      } else {
        mrknr <- which(batchscores$MarkerName == mrknames[mrk])
        sc <- data.frame(SampleName=names(batchscores)[3:length(batchscores)],
                         geno=as.integer(batchscores[mrknr, 3:length(batchscores)]))
        mrknr <- batchscores$marker[mrknr] #marker number in fitPoly input
      }
      #get all the parent and ancestor genotypes in the order of the given vectors:
      parent.geno <- list()
      parent.geno[[1]] <- sc$geno[match(parent1, sc$SampleName)]
      parent.geno[[2]] <- sc$geno[match(parent2, sc$SampleName)]
      ancestors.geno <- sc$geno[match(ancestors, sc$SampleName)]
      F1.geno <- sc$geno[sc$SampleName %in% F1] #F1 does not have to be ordered

      shift <- 0
      if (is.data.frame(shiftmarkers)) {
        whichshift <- which(shiftmarkers$MarkerName == mrknames[mrk])
        if (length(whichshift) == 1) {
          shift <- shiftmarkers$shift[whichshift]
          if (shift != 0) {
            #changed: now all shifted dosages outside 0..ploidyF1 are set
            #to 0 or ploidyF1 (i.e. dosages that were separated are now merged),
            #before 20161130 they were set to NA
            #F1.geno <- ancestors.geno + shift
            #F1.geno[!(F1.geno %in% 0:ploidyF1)] <- NA
            F1.geno <- shiftdosages(F1.geno, shift, ploidyF1)
            if (shiftParents) {
              #also here new shiftdosages is used:
              parent.geno[[1]] <- shiftdosages(parent.geno[[1]], shift, ploidyF1)
              parent.geno[[2]] <- shiftdosages(parent.geno[[2]], shift, ploidyF1)
              ancestors.geno <- shiftdosages(ancestors.geno, shift, ploidyF1)
            }
          }
        }
      }

      #get the consensus genotype and conflict status of each of the two parents:
      par.geno <- c(0, 0) #integer
      par.lowconf.geno <- c(0, 0) #integer
      par.conflicts <- c(FALSE, FALSE) #logical
      par.NAfrac <- c(0.5, 0.5) #floatingpoint
      for (parent in 1:2) {
        #if (strict.parents) maxNAfrac <- 0.499 else maxNAfrac <- 1
        parresult <- getConsensusGeno(geno=parent.geno[[parent]],
                                      maxNAfrac=0.499,
                                      lowconf.NAfrac=0.751)
        par.geno[parent] <- parresult$geno
        par.lowconf.geno[parent] <- parresult$lowconf.geno
        par.conflicts[parent] <- parresult$conflict
        par.NAfrac[parent] <- parresult$NAfrac
      }

      #get the freq. distribution of the F1 samples:
      F1.naCount <- sum(is.na(F1.geno))
      F1.nobs <- length(F1.geno) - F1.naCount #the number of non-NA F1 samples
      #F1.counts <- hist(F1.geno, breaks=seq(-0.5, ploidyF1+0.5, by=1), plot=FALSE)$counts
      F1.counts <- tabulate(bin=F1.geno + 1, nbins=ploidyF1 + 1)
      #note that F1.counts[1] has the number of F1 samples with geno==0 etc !
      bestfit <- NA; bestParentfit <- NA;
      q <- rep(NA, length(qnames))
      results <- data.frame(segtype=names(seginfo),
                            fracInvalid=rep(1.0, length(seginfo)),
                            invalidP=rep(0.0, length(seginfo)),
                            Pvalue=rep(0.0, length(seginfo)),
                            matchParents=I(as.character(rep(NA,length(seginfo)))))
      if (F1.nobs > 10) {
        for (s in 1:length(seginfo)) {
          #does the current segtype match the parental genotypes?
          results$matchParents[s] <-
            getMatchParents(parGeno=par.geno, seginfoItem=seginfo[[s]])
          #calculate fraction invalid F1 scores given the current segtype:
          exp.geno <- seginfo[[s]]$expgeno #the genotypes that can occur given the segtype
          F1.invalid <-
            length(F1.geno[!(F1.geno %in% exp.geno)]) - F1.naCount #number of invalid scores
          results$fracInvalid[s] <- F1.invalid / F1.nobs
          #calculate chi-square P-value given the current segtype and invalidP value:
          if (F1.nobs - F1.invalid > 0) {
            #at least some F1 samples in expected peaks
            #(else Pvalue and invalidP remain 0)
            results$invalidP[s] <-
              pbinom(q=F1.nobs - F1.invalid,   #nr of valid scores
                     size=F1.nobs,             #total nr of scores
                     prob=1 - seg_invalidrate) #expected fraction of valid scores
            if (length(exp.geno) == 1) {
              results$Pvalue[s] <- 1.0
            } else {
              #more than one genotype expected, chisquare test
              if (sum(F1.counts[exp.geno + 1]) == 0) {
                #all observations are in invalid classes
                results$Pvalue[s] <- 0.0
                results$invalidP[s] <- 0.0
              } else {
                suppressWarnings(
                  results$Pvalue[s] <-
                    chisq.test(F1.counts[exp.geno+1],
                               p=seginfo[[s]]$freq[exp.geno+1])$p.value)
                #warnings probably generated by class counts < 5;
                #these do not worry us
                #as those P-values are likely to be very significant anyway
                #also the reported warnings are not informative
              }
            }
          }
        } # for (s in 1:length(seginfo))
        #select best fitting segtype based on fracInvalid and Pvalue
        #selcrit: selection criterion
        selcrit <- results$invalidP * results$Pvalue
        bestfit <- which.max(selcrit) #discards NA's
        if (bestfit == 0)
          stop(paste("Error in checkF1: bestfit is 0 at marker", mrknames[mrk]))

        #select which segtype best fits the parental genotype(s)
        ParentFit <- which(results$matchParents %in% c("Yes","OneOK","Unknown"))
        bestParentfit <- segtypeBestSelcrit(ParentFit)
        if (bestParentfit == 0)
          stop(paste("Error in checkF1: bestParentfit is 0 at marker", mrknames[mrk]))
        #bestParentfit is 0 if all segtypes have matchParents=="No" (should never occur)

        #Can we improve on bestParentfit by using low-confidence parental scores?
        lowParentFit <- NA
        lowc <- which(!is.na(par.lowconf.geno))
        if (length(lowc) > 0) {
          if (length(lowc) == 1) {
            if (is.na(par.geno[3 - lowc])) {
              #one parent missing, other lowconf
              low.segtypes <- seginfoSummary$segtypenr[
                seginfoSummary[, 2+lowc] == par.lowconf.geno[lowc]]
            } else {
              #one parent high conf, other lowconf score
              low.segtypes <- seginfoSummary$segtypenr[
                (seginfoSummary[, 2+lowc] == par.lowconf.geno[lowc]) &
                  (seginfoSummary[, 5-lowc] == par.geno[3-lowc])]
            }
            if (length(low.segtypes) > 0) {
              lowParentfit <- segtypeBestSelcrit(low.segtypes)
              #now we must check if lowParentfit is equal to or not
              #too much worse than bestParentfit: if so, we promote the
              #par.lowconf.geno to a true geno and lowParentfit to
              #bestParentfit
              if (compareFit(lowParentfit, bestParentfit)) {
                par.geno[lowc] <- par.lowconf.geno[lowc]
                bestParentfit <- lowParentfit
                results$matchParents[bestParentfit] <-
                  getMatchParents(parGeno=par.geno,
                                  seginfoItem=seginfo[[bestParentfit]])
              }
            }
          } else {
            #both parents have a low-conf score:
            #we should check for both scores if they are acceptable
            #(i.e. equal to, or not too much worse than bestfit)
            #If one of them is ok and not the other, we promote it and its lowParentfit;
            #if both pass and their combination is ok we promote both
            #and their combined lowParentfit;
            #if both pass but the combination does not pass we don't
            #promote either and keep bestfit as bestParentfit
            #First the combination:
            low.segtypes <- seginfoSummary$segtypenr[
              (seginfoSummary[, 3] == par.lowconf.geno[1]) &
                (seginfoSummary[, 4] == par.lowconf.geno[2])]
            lowParentfit <- segtypeBestSelcrit(low.segtypes)
            if (compareFit(lowParentfit, bestParentfit)) {
              #the combination of both par.lowconf.geno is acceptable
              par.geno <- par.lowconf.geno
              bestParentfit <- lowParentfit
              results$matchParents[bestParentfit] <-
                getMatchParents(parGeno=par.geno,
                                seginfoItem=seginfo[[bestParentfit]])
              #always "Yes"?
            } else {
              #the combination of both par.lowconf.geno is not acceptable,
              #check the two parents separately:
              lowParentfit <- c(0,0)
              for (p in 1:2) {
                low.segtypes <- seginfoSummary$segtypenr[
                  seginfoSummary[, 2+p] == par.lowconf.geno[p]]
                lowParentfit[p] <- segtypeBestSelcrit(low.segtypes)
                if (!compareFit(lowParentfit[p], bestParentfit))
                  lowParentfit[p] <- 0
              }
              p <- which(lowParentfit != 0)
              if (length(p) == 1) {
                #just one parent gives an acceptable lowParentfit, use that one:
                par.geno[p] <- par.lowconf.geno[p]
                bestParentfit <- lowParentfit[p]
                results$matchParents[bestParentfit] <-
                  getMatchParents(parGeno=par.geno,
                                  seginfoItem=seginfo[[bestParentfit]])
                #always "OneOK"?
              }
            } #two par.lowconf.geno, the combination was not acceptable
          } #two par.lowconf.geno
        } #at least one par.lowconf.geno

        #Note that we don't attempt here to resolve mismatches between the final
        #par.geno and bestParentfit, nor to fill in missing par.geno if the
        #bestParentfit segtype would allow only one solution.
        #That is done in compareProbes (and therefore also in writeDosagefile).
        #A further expansion to generate markers with all possible par.geno
        #allowed by the segtype and the known par.geno is done in
        #expandUnknownParents

        # generate a set of quality parameters of the marker in general and
        # the fit of bestParentfit:
        q <- calc_qall(Pvalue_threshold, fracInvalid_threshold, fracNA_threshold,
                       Pvalue=results$Pvalue[bestParentfit],
                       fracInvalid=results$fracInvalid[bestParentfit],
                       F1.NAfrac=F1.naCount / length(F1.geno),
                       matchParents=results$matchParents[bestParentfit],
                       bestfit=bestfit, bestParentfit=bestParentfit,
                       par.conflicts=par.conflicts,
                       par.NAfrac=par.NAfrac,
                       critweight=critweight,
                       parentsScoredWithF1=parentsScoredWithF1)
      } # F1.nobs>10

      #get the output line:
      bix <- mrk - minmrk + 1 #index to row in bres
      bres$m[bix] <- mrknr
      if (shift==0 || !append_shf) bres$MarkerName[bix] <- mrknames[mrk] else
        bres$MarkerName[bix] <- paste(mrknames[mrk], "shf", sep="_")
      bres[bix, 3:4] <- par.geno
      bres[bix, 5:(5+ploidyF1)] <- F1.counts
      bres$F1_NA[bix] <- F1.naCount
      startcol <- ploidyF1 + 7
      if (length(parent1) > 0) {
        bres[bix, startcol:(startcol-1+length(parent1))] <- parent.geno[[1]]
        startcol <- startcol + length(parent1)
      }
      if (length(parent2) > 0) {
        bres[bix, startcol:(startcol-1+length(parent2))] <- parent.geno[[2]]
        startcol <- startcol + length(parent2)
      }
      if (length(ancestors) > 0) {
        bres[bix, startcol:(startcol-1+length(ancestors))] <- ancestors.geno[[1]]
        startcol <- startcol + length(ancestors)
      }
      if (showAll) {
        bres[bix, startcol+seq(0, by=3, length.out=length(seginfo))] <-
          sprintf("%.4f", results$fracInvalid)
        bres[bix, startcol+seq(1, by=3, length.out=length(seginfo))] <-
          sprintf("%.4f", results$Pvalue)
        bres[bix, startcol+seq(2, by=3, length.out=length(seginfo))] <-
          results$matchParents
        startcol <- startcol + 3 * length(seginfo)
      }
      if (is.na(bestfit)) {
        bres[bix, startcol:(startcol+3)] <- rep(NA, 4)
      } else {
        bres[bix, startcol] <- results$segtype[bestfit]
        bres[bix, startcol+1] <- sprintf("%.4f", results$fracInvalid[bestfit])
        bres[bix, startcol+2] <- sprintf("%.4f", results$Pvalue[bestfit])
        bres[bix, startcol+3] <- results$matchParents[bestfit]
      }
      startcol <- startcol + 4
      if (is.na(bestParentfit)) {
        bres[bix, startcol:(startcol+3)] <- rep(NA, 4)
      } else {
        bres[bix, startcol] <- results$segtype[bestParentfit]
        bres[bix, startcol+1] <- sprintf("%.4f", results$fracInvalid[bestParentfit])
        bres[bix, startcol+2] <- sprintf("%.4f", results$Pvalue[bestParentfit])
        bres[bix, startcol+3] <- results$matchParents[bestParentfit]
      }
      startcol <- startcol + 4
      if (is.na(q[1])) {
        bres[bix, startcol:(startcol-1+length(q))] <- rep(NA, length(q))
      } else {
        bres[bix, startcol:(startcol-1+length(q))] <- sprintf("%.4f", q)
      }
      startcol <- startcol + length(q)
      if (is.data.frame(shiftmarkers)) bres[bix, startcol] <- shift
    } # for mrk
    if (batchnr == 1) {
      writeDatfile(bres, file=outfile)
    } else {
      writeDatfile(bres, file=outfile, append=TRUE, col.names=FALSE)
    }
    batchnr <- batchnr + 1
  } #while batch
  output <- readDatfile(outfile)
  output <- chk2integer(output)
  if (file.del) file.remove(outfile)
  invisible(output)
} #checkF1

#'@title Get substrings from the lefthand side
#'@description Get substrings from the lefthand side
#'@usage leftstr(x, n)
#'
#'@param x a character vector (or something having an as.character method)
#'@param n a single  number: if n>=0, the leftmost n characters of each element
#'of x are selected, if n<0 the (-n) rightmost characters are omitted
#'@return character vector with leftside substrings of x
#'@export
leftstr <- function(x, n) {
  #vectorized for x, not n
  #n>=0: take the leftmost n characters
  #n<0: take all but the rightmost (-n) characters
  if (n >= 0) {
    substr(x, 1, n) #automatically converts x to character if needed
  } else {
    #n < 0
    if (!is.character(x)) x <- as.character(x)
    len <- nchar(x)
    substr(x, 1, len + n)
  }
}

#'@title Get substrings from the righthand side
#'@description Get substrings from the righthand side
#'@usage rightstr(x, n)
#'
#'@param x a character vector (or something having an as.character method)
#'@param n a single  number: if n>=0, the rightmost n characters of each element
#'of x are selected, if n<0 the (-n) leftmost characters are omitted
#'@return character vector with rightside substrings of x
#'@export
rightstr <- function(x, n) {
  #vectorized for x, not n
  #n>=0: take the rightmost n characters
  #n<0: take all but the leftmost (-n) characters
  if (!is.character(x)) x <- as.character(x)
  len <- nchar(x)
  if (n >= 0) {
    start <- len - n + 1
    substr(x, start, len)
  } else {
    #n < 0
    substr(x, -n + 1, len)
  }
}

chk2integer <- function(chk) {
  # chk:    a data frame as returned by checkF1
  # result: the same data frame, but with all columns from Parent1 to
  #         last ancestor converted to integer (as these are sometimes factors
  #         with levels not equal to values after reading checkF1 from file)

  #find out the ploidy:
  F1cap <- names(chk)[which(names(chk) == "F1_NA") - 1]
  ploidyF1 <- as.integer(sub("F1_", "", F1cap))
  #find the column range to convert to integers:
  firstcol <- which(names(chk) == "parent1")
  #determine which column has the last parent or ancestor sample:
  #if 3 columns for each segtype are present (showAll TRUE), then
  #the first column after the last sample has name frqInvalid_<segtype> (where
  #<segtype> depends on the ploidy).
  #else (showAll FALSE) the next column has name "bestfit", followed
  #by "frqInvalid_bestfit"
  lastcol <- which(leftstr(names(chk), 10) == "frqInvalid")[1]
  if (rightstr(names(chk)[lastcol], 7) == "bestfit") {
    lastcol <- lastcol - 2
  } else {
    lastcol <- lastcol - 1
  }
  if (length(firstcol) != 1 || is.na(lastcol) ||
      lastcol - firstcol < 3 + ploidyF1) {
    stop("chk2integer: column names of chk incorrect")
  }
  for (i in firstcol:lastcol) {
    suppressWarnings(chk[[i]] <- as.integer(as.character(chk[[i]])))
    #warnings caused by NAs
  }
  chk
} #chk2integer

getPargeno <- function(P1consensus, P2consensus,
                       segtype, matchParents=NULL, seginfo,
                       allparentscores=FALSE) {
  #P1 consensus, P2consensus: the consensus scores of the two parents
  #segtype: the name of the selected segtype (selfit or (in the new polyploid
  #         version) bestParentfit from checkF1;
  #         should occur in names(seginfo); OR
  #         the number of the segtype in seginfo
  #matchParents: "Yes" if both consensus parental dosages match segtype
  #              (bestParentfit.matchParents from checkF1())
  #seginfo: a list as returned by calcSegtypeInfo(), with the par.dosages
  #         limited to those fitting the auto, allo and mixed parameters used to
  #         calculate checkF1
  #allparentscores: if FALSE (default) and more than one combination would be
  #         possible, c(NA, NA) is returned. If TRUE then a matrix with
  #         one row for each possibility and two columns for the 2 parental
  #         dosages is returned.
  #return value: an integer vector of two elements: the inferred dosages
  #         of parent 1 and 2; or (if allparentscores is TRUE and there are
  #         multiple combinations) a 2-column matrix with one row per
  #         parental combination

  # we try to fill in the expected parental dosage
  # based on the segtype and the consensus of the observed parental scores:
  parent.sc <- c(P1consensus, P2consensus)
  if (is.character(segtype)) {
    segtype <- which(names(seginfo) == segtype)
    if (length(segtype) != 1) stop("internal error in getPargeno")
  } #else segtype is already the number of the segtype
  exp.par <- seginfo[[segtype]]$pardosage #already selected to match
  #                                           auto/allo/mixed
  # now get the final parental dosage from consensus and segtype:
  if (nrow(exp.par) == 1) {
    # both parents must have the same genotype so we can fill these in,
    # irrespective of the observed parental genotype and the matchParent status:
    parent.sc <- exp.par[1,]
  } else {
    if (is.null(matchParents)) {
      matchParents <- getMatchParents(parGeno=parent.sc,
                                      seginfoItem=seginfo[[segtype]])
    }
    if (matchParents != "Yes") {
      # there is more than one possible parental combination.
      # if matchParents=="Yes" we don't need to do anything;
      # else (matchParents= Unknown, OneOK or No) we can only fill in the parents
      # if one parent matches one combination and the other matches none
      pmatch <- rep(NA, 2)
      for (p in 1:2) pmatch[p] <- match(parent.sc[p], exp.par[, p])
      if (sum(!is.na(pmatch)) == 1) {
        #one parent matches one expected parental combination and the other
        #doesn't match any
        r <- min(pmatch, na.rm=TRUE)
        parent.sc <- exp.par[r,]
      }
      else {
        # two possibilities:
        # - both parents each match a different row of exp.par
        # - none of the parents matches a row of exp.par
        # (if both parents would match the same row, matchParents would be Yes)
        # In both cases we cannot assign the parental genotypes:
        if (allparentscores) {
          parent.sc <- exp.par #a 2-column matrix with 2 or more rows
        } else parent.sc <- rep(NA, 2)
      }
    } #matchParents!="Yes"
  } #nrow(exp.par) == 1 else
  parent.sc
} #getPargeno

trim_shf <- function(mrknames) {
  #trim _shf from the marker names:
  if (is.factor(mrknames)) mrknames <- as.character(mrknames)
  if (is.character(mrknames)) {
    shf <- rightstr(mrknames, 4) == "_shf"
    mrknames[shf] <- leftstr(mrknames[shf], -4)
  }
  mrknames
} #trim_shf

makeCombscoresDf <- function(parent1, parent2, F1, ancestors=character(0),
                             other=character(0), nrow){
  #create an empty data frame with the correct size and column names and types
  #parent1 and parent2: character vectors with sampleIDs for parents 1 and 2
  #        (0, 1 or multiple samples per parent allowed, 0 samples are
  #        specified as character(0))
  #F1: character vector with sampleIDs for F1 individuals (one sample per
  #        F1 individual)
  #ancestors: character vector of other ancestor samples listed in the
  #        checkF1 output; character(0) if none
  #other: character vector with sampleIDs of any other samples; character(0) if none
  #nrow: the number of rows to create

  ncol <- length(parent1) + length(parent2) + length(ancestors) + 2 +
    length(F1) + length(other)
  scores <- matrix(rep(NA_integer_, nrow * ncol), nrow=nrow)
  df <- data.frame(
    MarkerName=rep("", nrow),
    segtype=rep("", nrow),
    scores,
    stringsAsFactors=FALSE
  )
  names(df) <- c("MarkerName", "segtype", parent1, parent2, ancestors,
                 "parent1", "parent2", F1, other)
  df
} #makeCombscoresDf

# writeDosagefile ***********************************************************
#'@title Write a file with segregation types and dosage scores
#'@description Write a file with for each marker the segregation type and the
#'dosage scores of the parental and ancestor samples, the parental consensus
#'dosages and the F1 samples.
#'@usage writeDosagefile(chk, scores, parent1, parent2, F1,
#'ancestors=character(0), other=character(0),
#'polysomic=TRUE, disomic=FALSE, mixed=FALSE,
#'ploidy, ploidy2, shiftParents, scorefile)
#'
#'@param chk data frame as returned by checkF1
#'@param scores data frame as read from the scores file produced by function
#'fitMarkers of package fitPoly (or a subset with at least columns
#'MarkerName, SampleName and geno)
#'@param parent1 character vector with the sample names of parent 1
#'@param parent2 character vector with the sample names of parent 2
#'@param F1 character vector with the sample names of the F1 individuals
#'@param ancestors character vector with the sample names of any other
#'ancestors
#'@param other other samples that should be treated like the F1
#'@param polysomic TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param disomic TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param mixed TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param ploidy the ploidy of parent 1 (must be even, 2 (diploid) or larger)
#'and the same as used by checkF1 to calculate the chk data frame
#'@param ploidy2 the ploidy of parent 2. If omitted it is assumed to be equal
#'to ploidy. Should be the same as used by checkF1 to calculate the chk
#'data frame
#'@param shiftParents if there is a column shift in chk the F1 dosages will be
#'shifted. If shiftParents is TRUE the parents and ancestors will be shifted
#'together with the F1, if FALSE only the F1 will be shifted in that case.\cr
#'If shiftParents is missing
#'or NA it will be set to TRUE except if ploidy2 != ploidy:
#'in that case this will result in an error (because it
#'may be that the parents are not genotyped or scored together with the
#'F1, the user should specify explicitly what to do)
#'@param scorefile filename for tab-separated text file with the dosages,;
#'if NA no file is written. For details of the contents see the return value
#'
#'@return A data frame with columns:
#'\itemize{
#'\item{MarkerName: the name of the marker}
#'\item{segtype: the segregation type}
#'\item{parental and ancestor samples: the dosages of those samples}
#'\item{parent1: the consensus dosage for parent1 as determined by checkF1}
#'\item{parent2: the consensus dosage for parent2 as determined by checkF1}
#'\item{F1 samples: the dosages for those samples}
#'\item{other samples: the dosages for those samples}
#'}
#'@export
writeDosagefile <- function(chk,
                            scores,
                            parent1,
                            parent2,
                            F1,
                            ancestors=character(0),
                            other=character(0),
                            polysomic=TRUE, disomic=FALSE, mixed=FALSE,
                            ploidy, ploidy2,
                            shiftParents,
                            scorefile) {
  #writeScoresfile uses compareProbes to just write the scores and segtypes
  #in the same format as compareProbes. It treats each marker separately
  #without comparing or combining two probes for the same SNP, but it takes
  #into account any shifts (from column shift in chk, if present).
  #It does so by calling compareProbes with probe.suffix=NA.
  if (missing(ploidy2)) ploidy2 <- ploidy
  if (missing(shiftParents)) shiftParents <- NA
  result <- compareProbes(
    chk=chk,
    scores=scores,
    probe.suffix=NA,
    fracdiff.threshold=0.04,
    parent1=parent1,
    parent2=parent2,
    F1=F1,
    ancestors=ancestors,
    other=other,
    polysomic=polysomic, disomic=disomic, mixed=mixed,
    ploidy=ploidy, ploidy2=ploidy2,
    shiftParents=shiftParents,
    qall_flavor="qall_mult",
    compfile=NA, combscorefile=scorefile)
  invisible(result$combscores)
} #writeDosagefile

# compareProbes ***************************************************************
#'@title Compare and combine results from two probes for the same SNP
#'@description On Affymetrix Axiom arrays it is possible to have two probes
#'interrogating the same SNP position. This function compares the dosage scores
#'and checkF1 results of the two probes; if they are sufficiently similar
#'a new marker is generated combining the results of the two probes. A dosage
#'file with the data for the separate probes as well as the combined markers
#'is written with the same format as writeDosagefile, and also a file
#'summarizing the comparison results.
#'
#'@usage compareProbes(chk, scores,
#'probe.suffix=c("P","Q","R"), fracdiff.threshold=0.04,
#'parent1, parent2, F1, ancestors=character(0), other=character(0),
#'polysomic=TRUE, disomic=FALSE, mixed=FALSE,
#'ploidy, ploidy2, qall_flavor="qall_mult", shiftParents,
#'compfile, combscorefile)
#'
#'@param chk data frame as returned by checkF1, or a subset with at least
#'columns markername, parent1, parent2 (the consensus parental genotypes),
#'the columns for the samples specified by parameters parent1, parent2 and
#'ancestors, and bestParentfit, and containing only rows with selected markers.
#'If a column  with a name as specified by qall_flavor (see below) is present
#'this will be written to file compfile, but it is not used: any selection of
#'marker based on qall (or other) must have been made beforehand, and the
#'rows for the unwanted markers must have been deleted from the chk data frame.\cr
#'For each marker*probe combination there may be an unshifted version
#'(shift==0), a shifted one (shift!=0), both, or neither.
#'\cr
#'If a column shift is present it will be used to shift the dosages
#'(and their P-values with them).
#'If some markernames end in "_shf" this part will be ignored, but the
#'P and Q suffixes (or alternatives as specified by probe.suffix) are
#'required to distinguish the two probes.
#'@param scores data frame as read from the scores file produced by function
#'fitMarkers of package fitPoly, with at least columns MarkerName,
#'SampleName, P0 .. P<ploidyF1> and geno (where <ploidyF1> is the ploidy of the
#'F1, i.e. the average of parental ploidy and ploidy2).\cr
#'If the F1 parents are scored separately, their rows should be added to the
#'scores data.frame for the F1 samples. If their ploidy is different from the
#'F1, the number of their P columns must be adjusted. The P data of the parents
#'are not used, they may all be set to NA.
#'@param probe.suffix a 3-item character vector specifying the suffixes of the
#'marker names that distinguish the two probes. The first two items identify
#'the two probes; the third item is used to indicate a new marker combining
#'the data from both probes. The three items must be different and have the
#'same number of characters default is c("P","Q","R")
#'@param fracdiff.threshold if more than this fraction of F1 scores differs
#'between probes, don't combine
#'@param parent1 character vector with the sample names of parent 1
#'@param parent2 character vector with the sample names of parent 2
#'@param F1 character vector with the sample names of the F1 individuals
#'@param ancestors character vector with the sample names of any other
#'ancestors
#'@param other other samples that should be treated like the F1
#'@param polysomic TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param disomic TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param mixed TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param ploidy the ploidy of parent 1 (must be even, 2 (diploid) or larger),
#'and the same as used by checkF1 to calculate the chk data frame
#'@param ploidy2 the ploidy of parent 2. If omitted it is assumed to be equal
#'to ploidy. Should be the same as used by checkF1 to calculate the chk
#'data frame
#'@param qall_flavor which quality parameter column must be shown in compfile,
#'default "qall_mult". If no quality data are wanted, specify "".
#'@param shiftParents if there is a column shift in chk the F1 dosages will be
#'shifted. If shiftParents is TRUE the parents and ancestors will be shifted
#'together with the F1, if FALSE only the F1 will be shifted in that case.\cr
#'If shiftParents is missing
#'or NA it will be set to TRUE except if ploidy2 != ploidy:
#'in that case this will result in an error (because it
#'may be that the parents are not genotyped or scored together with the
#'F1, the user should specify explicitly what to do)
#'@param compfile filename for tab-separated text file summarizing the
#'comparison results; if NA no file is written. For details of the contents see
#'the return value, component compstat
#'@param combscorefile filename for tab-separated text file with the dosages;
#'if NA no file is written. For details of the contents see the return value,
#'component combscores
#'
#'@details A combined marker is made in each case that a version of each of the
#'two probe markers is present and they are sufficiently similar. This means
#'that they have been assigned the same bestParentfit segregation type by
#'checkF1, and that the frequency of conflicting scores over all samples is
#'not more than fracdiff.threshold. The combined marker will have NA scores for
#'individuals where both probe markers are missing, the one available score if
#'it is scored for only one of the two probe markers or both scores are equal,
#'and the score with the highest P-value if the scores for both probe markers
#'are unequal.\cr
#'Any single-probe markers in chk that do not have a bestParentfit segregation
#'type are ignored and will not affect or appear in the output.
#'
#'@return A list with two components, compstat and combscores.\cr
#'compstat is a data frame with columns:
#'\itemize{
#'\item{MarkerName: name of the SNP marker. If a column shift is present in
#'data.frame chk, unshifted and shifted markers will get a "n" or "s" suffixed
#'to the MarkerName}
#'\item{segtypeP and segtypeQ: the segtype assigned by checkF1 to the first and
#'second probe}
#'\item{qallP and qallQ: the quality scores specified by parameter qall_flavor,
#'assigned by checkF1 to the two probes}
#'\item{countP and countQ: the number of versions of each of the probes
#'(0, 1, or 2, depending on whether a shifted, unshifted or both versions were
#'present)}
#'\item{countR: the number of combinations made of versions of the two probe
#'markers (one for each combination of a version of each of the two probe
#'markers, if they match well enough - see details)}
#'}
#'If the chk data frame contains a column shift, there are separate columns for
#'the non-shifted and shifted P and Q probe markers (suffix Pn, Ps, Qn, Qs),
#'and four columns for the R markers (suffix Rnn, Rns, Rsn, Rss where the first
#'n/s indicates if the P was non-shifted or shifted and the second n/s for the
#'Q probe.
#'combscores is a data frame with columns:
#'\itemize{
#'\item{MarkerName: the name of the marker. If the chk data frame contains a
#'column shift, the P and Q marker names are suffixed with n or s, and the R
#'marker names with nn, ns, sn, ss as described above}
#'\item{segtype: the segregation type}
#'\item{parental and ancestor samples: the dosages of those samples}
#'\item{parent1: the consensus dosage for parent1 as determined by checkF1}
#'\item{parent2: the consensus dosage for parent2 as determined by checkF1}
#'\item{F1 samples: the dosages for those samples}
#'\item{other samples: the dosages for those samples}
#'}
#'@export
compareProbes <- function(chk,
                          scores,
                          probe.suffix=c("P","Q","R"),
                          fracdiff.threshold=0.04,
                          parent1,
                          parent2,
                          F1,
                          ancestors=character(0),
                          other=character(0),
                          polysomic=TRUE, disomic=FALSE, mixed=FALSE,
                          ploidy, ploidy2,
                          qall_flavor="qall_mult",
                          shiftParents,
                          compfile, combscorefile) {
  #If probe.suffix is not NA but a set of 3 strings,
  #this function compares the results of the two probes for the same SNP (from
  #an Axiom array).
  #If probe.suffix is NA, this function only writes the scores from the
  #scores data frame and the segtypes from chk to combscorefile.
  #This option is used by function writeScoresfile, to avoid duplication a
  #lot of code and to obtain a scores file of the same format.


  #first some checks:
  noprobes <- length(probe.suffix) == 1 && is.na(probe.suffix)
  if (noprobes) funcname <- "writeScoresfile" else funcname <- "compareProbes"
  if (!noprobes &&
      (length(probe.suffix) != 3 || sum(is.na(probe.suffix)) > 0 ||
       max(nchar(probe.suffix)) != min(nchar(probe.suffix)) ||
       sum(abs(match(probe.suffix,probe.suffix) - c(1,2,3))) > 0))
    stop("compareProbes: probe.suffix invalid")
  if (ploidy %% 2 != 0)
    stop(paste(funcname, ": odd parental ploidy not allowed", sep=""))
  if (missing(ploidy2) || is.na(ploidy2)) ploidy2 <- ploidy else
    if (ploidy2 %% 2 != 0)
      stop(paste(funcname, ": odd parental ploidy2 not allowed", sep=""))
  ploidyF1 <- (ploidy + ploidy2) / 2
  segtypecol <- which(tolower(names(chk)) == "segtype")
  if (length(segtypecol) != 1)
    segtypecol <- which(tolower(names(chk)) == "selfit")
  if (length(segtypecol) != 1)
    segtypecol <- which(tolower(names(chk)) == "bestparentfit")
  if (length(segtypecol) != 1)
    stop(paste(funcname,
               ": chk should have one column segtype or bestParentfit",
               sep=""))
  if (!is.character(chk[, segtypecol]) )
    chk[, segtypecol] <- as.character(chk[, segtypecol])
  if (!is.character(chk$MarkerName))
    chk$MarkerName <- as.character(chk$MarkerName)
  #delete all rows without a segtype (mostly those that are not fitted by fitPoly):
  chk <- chk[!is.na(chk[, segtypecol]) & chk[, segtypecol] != "",]
  qallcol <- which(tolower(names(chk)) == qall_flavor)
  if (length(qallcol) > 1) qallcol <- qallcol[1]
  shiftspresent <- FALSE
  shiftcol <- which(tolower(names(chk)) == "shift")
  if (length(shiftcol) > 1) {#none is also allowed
    stop(paste(funcname, ": chk should have at most one column 'shift'"), sep="")
  } else if (length(shiftcol) == 0) {
    chk$shift <- rep(0, nrow(chk))
    shiftcol <- which(tolower(names(chk)) == "shift")
    shiftParents <- TRUE #if no shifts, shiftParents doesn't make a difference
  } else {
    #one existing column shift
    if (missing(shiftParents) || is.na(shiftParents)) {
      if (ploidy == ploidy2) shiftParents <- TRUE else
        stop(paste(funcname,
                   ": if ploidy2 != ploidy, shiftParents must be specified (FALSE/TRUE)",
                   sep=""))
      #note that this is not required if there is no existing column shift
    }
    if (anyNA(chk[, shiftcol]))
      stop(paste(funcname,
                 ": column 'shift' in chk may not contain missing values",
                 sep=""))
    shiftspresent <- TRUE
  }
  #check if all samples are present in scores and if their numbers match with chk:
  progenitors <- c(parent1, parent2, ancestors)
  progenitors <- progenitors[!is.na(progenitors)]
  if (length(progenitors) == 0) shiftParents <- FALSE # no progenitors to shift
  allsamp <- c(progenitors, F1, other)
  if (shiftParents) shfsamp <- rep(TRUE, length(allsamp)) else
    shfsamp <- c(rep(FALSE, length(progenitors)),
                 rep(TRUE, length(c(F1, other))))
  missamp <- allsamp[!(allsamp %in% unique(as.character(scores$SampleName)))]
  if (length(missamp) > 0)
    stop(paste(funcname, ": some samples not in scores:",
               paste(missamp, collapse=" ")))
  missamp <- progenitors[!(progenitors %in% names(chk))]
  if (length(missamp) > 0)
    stop(paste(funcname, ": some parent or ancestor samples not in chk:",
               paste(missamp, collapse=" ")))
  if (length(F1) != sum(chk[1, 5:(6+ploidyF1)]))
    stop(paste(funcname, ": F1 has", length(F1), "samples but in chk",
               sum(chk[1, 5:(6+ploidyF1)]), "values counted"))
  chk$SNPname <- trim_shf(chk$MarkerName)
  if (anyNA(match(chk$SNPname, unique(as.character(scores$MarkerName)))))
    stop(paste(funcname, ": not all markers from chk appear in scores", sep=""))

  if (noprobes) {
    #writeDosagefile
    chk$probe <- rep("", nrow(chk))
  } else {
    mnl <- nchar(chk$SNPname)
    chk$probe <- substr(chk$SNPname, mnl - nchar(probe.suffix[1]) + 1, mnl) #P or Q
    chk$SNPname <- substr(chk$SNPname, 1, mnl - nchar(probe.suffix[1])) #without P or Q
  }
  #now chk has some additional columns:
  #chk$shift: if it was not present, now has all 0's - the amount each marker is shifted
  #chk$probe: the probe suffix (P or Q by default)
  #probchk$SNPname: the original SNPname, without probe suffix or _shf
  #Note that the _shf suffix is not used, the shift value (0 vs not 0)
  #determines the shift status
  if (!noprobes && sum(!(chk$probe %in% probe.suffix[1:2])) > 0)
    stop("compareProbes: not all marker names have a specified probe suffix")
  if (!noprobes && sum(nchar(chk$SNPname) == 0) > 0)
    stop("compareProbes: not all marker names have a specified probe suffix")
  #make sure there is no locked file that will prevent writing the output:
  if (!is.na(compfile) && !checkFilename(compfile))
    stop(paste(funcname,": compfile not valid or not writable", sep=""))
  if (!is.na(combscorefile) && !checkFilename(combscorefile))
    stop(paste(funcname, ": combscorefile not valid or not writable", sep=""))

  if (is.factor(chk$parent1)) chk$parent1 <- as.integer(as.character(chk$parent1))
  if (is.factor(chk$parent2)) chk$parent2 <- as.integer(as.character(chk$parent2))
  if (is.factor(chk[,shiftcol]))
    chk[,shiftcol] <- as.integer(as.character(chk[,shiftcol]))
  if (is.factor(parent1)) parent1 <- as.character(parent1)
  if (is.factor(parent2)) parent2 <- as.character(parent2)
  if (is.factor(F1)) F1 <- as.character(F1)
  if (is.factor(ancestors)) ancestors <- as.character(ancestors)
  if (is.factor(other)) other <- as.character(other)

  snpnames <- sort(unique(chk$SNPname)) #SNPname is character

  #note that the two probes don't have the same set of SNPs, as some SNPs
  #may be rejected by fitPoly for one probe but not for the other,
  #and, if some markers are shifted, again this will not be the case for all.
  #we now make some arrays that for all SNPs
  #contain some data for each probe/shift combination, or NA where a probe was
  #rejected or the shifted or nonshifted situation did not apply
  segtype <- array(NA, dim=c(length(snpnames), 2, 2),
                   dimnames=list(
                     snpname=snpnames,
                     probe=c("probe1","probe2"),
                     shift=c("false","true")))
  chkshift <- segtype #for each marker/probe/shift the actual amount of shift
  qall <- segtype #for each marker/probe/shift the qall of the selected segtype (if any)
  cons.parent1 <- segtype #for each marker/probe/shift the consensus parent1 score (if any)
  cons.parent2 <- segtype #for each marker/probe/shift the consensus parent2 score (if any)
  Rsegtype <- segtype #for each comb of probe1 shifted/nonshifted and probe2 shifted/nonshifted
  #                    the combined segtype (R=P&Q) or NA
  #we fill 5 of these 6 3D arrays with data from chk, except (for the moment) Rsegtype:
  for (sh in 1:2) {
    if (sh == 1) sel.sh <- chk$shift == 0 else
      sel.sh <- chk$shift != 0
    if (noprobes)proberange <- 1 else proberange <- 1:2
    for (prb in proberange) {
      if (noprobes) {
        prbdat <- chk[sel.sh, ]
      } else {
        prbdat <- chk[chk$probe == probe.suffix[prb] & sel.sh, ]
      }
      rows <- match(prbdat$SNPname, snpnames)
      segtype[rows, prb, sh] <- prbdat[, segtypecol]
      chkshift[rows, prb, sh] <- prbdat[, shiftcol]
      if (length(qallcol) == 1) qall[rows, prb, sh] <- prbdat[, qallcol]
      cons.parent1[rows, prb, sh] <- prbdat$parent1
      cons.parent2[rows, prb, sh] <- prbdat$parent2
    }
  } #for sh
  seginfo <- selSegtypeInfo(calcSegtypeInfo(ploidy, ploidy2),
                            polysomic, disomic, mixed,
                            selfing=checkSelfing(parent1, parent2))
  if (sum(is.na(match(unique(chk[, segtypecol]),names(seginfo)))) > 0)
    stop("compareProbes: chk contains segtypes not expected with the settings\nof polysomic/disomic/mixed/ploidy/ploidy2")
  scores <- scores[scores$SampleName %in% allsamp,]
  #create an initial empty data frame to hold the combined scores;
  #this will be increased if needed
  combscores <- makeCombscoresDf(parent1=parent1, parent2=parent2, F1=F1,
                                 ancestors=ancestors, other=other,
                                 nrow=nrow(chk))
  combrow <- 0
  #subsetting from a big scores data.frame takes a lot of time
  #(approx 2 sec for each marker, in a file of ~26000 markers and 480 samples)
  #Therefore we subdivide it in batches of 200 SNPs
  batchsize <- 200
  batchcount <- (length(snpnames)-1) %/% batchsize + 1
  batchnr <- 1
  batchscores <- list()
  while (batchsize*(batchnr-1) < length(snpnames)) {
    cat(paste("batch", batchnr, "of", batchcount, "\n"))
    minmrk <- batchsize * (batchnr-1) + 1
    maxmrk <- min(length(snpnames), batchsize*batchnr)
    batchmarkers <- snpnames[minmrk:maxmrk]
    if (!noprobes) {
      batchmarkers <- c(paste(batchmarkers, probe.suffix[1], sep=""),
                        paste(batchmarkers, probe.suffix[2], sep=""))
      #note that for all snpnames both a P and a Q markername are created,
      #even if only one of the two in scores
    }
    batchscores <- scores[scores$MarkerName %in% batchmarkers, ]
    for (mrk in minmrk:maxmrk) {
      #for the current marker, list sc will hold the scores:
      #sc[[1]] and sc[[2]] are lists for probes P and Q (and sc[[3]] for R?)
      sc <- list()
      #the following 3D arrays are similar to the previously defined 6 3D arrays
      #except that here the 1st dim is samples while earlier it was markers
      progen.sc <- array(NA, dim=c(length(progenitors), 2, 2), #all progenitor samples scores
                         dimnames=list(
                           sample=progenitors,
                           probe=c("probe1","probe2"),
                           shift=c("FALSE","TRUE")))
      F1.sc <- array(NA, dim=c(length(F1)+length(other), 2, 2), #all F1 + other samples scores
                     dimnames=list(
                       sample=c(F1, other),
                       probe=c("probe1","probe2"),
                       shift=c("FALSE","TRUE")))
      parent.sc <- array(NA, dim=c(2, 2, 2), #both consensus parental scores
                         dimnames=list(
                           sample=c("parent1", "parent2"),
                           probe=c("probe1","probe2"),
                           shift=c("FALSE","TRUE")))
      P0col <- which(names(scores) == "P0")
      if (length(P0col) != 1)
        stop("compareProbes: scores should contain columns P0 .. P<ploidy>")
      for (prb in 1:2) if (any(!is.na(segtype[mrk, prb,]))) {
        #unshifted and/or shifted version present for this marker/probe
        if (noprobes) {
          mrkname_scores <- snpnames[mrk]
        } else {
          mrkname_scores <- paste(snpnames[mrk], probe.suffix[prb], sep="")
        }
        sc[[prb]] <- list()
        #sc[[prb]][[1]] gets the unshifted score data for the population samples
        #               i.e. all columns from the fitPoly output including
        #               P0..P<ploidyF1>  and geno,
        #               and all samples in parent1, parent2, ancestors, F1, other
        sc[[prb]][[1]] <- batchscores[batchscores$MarkerName == mrkname_scores &
                                        batchscores$SampleName %in% allsamp, ]
        sc[[prb]][[1]] <-
          sc[[prb]][[1]][match(allsamp, sc[[prb]][[1]]$SampleName),] #sorted
        #if this probe is shifted, sc[[prb]][[2]] gets the shifted score data:
        if (!is.na(segtype[mrk, prb, 2])) {
          sc[[prb]][[2]] <- sc[[prb]][[1]]
          #shift the geno values:
          sv <- chkshift[mrk, prb, 2] #shift value, positive or negative, not 0
          # The shift in the next lines must not be applied to progenitors
          # if shiftParents is FALSE; this is done with shfsamp
          sc[[prb]][[2]]$geno[shfsamp] <- sc[[prb]][[2]]$geno[shfsamp] + sv
          #old: geno shifted out of 0.. ploidyF1 made missing:
          #inval <- shfsamp & !(sc[[prb]][[2]]$geno %in% 0:ploidyF1)
          #sc[[prb]][[2]]$geno[inval] <- NA
          #new: geno shifted out of 0..ploidyF1 made equal to 0 or ploidyF1:
          inval <- shfsamp & sc[[prb]][[2]]$geno < 0
          sc[[prb]][[2]]$geno[inval] <- 0
          inval <- shfsamp & sc[[prb]][[2]]$geno > ploidyF1
          sc[[prb]][[2]]$geno[inval] <- ploidyF1

          #shift the P-value columns (only to adjust the Pmax)
          if (sv > 0) {
            #old: discard the P's of classes shifted beyond ploidyF1
            #sc[[prb]][[2]][shfsamp, P0col+(sv:ploidyF1)] <-
            #  sc[[prb]][[2]][shfsamp, P0col+(0:(ploidyF1-sv))]
            #new: sum all P's for classes that are merged into dosage==ploidy:
            sc[[prb]][[2]][shfsamp, P0col+ploidyF1] <-
              sum(sc[[prb]][[2]][shfsamp, P0col+((ploidyF1-sv):ploidyF1)])
            #and shift all other existing P values:
            if (sv < ploidy)
              sc[[prb]][[2]][shfsamp, P0col+(sv:(ploidyF1-1))] <-
                sc[[prb]][[2]][shfsamp, P0col+(0:(ploidyF1-1-sv))]
            #and set the P's for classes that are new to 0:
            sc[[prb]][[2]][shfsamp ,P0col+(0:(sv-1))] <- 0
          } else {
            #chkshift < 0
            #old: discard the P's of classes shifted beyond 0
            #sc[[prb]][[2]][shfsamp, P0col+(0:(ploidyF1+sv))] <-
            #  sc[[prb]][[2]][shfsamp, P0col+((-sv):ploidyF1)]
            #new: sum all P's for classes that are merged into dosage==0:
            sc[[prb]][[2]][shfsamp, P0col] <-
              sum(sc[[prb]][[2]][shfsamp, P0col+(0:(-sv))])
            #and shift all other existing P values:
            if ((-sv) < ploidyF1)
              sc[[prb]][[2]][shfsamp, P0col+(1:(ploidyF1+sv))] <-
                sc[[prb]][[2]][shfsamp, P0col+((1-sv):ploidyF1)]
            #and set the P's for classes that are new to 0:
            sc[[prb]][[2]][shfsamp, P0col+(ploidyF1+sv+1):ploidyF1] <- 0 #NA
          }
          suppressWarnings(
            #the warning occur when a sample is not genotyped for this marker
            #(its ratio was set to NA before the fitPoly run,
            #perhaps based on a too low signal intensity).
            #this results in sc[[prb]][[2]]$maxP being assigned -Inf;
            #that is not a problem later on
            sc[[prb]][[2]]$maxP <-
              apply(sc[[prb]][[2]][P0col+(0:ploidyF1)], 1, max, na.rm=TRUE)
          )
        }
        #Now we have a list sc of data frames for the scores per probe (P,Q=1,2),
        #without (1) or with (2) shift.
        #Note that only the P-value columns and the geno value are shifted,
        #not the maxgeno; and maxP is the max of the remaining P-values

        #next we extract geno values for subsets of samples
        #and write the markers:
        for (sh in 1:2) {
          if (!is.na(segtype[mrk, prb, sh])) {
            mrkname_out <- mrkname_scores
            if (shiftspresent) mrkname_out <-
                paste(mrkname_out, c("n", "s")[sh], sep="")
            #progen.sc[, prb, sh] <-
            #    sc[[prb]][[sh]]$geno[sc[[prb]][[sh]]$SampleName %in% progenitors]
            #adapted, as (for triploids) parents are not always scored
            progensc <- sc[[prb]][[sh]][sc[[prb]][[sh]]$SampleName %in% progenitors,]
            progen.sc[match(progensc$SampleName,progenitors), prb, sh] <-
              progensc$geno
            F1.sc[, prb, sh] <-
              sc[[prb]][[sh]]$geno[sc[[prb]][[sh]]$SampleName %in% c(F1, other)]
            parent.sc[, prb, sh] <- getPargeno(cons.parent1[mrk, prb, sh],
                                               cons.parent2[mrk, prb, sh],
                                               segtype=segtype[mrk, prb, sh],
                                               seginfo=seginfo)

            #now we have all data for this mrk / prb / shift combination,
            #we write it to the combscores data frame:
            combrow <- combrow+1
            if (combrow > nrow(combscores))
              combscores <- rbind(combscores,
                                  makeCombscoresDf(parent1=parent1, parent2=parent2,
                                                   F1=F1, ancestors=ancestors,
                                                   other=other, nrow=5000)
                                 )
            combscores$MarkerName[combrow] <- mrkname_out
            combscores$segtype[combrow] <- segtype[mrk, prb, sh]
            combscores[combrow, 3:length(combscores)] <-
              c(progen.sc[, prb, sh],
                parent.sc[, prb, sh],
                F1.sc[, prb, sh])
          }
        } #for sh
      } #for prb if !is.na(segtype[mrk, prb])

      #now we have written the scores per probe/sh to combscorefile for current marker mrk
      #finally we check for which markers we can produce combined scores for both probes:
      if (!noprobes) {
        numcomb <- 0
        for (sh1 in 1:2) for (sh2 in 1:2)
          if (!is.na(segtype[mrk, 1, sh1]) && !is.na(segtype[mrk, 2, sh2]) &&
              segtype[mrk, 1, sh1] == segtype[mrk, 2, sh2]) {
            bothscored <- sum(!is.na(sc[[1]][[sh1]]$geno) &
                                !is.na(sc[[2]][[sh2]]$geno))
            diff <- sc[[1]][[sh1]]$geno != sc[[2]][[sh2]]$geno
            Ndifferent <- sum(diff, na.rm=TRUE)
            fracdiff <- Ndifferent / bothscored
            #Note: this takes into account all samples if shiftParents TRUE,
            #and only F1 + other (i.e. non-progenitor) samples if shiftParents FALSE
            if (!is.na(fracdiff) && fracdiff <= fracdiff.threshold) {
              #combine the scores of both probes for this marker
              Rsegtype[mrk, sh1, sh2] <- segtype[mrk, 1, sh1]
              numcomb <- numcomb + 1 # it is possible that more than one combination
              #                        of shifts for the same SNP is possible
              #first fill in missing scores from sc[[2]]:
              combgeno <- sc[[1]][[sh1]]$geno
              w <- is.na(sc[[1]][[sh1]]$geno) & !is.na(sc[[2]][[sh2]]$geno)
              combgeno[w] <- sc[[2]][[sh2]]$geno[w]
              #next, for the different scores take the one with highest maxP:
              w <- !is.na(sc[[1]][[sh1]]$geno) &
                   !is.na(sc[[2]][[sh2]]$geno) &
                   sc[[2]][[sh2]]$geno != sc[[1]][[sh1]]$geno &
                   sc[[2]][[sh2]]$maxP > sc[[1]][[sh1]]$maxP
              combgeno[w] <- sc[[2]][[sh2]]$geno[w]
              #combine the parental consensus scores:
              parent.na <- matrix(NA, nrow=2, ncol=2)
              parent.na[1,] <- is.na(parent.sc[, 1, sh1])
              parent.na[2,] <- is.na(parent.sc[, 2, sh2])
              #parent.na has rows for probe 1 and 2, and columns for
              #parent 1 and 2, for the current set of sh1 and sh2
              combparent <- parent.sc[, 1, sh1] #both parental scores of probe 1
              #missing in probe 1 filled in from probe 2:
              w <- parent.na[1,] & !parent.na[2,]
              combparent[w] <- parent.sc[w, 2, sh2]
              #different in the two probes made missing (in contrast to sample
              #scores, for the consensus scores we don't have P-values to
              #select one):
              w <- !parent.na[1,] & !parent.na[2,] &
                   parent.sc[, 1, sh1] != parent.sc[, 2, sh2]
              combparent[w] <- NA
              #possible problem left: both probes have one (a different) parent
              #genotyped; in each is matchParent=OneOK, but the combination does
              #not match; then set both parents to NA:
              if (xor(parent.na[1,1], parent.na[1,2]) &&
                  xor(parent.na[2,1], parent.na[2,2]) &&
                  xor(parent.na[1,1], parent.na[2,1]) &&
                  getMatchParents(combparent, seginfo[[segtype[mrk, prb]]]) == "No") {
                #note that one can index a list (or vector) using the element names!
                combparent <- c(NA, NA)
              }
              if (shiftspresent) {
                mrkname_cmb <- paste(snpnames[mrk], "R",
                                     paste(c("n","s")[c(sh1, sh2)], collapse=""),
                                     sep="")
              } else {
                mrkname_cmb <- paste(snpnames[mrk], "R", sep="")
              }
              combrow <- combrow+1
              if (combrow > nrow(combscores))
                combscores <-
                  rbind(combscores,
                        makeCombscoresDf(parent1=parent1, parent2=parent2,
                                         F1=F1, ancestors=ancestors,
                                         other=other, nrow=5000)
                       )
              combscores$MarkerName[combrow] <- mrkname_cmb
              combscores$segtype[combrow] <- segtype[mrk, 1, sh1]
              combscores[combrow, 3:length(combscores)] <-
                c(combgeno[1:length(progenitors)],
                  combparent,
                  combgeno[(length(progenitors)+1):length(combgeno)])
            } #if fracdiff
          } #if segtypesame
      } #if !noprobes
    } #for mrk
    batchnr <- batchnr + 1
  } #while batch  probe.suffix[prb]
  if (combrow==0) combscores <- combscores[0,] else
    combscores <- combscores[1:combrow,]
  if (!is.na(combscorefile))
    writeDatfile(combscores, combscorefile)
  #finally we write an overview file:
  if (noprobes) {
    compstat <- NA
  } else {
    compstat <- data.frame(SNPname=snpnames)
    sufx <- matrix(c(paste(probe.suffix[1], "n", sep=""),
                     paste(probe.suffix[1], "s", sep=""),
                     paste(probe.suffix[2], "n", sep=""),
                     paste(probe.suffix[2], "s", sep="")), 2, 2, byrow=TRUE)
    max_sh <- 2
    Rsufx <- matrix(c("nn", "ns", "sn", "ss"), 2, 2, byrow=TRUE)
    if (!shiftspresent) {
      max_sh <- 1
      sufx <- substr(sufx, 1, 1)
      Rsufx <- matrix("")
    }
    for (prb in 1:2) for (sh in 1:max_sh) {
      compstat[[paste("segtype", sufx[prb, sh], sep="")]] <- segtype[, prb, sh]
      if (length(qallcol) == 1) {
        compstat[[paste("qall", sufx[prb, sh], sep="")]] <- qall[, prb, sh]
      }
    }
    for (sh1 in 1:max_sh) for (sh2 in 1:max_sh) {
      compstat[[paste("segtype", probe.suffix[3], Rsufx[sh1, sh2], sep="")]] <-
        Rsegtype[, sh1, sh2]
    }
    for (prb in 1:2) {
      compstat[[paste("count", probe.suffix[prb], sep="")]] <-
        rowSums(!is.na(segtype[, prb,]))
    }
    compstat[[paste("count", probe.suffix[3], sep="")]] <-
      apply(!is.na(Rsegtype), 1, sum)
    rownames(compstat) <- NULL
    if (!is.na(compfile))
      writeDatfile(compstat, compfile)
  } #if !noprobes
  invisible(list(combscores=combscores, compstat=compstat))
} # compareProbes

# removeRedundant ***************************************************************
#'@title Remove redundant single-probe markers
#'@description If for a SNP both probes produced very similar results, function
#'compareProbes generated an extra marker combining the results from both
#'probes. In that case the original single-probe markers are redundant and this
#'function removes those data from the compstat and combscores data frames.
#'
#'@usage removeRedundant(compstat, combscores, compfile, combscorefile)
#'
#'@param compstat a data.frame as returned by compareProbes in the
#'compstat item of the return value and in the compfile
#'@param combscores a data.frame as returned by compareProbes in the
#'@param compfile a filename to which the compstat part of the results is
#'written; if NA no file is written
#'@param combscorefile a filename to which the compscores part of the results is
#'written; if NA no file is written
#'
#'@return A list with two components, compstat and combscores. These are identical
#'to the parameters compstat and combscores, with the redundant single-probe
#'markers removed.For their contents see function compareProbes.
#'@export
removeRedundant <- function(compstat, combscores, compfile, combscorefile) {
  #make sure there is no locked file that will prevent writing the output:
  if (!is.na(compfile) && !checkFilename(compfile))
    stop("removeRedundant: compfile not valid or not writable")
  if (!is.na(combscorefile) && !checkFilename(combscorefile))
    stop("removeRedundant: combscorefile not valid or not writable")
  #get the probe.suffix letters:
  sw <- which(substr(names(compstat), 1, 7) == "segtype") #column numbers in compstat
  #make sure that empty segtype cells are NA:
  for (i in sw) {
    compstat[,i] <- as.character(compstat[,i])
    compstat[!is.na(compstat[,i]) & compstat[,i] == "", i] <- NA
    compstat[,i] <- factor(compstat[,i])
  }
  qw <- which(substr(names(compstat), 1, 4) == "qall")
  #note that qall columns may or may not be present
  sn <- names(compstat)[sw] #segtypePn ... segtypeRss, or segtypeP, segtypeQ, segtypeR
  shiftspresent <- length(sn) == 8 #2P, 2Q, 4R; else length(sn)==3: P, Q, R
  if (shiftspresent) {
    shift.suffix <- c("n", "s")
    probe.suffix <- c(substr(sn[1], 8, nchar(sn[1])-1), #P
                      substr(sn[3], 8, nchar(sn[3])-1), #Q
                      substr(sn[5], 8, nchar(sn[5])-2)) #R
    scol <- matrix(sw[1:6], 3, 2, byrow=TRUE) #probe(P/Q/R) in row, shift in col,
    #                           cell value is column nr in compstat
    #Note: for R there are 4 columns (nn/ns/sn/ss), here only 2 are listed,
    #but actually only the entry for the first R column is used
    row.names(scol) <- probe.suffix
    colnames(scol) <- shift.suffix
    if (length(qw) > 0) qcol <- matrix(qw, 2, 2, byrow=TRUE)
    max_sh <- 2 #shift 1: not shifted; shift 2: shifted
  } else {
    shift.suffix <- ""
    probe.suffix <- c(substr(sn[1], 8, nchar(sn[1])), #P
                      substr(sn[2], 8, nchar(sn[2])), #Q
                      substr(sn[3], 8, nchar(sn[3]))) #R
    scol <- matrix(sw, 3, 1) #col nrs of P/Q/R in compstat
    row.names(scol) <- probe.suffix
    if (length(qw) > 0) qcol <- matrix(qw, 2, 1, byrow=TRUE)
    max_sh <- 1
  }
  omitmrk <- character(0)
  for (sh1 in 1:max_sh) for (sh2 in 1:max_sh) {
    combsnp <- !is.na(compstat[, scol[3, 1] + 2*(sh1-1) + (sh2-1)]) # column of
    #          the R markers for this combination of P and Q shifts
    #for these markers, both the P and Q (with the same shift) are present and
    #must be omitted:
    if (sh1 != sh2) {
      #if one probe was shifted and the other not, the current combination is
      #the only possibly correct one and we omit also the other-shifted
      #versions of the single probe markers:
      omitmrk <- c(omitmrk,
                   paste(compstat$SNPname[combsnp],
                         probe.suffix[1], shift.suffix[sh1], sep=""),
                   paste(compstat$SNPname[combsnp],
                         probe.suffix[1], shift.suffix[sh2], sep=""),
                   paste(compstat$SNPname[combsnp],
                         probe.suffix[2], shift.suffix[sh1], sep=""),
                   paste(compstat$SNPname[combsnp],
                         probe.suffix[2], shift.suffix[sh2], sep=""))
      compstat[combsnp, scol[1, ]] <- NA
      compstat[combsnp, scol[2, ]] <- NA
      if (length(qw) > 0) {
        compstat[combsnp, qcol[1, ]] <- NA
        compstat[combsnp, qcol[2, ]] <- NA
      }
    } else {
      #the current combination has the same shift for both probes,
      #we delete only those versions of the probes because the other-shifted
      #version might actually be the correct one (and there the combination
      #might not have been made due to a lower or higher number of deleted
      #scores):
      omitmrk <- c(omitmrk,
                   paste(compstat$SNPname[combsnp],
                         probe.suffix[1], shift.suffix[sh1], sep=""),
                   paste(compstat$SNPname[combsnp],
                         probe.suffix[2], shift.suffix[sh2], sep=""))
      #erase the segtype and qall for the removed markers:
      compstat[combsnp, scol[1, sh1]] <- NA
      compstat[combsnp, scol[2, sh2]] <- NA
      if (length(qw) > 0) {
        compstat[combsnp, qcol[1, sh1]] <- NA
        compstat[combsnp, qcol[2, sh2]] <- NA
      }
    }
  }
  combscores <- combscores[!(combscores$MarkerName %in% omitmrk), ]
  if (!is.na(combscorefile))
    writeDatfile(combscores, combscorefile)
  #adjust the count columns in compstat:
  for (prb in 1:2) {
    compstat[[paste("count", probe.suffix[prb], sep="")]] <-
      rowSums(!is.na(compstat[, scol[prb,,drop=FALSE],drop=FALSE]))
  }
  if (shiftspresent) Rcols <- scol[3,1]+(0:3) else Rcols <- scol[3,1] #NOT +(0:1) ???
  compstat[[paste("count", probe.suffix[3], sep="")]] <-
    rowSums(!is.na(compstat[, Rcols, drop=FALSE]))
  if (!is.na(compfile))
    writeDatfile(compstat, compfile)
  invisible(list(combscores=combscores, compstat=compstat))
} #removeRedundant

#'@title Convert the F1 dosage scores data.frame or file to polymapR
#'input matrix
#'@description Functions writeDosagefile, compareProbes and removeRedundant
#'produce score files. This function reads such a file and converts the result
#'to a data.frame as needed by the polymapR package.
#'
#'@usage F1Dosages2Matrix(dosages, outfile=NA, dec=".", sep=",")
#'
#'@param dosages name of an F1 dosage scores file as produced by functions
#'writeDosagefile, compareProbes or removeRedundant, or a data frame
#'read from such a file
#'@param outfile (path and) name of an output file, default NA means that no
#'file is written. The file is written by function writeDatfile; normally using
#'the default parameters but if file extension is 'csv', quote is set to TRUE
#'and the specified dec and sep parameters are used
#'@param dec character to use as decimal separator in the output file,
#'default "."; only used if extension of outfile is 'csv' (otherwise ".")
#'@param sep character to use as field separator in the output file,
#'default ","; only used if extension of outfile is 'csv' (otherwise
#'tab character)
#'
#'@return A matrix as needed by the polymapR package: one row per marker,
#'one column per sample, containing integer scores (0 .. ploidy) or NAs.
#'Row names are marker names, column names are sample names; the first two
#'columns are the (consensus) scores of Parent1 and Parent2.\cr
#'If outfile is not NA also an output file is written: if the extension is
#''csv' a csv file is written with the specified decimal and field separators
#'(default '.' and ',') and with row and column names quoted, else a
#'tab-separated file.
#'@export
F1Dosages2Matrix <- function(dosages, outfile=NA,
                             dec=".", sep=",") {
  if (length(outfile) != 1) outfile <- outfile[1]
  if (is.null(outfile) || outfile == "") outfile <- NA
  if (!is.na(outfile) & !check.filename(outfile))
    stop(paste("F1Dosages2Matrix: cannot write file", outfile))
  if (is.character(dosages) && length(dosages) == 1)
    dosages <- readDatfile(dosages)
  if (!is.data.frame(dosages))
    stop("F1Dosages2Matrix: dosages must be a data.frame or a filename")
  parcol1 <- which(tolower(names(dosages)) == "parent1")
  pstr <- leftstr(names(dosages)[parcol1], -1) #parent in upper/lower case
  if (length(parcol1) > 1) {
    stop("F1Dosages2Matrix: dosages must have one column 'parent1'")
  } else if (length(parcol1) == 0) {
    parcol1 <- which(names(dosages) == "P1")
    if (length(parcol1) != 1)
      stop("F1Dosages2Matrix: dosages must have one column 'parent1'")
    pstr <- "P"
  }
  parcol2 <- which(names(dosages) == paste(pstr, 2, sep=""))
  if (length(parcol2) != 1 || parcol2 != parcol1 + 1)
    stop("F1Dosages2Matrix: column for parent1 and parent2 must be consecutive")
  dosages <- dosages[, -(2:(parcol1-1))]
  if (!is.na(outfile)) {
    if (tolower(tools::file_ext(outfile)) == "csv") {
      writeDatfile(dosages, outfile, quote = TRUE, sep=sep, dec=dec)
    } else writeDatfile(dosages, outfile)
  }
  mat <- as.matrix(dosages[, -1])
  rownames(mat) <- dosages[,1]
  invisible(mat)
} #F1Dosages2Matrix

# expandUnknownParents *******************************************************
#'@title Generate markers with all combinations of parental scores
#'@description For markers where the segregation type in the F1 is known but the
#'parental consensus scores are missing, this function generates multiple
#'versions of the marker, each with a different combination of parental scores
#'matching the segregation and a different suffix to the marker name.
#'
#'@usage expandUnknownParents(scores, sep="@",
#'polysomic=TRUE, disomic=FALSE, mixed=FALSE,
#'ploidy, ploidy2, scorefile)
#'
#'@param scores a data frame as returned by writeDosagefile, or the combscores
#'item in the return value of compareProbes or removeRedundant
#'@param sep a short string (one or more characters, default "@")
#'to separate the original marker name from the consecutive letters
#'(a, b, c etc) that identify the different versions of the marker. Markers
#'that already have parental scores are not modified
#'@param polysomic TRUE or FALSE; should be the same as used by checkF1
#'@param disomic TRUE or FALSE; should be the same as used by checkF1
#'@param mixed TRUE or FALSE; should be the same as used by checkF1
#'@param ploidy the ploidy of parent 1 (must be even, 2 (diploid) or larger),
#'and the same as used by checkF1
#'@param ploidy2 the ploidy of parent 2. If omitted it is assumed to be equal
#'to ploidy. Should be the same as used by checkF1
#'@param scorefile a filename to which the result is written;
#'if NA no file is written
#'
#'@return A data frame with the same format and contents as parameter scores,
#'with each marker where parental scores were missing expanded to
#'multiple rows, one per parental dosage combination
#'@export
expandUnknownParents <- function(scores,
                                 sep="@",
                                 polysomic=TRUE, disomic=FALSE, mixed=FALSE,
                                 ploidy, ploidy2,
                                 scorefile) {
  if (ploidy %% 2 != 0)
    stop("expandUnknownParents: odd parental ploidy not allowed")
  if (missing(ploidy2) || is.na(ploidy2)) ploidy2 <- ploidy else {
    if (ploidy2 %% 2 != 0)
      stop("expandUnknownParents: odd parental ploidy2 not allowed")
  }
  ploidyF1 <- (ploidy + ploidy2) / 2
  if (!polysomic && !disomic && !mixed)
    stop("expandUnknownParents: at least one of polysomic, disomic and mixed must be TRUE")
  if (!is.na(scorefile) && !checkFilename(scorefile))
    stop("expandUnknownParents: scorefile not valid or not writable")
  seginfo <- calcSegtypeInfo(ploidy, ploidy2)
  allsegtypenames <- names(seginfo)
  seginfo <- selSegtypeInfo(seginfo, polysomic, disomic, mixed)
  unpar <- which(is.na(scores$parent1) & is.na(scores$parent2))
  if (length(unpar) > 0) {
    #at least one marker needs to be expanded
    scores$MarkerName <- as.character(scores$MarkerName)
    scores$segtype <- as.character(scores$segtype)
    p12 <- which(names(scores) == "parent1")
    p12 <- p12:(p12+1)
    newscores <- scores[0,] # an empty data frame with same columns
    for (m in unpar) {
      mname <- scores$MarkerName[m]
      parscores <- seginfo[[scores$segtype[m]]]$pardosage
      #we put the first version into the old scores data frame:
      scores$MarkerName[m] <- paste(mname, "a", sep=sep)
      scores[m, p12] <- parscores[1,]
      #if there are more versions we add those to newscores:
      if (nrow(parscores) > 1) {
        mscores <- scores[rep(m, nrow(parscores)-1),] #copy the scores row several times
        mscores$MarkerName <- paste(mname, letters[2:nrow(parscores)], sep=sep)
        mscores[,p12] <- parscores[-1,]
        newscores <- rbind(newscores, mscores)
      }
    } # for m in unpar
    if (nrow(newscores) > 0) {
      scores <- rbind(scores, newscores)
      scores <- scores[order(scores$MarkerName),]
    }
  } #length(unpar)>0
  if (!is.na(scorefile)) writeDatfile(scores, scorefile)
  scores
} #expandUnknownParents


# checkFilename *************************************************************
#'@title Check if a file can be created
#'@description Checks if a file with that name can be created.
#'If successful, any pre-existing file does not exist any more.
#'
#'@usage checkFilename(filename, overwrite=TRUE)
#'
#'@param filename a file name with or without path. If the name contains a
#'path the entire path must already exist, else the file cannot be created
#'and the result is FALSE.
#'@param overwrite if TRUE an attempt is made to delete any pre-existing file
#'of that name; the function returns FALSE if the file is locked or the user
#'has no rights to delete the file. If FALSE the function returns FALSE
#'if any file of that name already exists, and no attempt is made to remove it
#'
#'@return TRUE if a new file can be created, else FALSE. If TRUE, no file of
#'that name exists (any more)
#'@export
checkFilename <- function(filename, overwrite=TRUE) {
  filename <- filename[1] # we test only the first filename
  if (!overwrite && file.exists(filename)) return(FALSE)
  tryCatch(
    { suppressWarnings({
      if (file.exists(filename)) file.remove(filename)
      if (file.create(filename)) {
        file.remove(filename)
        TRUE
      } else FALSE})
    }, error = function(e) { FALSE }
  )
} #checkFilename

# scores2wide *************************************************************
#'@title Convert a scores file from long to wide format
#'@description The fitMarkers function of package fitPoly returns a
#'scores file in "long" format: one row for each MarkerName / SampleName
#'combination. This function creates a file in "wide" format (samples in
#'columns, markers in rows) containing the data from the geno column of the
#'scores file. Faster, less memory-expensive and easier to use than reshape.
#'
#'@usage scores2wide(scores, outfile=NA)
#'
#'@param scores a data frame read from the scorefile output of saveMarkeModels
#'in package fitPoly. The order of the markers within each sample does not have
#'to be the same or v.v., there is no requirement for any ordering of scores.
#'However if scores is not ordered by MarkerName and SampleName and/or not all
#'markers are present for all samples (or v.v.) this results in a slower
#'processing.
#'@param outfile a file name to which the result is written. If NA no file
#'is written.
#'
#'@return A data frame with column names marker, MarkerName and all sample
#'names, and one row per marker. Marker is the sequential number of the marker
#'as reported in the scores data frame.
#'@export
scores2wide <- function(scores, outfile=NA) {
  outfile <- outfile[1]
  if (!is.na(outfile) && !checkFilename(outfile))
    stop("scores2wide: outfile is not a valid or writable file")
  #convert Markername and sample to factors, eliminating unused levels and
  #with levels sorted according to the current locale:
  markernrs <- scores$marker
  sampnames <- as.character(unique(scores$SampleName))
  mrknames <- as.character(unique(scores$MarkerName))
  #get the original marker(number) of each marker present:
  marknrs <- tapply(markernrs, factor(scores$MarkerName),
                    FUN=function(x) mean(x, na.rm=TRUE))
  rm(markernrs)
  #check if scores is complete and sorted:
  if (nrow(scores) > length(sampnames) * length(mrknames))
    stop("scores2wide: each combination of markername and sample may occur only once")
  df <- expand.grid(SampleName=sampnames, MarkerName=mrknames,
                    stringsAsFactors=FALSE) #needed!
  if (nrow(scores) == length(sampnames) * length(mrknames)) {
    #sort scores: it is important that within all samples all markers are in the
    #same order:
    if (sum(df$MarkerName != scores$MarkerName) > 0 ||
        sum(df$SampleName != scores$SampleName) > 0) {
      scores <- scores[order(scores$MarkerName, scores$SampleName),]
    }
  } else {
    #scores incomplete: not all combinations of markername and sample occur
    #(and some might be double, but we don't check for that)
    #we make it complete and sort it:
    scores <- merge(x=scores, y=df, all.y=TRUE, sort=TRUE)
  }
  #convert the geno column to wide format:
  wide <- matrix(integer(length(mrknames) * length(sampnames)),
                 nrow=length(mrknames), ncol=length(sampnames))
  keepcols <- match(c("MarkerName", "SampleName", "geno"), names(scores))
  batchsize <- 10 #there are usually far less samples than markers
  maxsamp <- 0
  while (maxsamp < length(sampnames)) {
    minsamp <- maxsamp + 1
    maxsamp <- min(minsamp + batchsize - 1, length(sampnames))
    batchscores <- scores[scores$SampleName %in% sampnames[minsamp:maxsamp],
                          keepcols]
    for (samp in minsamp:maxsamp) {
      sampscores <- batchscores[batchscores$SampleName == sampnames[samp],]
      # if (checksc) {
      #   missmrk <- setdiff(1:length(mrknames), as.integer(sampscores$MarkerName))
      #   if (length(missmrk) > 0) {
      #     sampscores <- sampscores[1:length(mrknames),] #extra rows with NA
      #     sampscores$MarkerName[(length(mrknames)-length(missmrk)+1):
      #                             length(mrknames)] <- mrknames[missmrk]
      #   }
      # }
      # if (samp == 1) {
      #   #check if markers in order of mrknames and if not, rearrange mrknames:
      #   marknrs <- sampscores$marker
      #   if (setequal(sampscores$MarkerName, mrknames)) {
      #     mrknames <- as.character(sampscores$MarkerName)
      #   } else stop ("scores2wide: all markers must be present for all samples and v.v.")
      # } else {
      #   #samp > 1, markers must be (put) in same order
      #   if (!setequal(sampscores$MarkerName, mrknames))
      #     stop ("scores2wide: all markers must be present for all samples and v.v.")
      #   if (sum(sampscores$MarkerName != mrknames) > 0)
      #     sampscores <- sampscores[match(mrknames, sampscores$MarkerName),]
      # }
      wide[, samp] <- sampscores$geno
    }
  }
  wide <- data.frame(marker = marknrs,
                     MarkerName = mrknames,
                     wide)
  colnames(wide) <- c("marker", "MarkerName", sampnames)
  rownames(wide) <- NULL
  if (!is.na(outfile)) {
    writeDatfile(wide, outfile)
  }
  invisible(wide)
} #scores2wide

# samplestats *************************************************************
#'@title Statistics for each F1 sample over all markers
#'@description Tabulate for each F1 sample in how many markers it has a
#'missing, invalid or valid score.
#'
#'@usage samplestats(chk, scores, F1, qall_flavor="qall_mult",
#'qall_threshold=0, ploidy, ploidy2, scores_long=TRUE)
#'
#'@param chk a data frame as returned by checkF1
#'@param scores A data frame as read from the scores file produced by function
#'fitMarkers of package fitPoly (or a subset with at least columns
#'MarkerName, SampleName and geno), or a data frame as returned by function
#'scores2wide. In the first case (default) parameter scores_long must be TRUE,
#'in the second case it must be FALSE.
#'@param F1 character vector with the sample names of the F1 individuals
#'@param qall_flavor which quality parameter column must be shown in compfile,
#'default "qall_mult". If no quality data are wanted, specify "".
#'@param qall_threshold only markers with a qall score > qall.threshold are
#'included in the tabulation
#'@param ploidy the ploidy of parent 1 (must be even, 2 (diploid) or larger),
#'and the same as used by checkF1 to calculate the chk data frame
#'@param ploidy2 the ploidy of parent 2. If omitted it is assumed to be equal
#'to ploidy. Should be the same as used by checkF1 to calculate the chk
#'data frame
#'@param scores_long TRUE if scores is in "long format", FALSE if it is in
#'"wide format" (see parameter scores)
#'
#'@return A matrix with samples in rows and 3 columns: missing, invalid, valid,
#'giving for each sample the number of markers where it has a missing, invalid
#'or valid dosage score
#'@export
samplestats <- function(chk, scores,
                        F1,
                        qall_flavor="qall_mult",
                        qall_threshold=0,
                        ploidy, ploidy2,
                        scores_long=TRUE) {
  if (ploidy %% 2 != 0) stop("samplestats: odd parental ploidy not allowed")
  if (missing(ploidy2) || is.na(ploidy2)) ploidy2 <- ploidy else
    if (ploidy2 %% 2 != 0) stop("samplestats: odd parental ploidy2 not allowed")
  ploidyF1 <- (ploidy + ploidy2) / 2
  if (scores_long) scores <- scores2wide(scores)
  #select and sort scores to have only the markers with qall>=threshold in chk,
  #with markers sorted as chk and individuals as F1
  #delete from chk all rows without a segtype (mostly those that are not fitted by fitPoly):
  chk <- chk[!is.na(chk$bestParentfit) & chk$bestParentfit != "",]
  qallcol <- which(tolower(names(chk)) == qall_flavor)
  if (length(qallcol) > 1) qallcol <- qallcol[1]
  chk <- chk[chk[,qallcol] > qall_threshold,]
  F1 <- intersect(F1, colnames(scores)) #keeps the order of F1
  scores_samp <- names(scores)[3:length(scores)] #ALL samples as ordered in scores
  scores <- scores[,c(1,2, match(F1, colnames(scores)))]
  mrk <- intersect(chk$MarkerName, scores$MarkerName) #keeps order of markers in chk
  scores <- scores[match(mrk, scores$MarkerName),]
  #calculate a list of valid dosage scores for each marker in chk:
  validscores <- list()
  validscores[[nrow(chk)]] <- integer(0)
  seginfo <- calcSegtypeInfo(ploidy, ploidy2)

  #nested function, to use in lapply
  get.expgeno <- function(segtypenr) { seginfo[[segtypenr]]$expgeno }

  mrk.segtype <- match(chk$bestParentfit, names(seginfo))
  if (sum(is.na(mrk.segtype)) > 0)
    stop("samplestats: not all segtypes in chk are valid, ploidy or ploidy2 are incorrect")
  mrk.expgeno <- lapply(as.list(mrk.segtype), FUN=get.expgeno)
  #calculate the totals:
  result <- matrix(integer(length(F1)*3), nrow=length(F1), ncol=3)
  rownames(result) <- F1
  colnames(result) <- c("missing", "invalid", "valid")
  for (i in seq_along(F1)) {
    sc <- scores[,i+2]
    result[i,1] <- sum(is.na(sc))
    result[i,3] <- sum(mapply("%in%", sc, mrk.expgeno), na.rm=TRUE)
  }
  result[,2] <- nrow(chk) - rowSums(result)
  result
} #samplestats

# selMarkers_qall *********************************************************
#'@title Sample markers at several qall levels
#'@description In order to get an idea of the quality of markers with
#'different levels of qall, this function samples marker numbers at a
#'specified set of qall values.
#'
#'@usage selMarkers_qall(chk, qall_levels, mrkperlevel=1,
#'qall_flavor="qall_mult")
#'
#'@param chk a data frame as returned by checkF1
#'@param qall_levels a numeric vector with the levels of the quality
#'parameter (qall_mult or qall_weights, see parameter qall_flavo)
#'at which markers should be selected
#'@param mrkperlevel the number of markers to select per level of qall
#'@param qall_flavor which qall to use: by default qall_mult,
#'also qall_weights is possible
#'
#'@return A subset of data frame chk with selected rows, in ascending order
#'of the qall_flavor column. The first <mrkperlevel> markers with qall >=
#'each qall.level are returned; duplicated selections are removed (so the
#'number of returned markers may be less than
#'length(qall_levels) * mrkperlevel)
#'@export
selMarkers_qall <- function(chk, qall_levels, mrkperlevel=1,
                            qall_flavor="qall_mult") {
  qallcol <- which(names(chk) == qall_flavor)
  if (length(qallcol) != 1) stop ("selMarkers_qall: invalid qall_flavor")
  chk <- chk[!is.na(chk[,qallcol]),]
  if (is.factor(chk[,qallcol])) chk[,qallcol] <- as.character(chk[,qallcol])
  if (is.character(chk[,qallcol])) {
    chk[,qallcol][chk[,qallcol] %in% c("", "NA")] <- NA
    chk[,qallcol] <- as.numeric(chk[,qallcol])
  }
  chk <- chk[order(chk[,qallcol]),]
  qall.levels <- sort(qall_levels)
  firstsel <- integer(0)
  for (i in seq_along(qall.levels)) {
    if (chk[nrow(chk), qallcol] < qall.levels[i]) {
      firstsel[i] <- nrow(chk) + 1 #no qall below threshold
    } else {
      firstsel[i] <- min(which(chk[,qallcol] >= qall.levels[i]))
    }
  }
  sel <- integer(0)
  for (i in 1:mrkperlevel) sel <- c(sel, firstsel + i - 1)
  sel <- sort(sel[!is.na(sel)])
  sel <- sel[sel <= nrow(chk)]
  sel <- sel[!duplicated(sel)]
  chk[sel,]
} #selMarkers_qall

# combineFiles *********************************************************
#'@title Combine the X and Y intensity scores and the assigned dosage in one file
#'@description This function combines the X and Y intensity values from the
#'fitPoly input with the geno (assigned dosage) from the fitPoly scores
#'output. Useful for producing XY scatterplots with samples colored
#'according to the assigned dosage.
#'
#'@usage combineFiles(XYdata, scores, controls=character(0))
#'
#'@param XYdata data.frame with (at least) columns MarkerName,
#'SampleName, X, Y (if present, R and ratio are also copied)
#'@param scores data.frame with scores produced by the
#'fitMarkers function of package fitPoly. It has columns MarkerName
#'and SampleName that are subsets of MarkerName and SampleName in the XYdata,
#'and at least a column geno
#'@param controls a character vector of sample names. The geno (dosage)
#'value of these samples is set to NA
#'@return a dataframe with columns MarkerName, SampleName, X, Y (and R and
#'ratio if these columns were present in XYdata), and geno, with all markers
#'and samples from XYdata. The value of geno is set to NA for all MarkerName /
#'sample combinations not present in scores, and for all samples in controls.
#'@export
combineFiles <- function(XYdata, scores, controls=character(0)) {
  #NOTE: $geno is NOT increased by 1 (as it was in earlier versions)
  if (!is.data.frame(XYdata) ||
      length(which(names(XYdata) %in% c("MarkerName", "SampleName", "X", "Y")))
      != 4)
    stop("XYdata must be a data.frame with at least columns MarkerName, SampleName, X and Y")
  XYcolumns <- match(c("MarkerName", "SampleName", "X", "Y", "R", "ratio"),
                     names(XYdata))
  XYcolumns <- XYcolumns[!is.na(XYcolumns)]
  XYdata <- XYdata[, XYcolumns]
  if (!is.data.frame(scores) ||
      length(which(names(scores) %in% c("MarkerName", "SampleName", "geno")))
      != 3)
    stop("scores must be a data.frame with at least columns MarkerName, SampleName and geno")
  scores <- data.frame(MarkerName=scores$MarkerName,
                       SampleName=scores$SampleName,
                       geno=scores$geno)
  alldat <- merge(XYdata, scores, by=c("MarkerName", "SampleName"), all=TRUE)
  alldat$MarkerName <- factor(alldat$MarkerName)
  alldat$geno[alldat$SampleName %in% controls] <- NA
  alldat$SampleName <- factor(alldat$SampleName)
  alldat
} #combineFiles

# correctDosages *********************************************************
calcPotShifts <- function(segtypes, ploidy) {
  #calculate potential shifts in dosages
  #segtypes: vector of segregation types
  #result: data.frame with columns
  #  segtype: the segtype code before the underscore i.e. without the
  #           first dosage class
  #  mindos: the first dosage class
  #  shift: -1, 0 or 1: potental shifts to try out
  # for each segtype whether a shift of +/- 1 should be
  #tried or not (0)
  #(only when number of dosages in segtype is <= ploidy-2 and
  #min.dosage==1 or max.dosage==ploidy-1)
  shift <- rep(0, length(segtypes))
  segtypes[is.na(segtypes) | segtypes==""] <- "x_0" #avoid problems with strsplit
  segtypes <- do.call(rbind, strsplit(as.character(segtypes), split="_"))
  numdos <- nchar(segtypes[,1]) #number of dosage classes in segtype
  compl <- substr(segtypes[,1],1,1) == "c"
  lencompl <- as.integer(substr(segtypes[compl, 1], 2, numdos[compl]-1))
  #           nr of dosages in complex segtype
  numdos[compl] <- lencompl
  mindos <- as.integer(segtypes[,2])
  #shift down: numdos <= ploidy-2 and lowest dosage = 1:
  shift[numdos <= ploidy-2 & mindos == 1] <- -1
  #shift up: numdos <= ploidy-2 and highest dosage = ploidy-1:
  maxdos <- mindos + numdos - 1
  shift[numdos <= ploidy-2 & maxdos == ploidy-1] <- 1
  data.frame(seg=segtypes[,1], mindos=mindos, shift=shift,
             stringsAsFactors=FALSE)
}

#'@title Check if dosage scores may have to be shifted
#'@description fitPoly sometimes uses a "shifted" model to assign dosage
#'scores (e.g. all samples are assigned a dosage one higher than the true
#'dosage). This happens mostly when there are only few dosages present
#'among the samples. This function checks if a shift of +/-1 is possible.
#'
#'@usage correctDosages(chk, scores, parent1, parent2, ploidy, ploidy2,
#'polysomic=TRUE, disomic=FALSE, mixed=FALSE, parentsScoredWithF1,
#'absent.threshold=0.04, outfile=NA)
#'
#'@param chk data frame returned by function checkF1 when called without
#'shiftmarkers
#'@param scores data.frame with scores as produced by the fitMarkers
#'function of package fitPoly; at least columns MarkerName, SampleName and geno
#'must be present, any other columns are ignored
#'@param parent1 character vector with names of the samples of parent 1
#'@param parent2 character vector with names of the samples of parent 2
#'@param ploidy The ploidy of parent 1 (must be even, 2 (diploid) or larger).
#'@param ploidy2 The ploidy of parent 2. If omitted it is
#'assumed to be equal to ploidy.
#'@param polysomic if TRUE at least all polysomic segtypes are considered;
#'if FALSE these are not specifically selected (but if e.g. disomic is TRUE,
#'any polysomic segtypes that are also disomic will still be considered);
#'same as used in the call to checkF1 that generated data.frame chk
#'@param disomic if TRUE at least all disomic segtypes are considered (see
#'param polysomic); same as used in the call to checkF1 that generated
#'data.frame chk
#'@param mixed if TRUE at least all mixed segtypes are considered (see
#'param polysomic). A mixed segtype occurs when inheritance in one parent is
#'polysomic (random chromosome pairing) and in the other parent disomic (fully
#'preferential chromosome pairing); same as used in the call to checkF1 that
#'generated data.frame chk
#'@param parentsScoredWithF1 single logical. TRUE means that parents are scored
#'in the same experiment and the same fitPoly run as the F1, else FALSE.
#'If missing and ploidy2==ploidy, TRUE is assumed.
#'If FALSE, parental scores will not be shifted along with the F1 scores.
#'@param absent.threshold the threshold for the fraction of ALL samples
#'that has the dosage that is assumed to be absent due to mis-fitting of
#'fitPoly; should be at least the assumed error rate of the fitPoly scoring
#'assuming the fitted model is correct
#'@param outfile file name to which the result is written. If NA no file
#'is written.
#'
#'@return a data frame with columns
#'\itemize{
#'\item{markername}
#'\item{segtype: the bestfit (not bestParentfit!) segtype from chk}
#'\item{parent1, parent2: the consensus parental dosages; possibly
#'low-confidence, so may be different from those reported in chk}
#'\item{shift: -1, 0 or 1: the amount by which this marker should be shifted}
#'}
#'The next fields are only calculated if shift is not 0:
#'\itemize{
#'\item{fracNotOk: the fraction of ALL samples that are in the dosage
#'(0 or ploidy) that should be empty if the marker is indeed shifted.}
#'\item{parNA: the number of parental dosages that is missing (0, 1 or 2)}
#'}
#'@details A shift of -1 (or +1) is proposed when (1) the fraction of all
#'samples with dosage 0 (or ploidy) is below absent.threshold, (2) the
#'bestfit (not bestParentfit!) segtype in chk has one empty dosage on the
#'low (or high) side and more than one empty dosage at the high (or low) side,
#'and (3) the shifted consensus parental dosages do not conflict with the
#'shifted segregation type.\cr
#'The returned data.frame (or a subset, e.g. based on the values in the
#'fracNotOk and parNA columns) can serve as parameter shiftmarkers in a
#'new call to checkF1.\cr
#'Based on the quality scores assigned by checkF1 to
#'the original and shifted versions of each marker the user can decide if
#'either or both should be kept. A data.frame combining selected rows
#'of the original and shifted versions of the checkF1 output (which may
#'contain both a shifted and an unshifted version of some markers) can then be
#'used as input to compareProbes or writeDosagefile.

#'@export
correctDosages <- function(chk, scores, parent1, parent2, ploidy, ploidy2,
                           polysomic=TRUE, disomic=FALSE, mixed=FALSE,
                           parentsScoredWithF1,
                           absent.threshold=0.04, outfile=NA) {
  #This function identifies markers that are probably misscored by fitPoly
  #(bestfit in 1_1, 1_3, 11_1, 11_2 (if ploidy==4) and parents in same dosage
  #classes;
  # note that 1_1 and 1_3 can only occur if disomic=TRUE in check.F1)
  #and produce data to suggests a revised scoring for F1 and parent, based on
  #the segregation in an F1 population and the scores in all other samples
  if (missing(ploidy2) || is.na(ploidy2)) {
    ploidy2 <- ploidy
    if (missing(parentsScoredWithF1)) parentsScoredWithF1 <- TRUE
  }  else
    if (ploidy2 %% 2 != 0) stop("checkF1: odd ploidy2 not allowed")
  ploidyF1 <- (ploidy + ploidy2) / 2

  if (is.na(outfile)) outfile <- ""
  if (outfile != "" && !checkFilename(outfile)) {
    stop("correctDosages: invalid outfile")
  }
  #if (ploidy %% 2 != 0)
  #  stop("correctDosages: only even ploidy allowed (parents must have been genotyped together with F1")
  if (sum(is.na(match(c(parent1, parent2), names(chk)))) > 0)
    stop("correctDosages: not all parental samples in names of chk")
  if ("shift" %in% names(chk))
    stop("correctDosages: chk must be calculated without shift")
  seginfo <- segtypeInfoSummary(calcSegtypeInfo(ploidy, ploidy2))
  seginfo <- seginfo[(polysomic & seginfo$par.poly==1) |
                       (disomic & seginfo$par.di==1) |
                       (mixed & seginfo$par.mixed==1),]
  chk <- chk2integer(chk)
  if (is.factor(chk$bestfit)) chk$bestfit <- as.character(chk$bestfit)
  NAcol <- which(names(chk) == "F1_NA") # last before parental samples

  #calculate a (possibly very low conf) new parental consensus:
  #(the one in chk may be based partly on bestParentfit)
  for (p in 1:2) {
    if (p==1) parent <- parent1 else parent <- parent2
    pc <- which(names(chk) == "parent1") - 1 + p
    chk[,pc] <- rep(NA, nrow(chk))
    if (length(parent) > 0) {
      #for calculating the consensus score (pcons) we need to calculate the
      #parallel min and max over the columns of the parental samples;
      pcons <- apply(chk[,(NAcol+1):(NAcol+length(parent)), drop=FALSE],
                     1, min) #no na.rm=TRUE !!
      rw <- !is.na(pcons) &  pcons ==
        apply(chk[,(NAcol+1):(NAcol+length(parent)), drop=FALSE], 1, max)
      chk[rw, pc] <- pcons[rw] #so the "old" parent1 and parent2 columns now
      #                         have a new type of consensus score
      NAcol <- NAcol + length(parent) #startpoint for parent2
    }
  }
  checksamp <- as.character(unique(scores$SampleName))
  if (!parentsScoredWithF1)
    checksamp <- setdiff(checksamp, c(parent1, parent2))
  scores <- scores[scores$MarkerName %in% as.character(chk$MarkerName) &
                     scores$SampleName %in% checksamp,]
  scores$MarkerName <- factor(scores$MarkerName) #drop unused levels
  nonNA <- function(x) sum(!is.na(x))
  is0 <- function(x) sum(x==0, na.rm=TRUE)
  isploidy <- function(x) sum(x==ploidyF1, na.rm=TRUE)
  scores.nonNA <- tapply(scores$geno, scores$MarkerName, FUN=nonNA)
  scores.frc0 <- tapply(scores$geno, scores$MarkerName, FUN=is0)
  scores.frc0 <- scores.frc0 / scores.nonNA
  scores.frcploidy <- tapply(scores$geno, scores$MarkerName, FUN=isploidy)
  scores.frcploidy <- scores.frcploidy / scores.nonNA
  scores.frc0 <- scores.frc0[order(names(scores.frc0))] #needed!
  scores.frcploidy <- scores.frcploidy[order(names(scores.frcploidy))] #needed!
  #these are now in alphabetical order; do the same with the chk data frame:
  ch.ord <- order(chk$MarkerName) #we need this later
  chk <- chk[ch.ord,]
  potShifts <- calcPotShifts(chk$bestfit, ploidyF1) #potential shift value of
  #                                                  segtype for each marker
  #for which rows do we need to check a potential shift?
  mf <- !is.na(potShifts$shift) &
    ( (potShifts$shift == -1 & scores.frc0 <= absent.threshold) |
        (potShifts$shift == 1 & scores.frcploidy <= absent.threshold) )
  if (parentsScoredWithF1) {
    shfpar <- cbind(chk$parent1 + potShifts$shift,
                    chk$parent2 + potShifts$shift)[mf,]
  } else shfpar <- cbind(chk$parent1, chk$parent2)[mf,]
  # (non)shifted parental consensus dosages, only for the mf markers
  shf_segtype <- paste(potShifts[mf, 1],
                       (potShifts[mf, 2] + potShifts[mf, 3]), sep= "_")
  #    shifted segtype: original first dosage + shift, only for the mf markers

  match_p_seg <- function(segtype, p1, p2) {
    # for one segtype: do the parental consensus dosages (p12) match it?
    if (is.na(p1) && is.na(p2)) return(TRUE)
    si <- seginfo[seginfo$segtype==segtype, 3:4] #parental dosage columns
    if (is.na(p1)) return(p2 %in% si[,2])
    #p1 now not NA
    if (is.na(p2)) return(p1 %in% si[,1])
    return(any(si[,1]==p1 & si[,2]==p2))
  }

  goodshifts <- as.logical(mapply(FUN=match_p_seg,
                                  shf_segtype, shfpar[,1], shfpar[,2]))
  #       as.logical needed because mapply returns an empty list if mf all FALSE
  shiftmf <- rep(0, sum(mf))
  shiftmf[goodshifts] <- potShifts$shift[mf][goodshifts]
  p1namf <- p2namf <- onepokmf <- bothpokmf <- bothpnamf <- rep(NA, sum(mf))
  p1namf[goodshifts] <- is.na(shfpar[goodshifts, 1])
  p2namf[goodshifts] <- is.na(shfpar[goodshifts, 2])
  nmrk <- nrow(chk)
  P1na = rep(NA, nmrk)
  P2na = rep(NA, nmrk)
  result <- data.frame(
    MarkerName = chk$MarkerName,
    segtype = chk$bestfit,
    parent1 = chk$parent1,
    parent2 = chk$parent2,
    shift = rep(0, nmrk),
    fracNotOk = rep(NA, nmrk) #fraction of all samples in the forbidden class
    #                           (0 or ploidyF1)
  )
  result$shift[mf] <- shiftmf
  #the next columns will get non-NA values only for accepted shifts
  result$fracNotOk[result$shift == -1] <- scores.frc0[result$shift == -1]
  result$fracNotOk[result$shift == 1] <- scores.frcploidy[result$shift == 1]
  P1na[mf] <- p1namf
  P2na[mf] <- p2namf
  result$ParNA <- P1na + P2na #0, 1 or 2

  #sort result into same order as original chk:
  result <- result[order(ch.ord),]
  if (outfile != "")
    writeDatfile(result, outfile)
  invisible(result)
} #correctDosages

#Functions to recover from a fitMarkers crash when the temporary batchfiles
#(RData files) are still there:

#'@title get the names of all batch files present
#'@description This function gets the names of all batch files already produced
#'by fitMarkers; used to recover data after a crash.
#'@usage getBatchFiles(filePrefix, set)
#'@param filePrefix the same filePrefix as used in the fitMarkers call
#'@param set one of "models" or "scores"
#'@return a vector of batch file names, either the scores set or the models set
#'@export
getBatchFiles <- function(filePrefix, set) {
  #filePrefix is the same as that for fitMarkers, including path if needed
  #set should be one of "models" or "scores"
  #result: a list of all temporary batch-filenames for either scores or models
  allfiles <- dir()
  fname_start <- paste(filePrefix, "_sMM", set, sep="")
  len <- nchar(fname_start)
  sort(allfiles[substr(allfiles,1,len) == fname_start])
} #getBatchFiles

#'@title Construct the log, models and scores files from a set of batch files
#'@description Get the data saved by fitMarkers in batch files before it
#'crashed and construct the log, models and score files just as fitMarkers
#'would have done
#'@usage concatbatch(batchfiles)
#'@param batchfiles a vector of batch file names as returned by getBatchFiles
#'@details all batch files are assumed to contain a list with the same number
#'of element, each either a character vector or a data.frame. This function
#'concatenates the elements across the batch files.\cr
#'This may be useful if fitMarkers has already been running a long time
#'and then crashed. The partial (or perhaps complete) data can then be
#'recovered from the saved batch files, so that only the remaining markers
#'(if any) need to be processed afterwards.
#'@return a list with the concatenated elements: character vectors and/or
#'data.frames
#'@export
concatbatch <- function(batchfiles) {
  #batchfiles is a vector of filenames as returned by getBatchFiles
  #each of the files in batchfiles are assumed to contain a list; in each case
  #with the same elements (data.frames and/or vectors)
  #The result is a similar list where each element is a concatenation
  #(with rbind() or c()) of the list elements from the batchfiles
  #These can then be saved to files, after sorting if desired. They contain
  #the same information as the output files of fitMarkers, but only for
  #the completed batches.
  vars <- load(batchfiles[1])
  if (length(vars) != 1) stop("batchfiles not supported")
  allbat <- get(vars)
  if (!is.list(allbat)) stop("batchfiles not supported")
  for (f in 2:length(batchfiles)) {
    print(paste("f =", f))
    vf <- load(batchfiles[f])
    if (length(vf) != 1) stop("batchfiles different types")
    nwbat <- get(vf)
    if (!is.list(nwbat) || !identical(names(allbat), names(nwbat)))
      stop("batchfiles different types")
    for (i in 1:length(nwbat)) {
      if (is.data.frame(nwbat[[i]])) {
        allbat[[i]] <- rbind(allbat[[i]], nwbat[[i]])
      } else if (is.vector(nwbat[[i]]))
        allbat[[i]] <- c(allbat[[i]], nwbat[[i]])
    }
  }
  allbat
}