R/fitPolyTools.R

Defines functions concatbatch getBatchFiles correctDosages calcPotShifts combineFiles selMarkers_qall samplestats scores2wide checkFilename expandUnknownParents F1Dosages2Matrix removeRedundant compareProbes writeDosagefile makeCombscoresDf trim_shf getPargeno chk2integer rightstr leftstr checkF1 calc_qall calc_q3 calc_q2 calc_q1 linterpol checkSelfing shiftdosages getConsensusGeno getMatchParents segtypeInfoSummary selSegtypeInfo listSegtypes calcSegtypeInfo makeProgeny digamfrq polygamfrq gcd_all gcd padded writeDatfile readDatfile check.filename makeFitPolyFiles allDevOff drawXYplots XY_plot calc.R.threshold get.genocol get.genocol.old selMarkers_byR calcRstats splitNrenameSamples check.sampletable removeNArows wide2longAxiom readAxiomSummary wide2longGS readFullDataTable

Documented in calcRstats calcSegtypeInfo checkF1 checkFilename combineFiles compareProbes concatbatch correctDosages drawXYplots expandUnknownParents F1Dosages2Matrix getBatchFiles get.genocol leftstr listSegtypes makeFitPolyFiles readAxiomSummary readDatfile readFullDataTable removeRedundant rightstr samplestats scores2wide segtypeInfoSummary selMarkers_byR selMarkers_qall selSegtypeInfo splitNrenameSamples writeDatfile writeDosagefile XY_plot

#'fitPolyTools: a package with functions related to package fitPoly
#'
#'Package fitPolyTools offers some functions for preparing and selecting
#'input data for package fitPoly, and for inspection and analysis
#'of polyploid dosage calls as produced by fitPoly. These analyses focus
#'on the segregation of dosages in an F1 progeny of two parents. The functions
#'can be grouped as follows:
#'
#'@section data preparation functions:
#'Functions readFullDataTable and readAxiomSummary convert files generated by
#'Illumina's GenomeStudio and Affymetrix Power Tools software to the input
#'format for fitPoly::fitPoly and fitPoly::fitMarkers.
#'
#'@section data visualization functions:
#'Functions XYplot and drawXYplots draw scatterplots of the samples for one
#'marker at a time, possibly with colors indicating the assigned dosages and/or
#'highlighting specific samples. Function combineFiles combines X and Y columns
#'from one file or data.frame with column geno from another.
#'
#'@section segregation type functions:
#'The functions calcSegtypeInfo, selSegtypeInfo and segtypeInfoSummary
#'produce information on the possible F1 segregation types. These functions
#'are primarily developed for use in other functions but are also available
#'for users.
#'
#'@section identification of the probable segregation types:
#'Function checkF1 gives for all markers the most likely (best fitting)
#'segregation type, based on the F1 and the parental samples. Function
#'correctDosages checks for possibly "shifted" markers (e.g. all dosages
#'scored one lower or higher than the true dosage).
#' Function checkF1 can apply shifts when specified.
#'
#'@section combining data from two probes for the same SNP:
#'On Affymetrix Axiom SNP arrays it is possible to have the same SNP
#'interrogated from both sides by two different probes. Function compareProbes
#'compares the results from both probes and produces a new, merged marker if
#'they are sufficiently similar. It also writes the results to a data file
#'suitable for linkage mapping functions (package polymapR).\cr
#'Function removeRedundant takes the result of compareProbes; if a merged
#'marker is present it removes the corresponding separate probe markers.
#'
#'@section producing a data file for mapping:
#'Function writeDosagefile produces a data file suitable for linkage mapping
#'functions (package polymapR). This data file has the same format as that
#'produced by compareProbes and removeRedundant (see previous section).\cr
#'Function F1Dosages2Matrix reads such a dosages file and returns a matrix
#'in the format required by the polymapR package.
#'
#'@section reformatting of fitPoly scores file:
#'Function scores2wide reformats the scores file produced by the
#'fitMarkers function of the fitPoly package to "wide" format.
#'
#'@section recovery functions:
#'Functions getBatchfiles and concatbatch allow to recover results from already
#'processed batches of markers, in case fitMarkers crashes.
#'
#'@section general tools:
#'These tools may be useful by themselves and are therefore made available too.
#'They include leftstr and rightstr: two functions that return substrings from
#'the left or right side, and function checkFilename which tests if a file
#'with the name can be created. Functions readDatfile and writeDatfile are
#'wrappers for read.table and write.table with different default options,
#'suitable for the file formats expected or produced by functions in this
#'package oand fitPoly.
#'
#'@docType package
#'@name fitPolyTools
NULL

# Roeland Voorrips, Wageningen UR - Plant Breeding

# fitPolyTools contains functions for
# - conversion of data files from array software formats to fitPoly input format
# - drawing XY-scatterplots with or without genotype dosage colors
# - checking dosage scores against expected F1 segregation patterns
# - detecting and correcting "shifted" markers
# - combining scores from two probes (assays) of the same SNP
# - writing genotype files for F1 populations
# - recovering data from temporary files of fitPoly::fitMarkers
#   in case it crashes

#'@importFrom grDevices col2rgb dev.off dev.list png rgb
#'@importFrom graphics layout lines plot points
#'@importFrom stats chisq.test pbinom quantile
#'@importFrom utils read.table write.table

#'@title A data set containing SNP array data
#'@description A data frame containing SNP array signal intensity data
#'@details XYdat contains the following columns:
#'\itemize{
#'  \item MarkerName: factor, names of the markers (SNPs)
#'  \item SampleName: factor names of the individuals / DNA samples
#'  \item X: numeric, the signal intensities of the X allele
#'  \item Y: numeric, the signal intensities of the Y allele
#'  \item R: numeric, the total signal intensities; R = X + Y
#'  \item ratio: numeric, the ratios of the Y signal to the total signal
#'  intensity; ratio = Y / R
#'}
#'@name XYdat
#'@usage data(XYdat)
#'@docType data
NULL

#'@title A data set with dosage scores generated by fitPoly
#'@description A data frame as saved by function fitMarkers in package
#'fitPoly
#'@details scores contains the following columns:
#'\itemize{
#'  \item marker: numeric, the sequential number of the markers in the
#'  fitMarkers input file
#'  \item MarkerName: factor, names of the markers (SNPs)
#'  \item SampleName: factor names of the individuals / DNA samples
#'  \item ratio: numeric, the ratios of the Y signal to the total signal
#'  intensity
#'  \item P0 ... P4: numeric, the probabilities that the observed signal ratio
#'  reflects a true Y allele dosage of 0 ... 4, according to the selected
#'  mixture model
#'  \item maxgeno: numeric, the dosage with the maximum probability
#'  \item maxP: numeric, the maximum probability, i.e. the probability of maxgeno
#'  \item geno: numeric, equal to maxgeno if maxP was above the specified threshold
#'  (p.threshold=0.99), else NA
#'}
#'@name scores
#'@usage data(scores)
#'@docType data
NULL


# Read data from SNP arrays **************************************************
#'@title convert a GenomeStudio FullDataTable file to the import format
#'for fitPoly
#'@description A GenomeStudio file in wide format (samples side-by-side) is
#'converted to a fitPoly input file in long format
#'@usage readFullDataTable(filename, rawXY=FALSE,
#'markergroups=list(), out, filetype=c("dat","RData")[2])
#'@param filename name of a FullDataTable tab-separated text file exported from
#'Illumina's GenomeStudio. The file must contain a column "Name" with the
#'marker names, and for each sample a pair of columns "sample.X" and "sample.Y"
#'if rawXY is FALSE, or "sample.X raw" and "sample.Y raw" (note the space)
#'if rawXY is TRUE. Further columns may be present but are not read.
#'@param rawXY if FALSE (default) the normalized .X and .Y columns are read;
#'if TRUE the "raw" columns (.X raw and .Y raw) are read instead.
#'@param markergroups a list with character vectors of marker names, or integer
#'vectors of marker numbers in file order. If the data set is large, the
#'conversion to long format may exceed memory limits. In these cases the data
#'can be split into marker groups that are converted separately and each saved
#'to a separate file. If the list is empty (default) all markers are converted
#'as one block.
#'@param out the name of an output file (without extension). If a list of
#'markergroups is given, out must be a valid file name (without extension);
#'in that case multiple output files are created with filenames in which
#'the list element numbers are appended to out. If no markergroups are specified
#'out may also be set to "" or NA; in that case no file is created and the
#'converted data are only returned as function result.\cr
#'If out is not "" or NA, then also a file <out>_meanR.dat is saved with for
#'all samples their mean R value and  number of missing data (over all markers)
#'@param filetype either "dat" or "RData" (default): the former produces
#'tab-separated text files, the latter saves RData files with the converted
#'data in a data frame with name "dat".
#'@details The wide-format input is converted to a long-format form with
#'columns MarkerName, SampleName, X, Y, R (= X + Y) and ratio (= Y / R).
#'The X and Y signal intensities are obtained from the <sample>.X and <sample>.Y
#'columns in the input data (or from the <sample>.X raw and <sample>.Y raw
#'columns if rawXY is TRUE). R and ratio are calculated from these values
#'and not read from the input data.
#'@return If no markergroups are specified, a data.frame is returned with
#'columns MarkerName, SampleName, X and Y (also if raw data are read, the
#'column names are X and Y), R (= X + Y), ratio (= Y / (X+Y) ).\cr
#'If a list of markergroups is specified the function result is NULL
#'and the converted data.frames are only saved as files.\cr
#'If the saved files are RData files, they all contain one data.frame named
#'"dat".
#'@export
readFullDataTable <- function(filename, rawXY=FALSE,
                              markergroups=list(),
                              out, filetype=c("dat","RData")[2]) {
  #Here, in contrast to readAxiomSummary, we don't allow to pass a data.frame
  #instead of the filename, because we need to decide which columns to read

  #test input:
  if (!is.list(markergroups)) {
    stop("markergroups is not list")
  }
  if (missing(out)) stop("error in readFullDataTable: out must be specified")
  if (length(out) != 1) stop("out must be one filename or NA")
  if (!is.na(out)) {
    if (!is.character(out)) {
      stop("out must be one filename")
    }
    if (is.na(out) || out == "") {
      out <- NA
      if (length(markergroups) > 0) {
        stop("when markergroups is not empty, out must be a valid filename")
      }
    } else {
      if (!check.filename(paste(out,"714x996y8.xxx",sep="_"))) {
        stop("out is not a valid filename")
      }
    }
  }

  filetype <- tolower(filetype)
  if (!is.na(out) && filetype != "dat" && filetype != "rdata") {
    stop("filetype must be 'dat' or 'RData'")
  }
  #first we read the caption line, to select the columns to read and to
  #get the unmodified sample names:
  captions <- scan(file=filename, what="character", allowEscapes=F, nlines=1,
                   quiet=TRUE, sep="\t")
  namecol <- which(captions == "Name")
  nch <- nchar(captions)
  if (rawXY) {
    right <- substr(captions, nch-5, nch)
    Xcols <- which(right == ".X raw")
    Ycols <- which(right == ".Y raw")
    samplenames <- substr(Xcols, 1, nchar(Xcols) - 6)
    if (length(Xcols) == 0 && length(Ycols) == 0) {
      #try again without space
      right <- substr(captions, nch - 4, nch)
      Xcols <- which(right == ".Xraw")
      Ycols <- which(right == ".Yraw")
      samplenames <- substr(Xcols, 1, nchar(Xcols) - 5)
    }
    if (length(Xcols) == 0 || length(Ycols) == 0 ||
        length(Xcols) != length(Ycols)) {
      stop("for each sample columns 'sample.X raw' and 'sample.Y raw' must be present")
    }
  } else {
    right <- substr(captions, nch-1, nch)
    Xcols <- which(right == ".X")
    Ycols <- which(right == ".Y")
    samplenames <- substr(captions[Xcols], 1, nch[Xcols] - 2)
    if (length(Xcols) == 0 || length(Ycols) == 0 ||
        length(Xcols) != length(Ycols)) {
      stop("for each sample columns 'sample.X' and 'sample.Y' must be present")
    }
  }
  minXY <- min(Ycols - Xcols)
  maxXY <- max(Ycols - Xcols)
  if (maxXY != minXY) {
    stop("columns must be in the same order for each sample")
  }
  if (minXY > 0) XY <- c("X", "Y") else XY <- c("Y", "X")
  #next we read the selected columns:
  colClasses <- rep("NULL", length(captions))
  colClasses[namecol] <- "character"
  colClasses[c(Xcols, Ycols)] <- "numeric"
  dat_wide <- readDatfile(filename, colClasses=colClasses)
  dat_wide <- dat_wide[dat_wide[1] != "",]
  #convert to long format:
  Rfile <- paste(out,"meanR.dat", sep="_")
  na.count <- function(x) { sum(is.na(x)) }
  if (length(markergroups) == 0) {
    #no groups, all markers in one go:
    result <- wide2longGS(GSdata=dat_wide, samplenames, XY,
                          out, rdata=(filetype=="rdata"))
    if (!is.na(out)) {
      sumsR <- tapply(result$R, INDEX=result$SampleName, FUN=sum, na.rm=TRUE)
      nasR <- tapply(result$R, INDEX=result$SampleName, FUN=na.count)
      meanR <- data.frame(SampleName=names(sumsR),
                         meanR=sumsR / (nrow(dat_wide) - nasR),
                         missing=nasR)
      writeDatfile(meanR, Rfile)
    }
    return(invisible(result))
  } else {
    #convert per marker group:
    sumsR <- NULL
    for (mg in seq_along(markergroups)) {
      if (is.numeric(markergroups[[mg]])) {
        #markergroups[[mg]] has the row numbers of this group
        subdat <- dat_wide[markergroups[[mg]],]
      } else {
        #markergroups[[mg]] has the marker names of this group
        subdat <- dat_wide[dat_wide[[1]] %in% markergroups[[mg]],]
      }
      fname <- paste(out, padded(mg, length(markergroups)), sep="_")
      dat <- wide2longGS(GSdata=subdat, samplenames, XY,
                         fname, rdata=(filetype=="rdata"))
      if (is.null(sumsR)) {
        sumsR <- tapply(dat$R, INDEX=dat$SampleName, FUN=sum, na.rm=TRUE)
        nasR <- tapply(dat$R, INDEX=dat$SampleName, FUN=na.count)
      } else {
        sumsR <- rowSums(cbind(sumsR,
                               tapply(dat$R, INDEX=dat$SampleName, FUN=sum, na.rm=TRUE)),
                         na.rm=TRUE)
        nasR <- nasR + tapply(dat$R, INDEX=dat$SampleName, FUN=na.count)
      }
    }
    meanR <- data.frame(SampleName=names(sumsR),
                        meanR=sumsR/(length(unlist(markergroups))-nasR),
                        missing=nasR)
    writeDatfile(meanR, Rfile)
    if (length(result) > 0) {
      if (!is.null(names(markergroups))) {
        names(result) <- names(markergroups)
      }
      return(invisible(result))
    } else return(invisible(NULL))
  }
} #readFullDataTable

wide2longGS <- function(GSdata, samplenames, XY,
                        outputfilename, rdata) {
  #helper function for readFullDataTable, converts one wide data frame to
  #one long data frame.
  #GSdata: a wide data frame from a GenomeStudio FullDataTable, containing
  #only the markers in this group with X and Y signal intensities (marker names
  #in column 1, 2 columns with X and Y signals per sample)
  #samplenames: all sample names in order of columns of GSdata
  #XY: c("X", "Y") or c("Y", "X"), depending on order per sample
  #outputfilename: the complete output file name, or NA
  #rdata: TRUE or FALSE
  #Function result: the same data as a long data frame, with columns R and ratio

  #we want:
  #MarkerName SampleName   X   Y
  #M1         S1         111 112
  #M2         S1         211 212
  #M3         S1         311 312
  #M1         S2         121 122
  #M2         S2         221 222
  #M3         S2         321 322

  #from GenomeStudio we have:
  #marker S1.X S1.Y S2.X S2.Y
  #M1      111  112  121  122
  #M2      211  212  221  222
  #M3      311  312  321  322

  #so we take 2 columns at a time and put them under each other
  #(Note that the result of wide2longGS is sorted first by SampleName
  # and within sample by MarkerName, in the result of wide2longAxiom
  # it is the other way round)

  markernames <- GSdata[,1]
  GSdata <- as.matrix(GSdata[,-1])
  nmrk <- nrow(GSdata)
  nsamp <- ncol(GSdata) / 2
  if (length(samplenames) != nsamp) stop("wide2longGS: samplenames don't match GSdata")
  dat <- matrix(0.0, nrow=nmrk * nsamp, ncol=2)
  if (nmrk > 0 && nsamp > 0) {
    for (s in 1:nsamp) {
      dat[(1 + (s-1) * nmrk) : (s * nmrk),] <-
        GSdata[, (1 + (s-1) * 2) : (s * 2)]
    }
  }
  dat <- data.frame(
    MarkerName=rep(markernames, times=nsamp),
    SampleName=rep(samplenames, each=nmrk),
    dat)
  #if Y appears before X, exchange:
  names(dat)[3:4] <- XY
  if (XY[2] == "X") dat <- dat[,c(1, 2, 4, 3)]
  dat$R <- dat$X + dat$Y
  dat$ratio <- dat$Y / dat$R
  #write file:
  if (!is.na(outputfilename)) {
    if (rdata) {
      save(dat, file=paste(outputfilename, ".RData", sep=""))
    } else {
      writeDatfile(dat, file=paste(outputfilename, ".dat", sep=""))
    }
  }
  dat
} #wide2longGS

#'@title convert an Affymetrix AxiomCT1.summary file to the import format
#'for fitPoly
#'@description An Affymetrix AxiomCT1.summary.txt file in wide format (samples
#'side-by-side) is converted to a fitPoly input file in long format
#'@usage readAxiomSummary(AXdata, markergroups=list(),
#'out, filetype=c("dat", "RData")[2])
#'@param AXdata name of an AxiomCT1.summary.txt tab-separated text file
#'exported from Affymetrix Power Tools (APT); or a data.frame in the same
#'format:\cr
#'the file may start with a number of comment lines (starting with "#")
#'followed by a header line of which the first caption is "probeset_id" and all
#'other captions are sample names, all ending with ".CEL". For each marker there
#'are two lines, with probeset_id ending in "-A" and "-B"; the two lines for
#'each marker are consecutive. In addition there may be any number of control
#'markers, for each of which there is only one line, with probeset_id ending
#'in "-A".
#'@param markergroups a list with character vectors of marker names, or integer
#'vectors of marker numbers in file order. If the data set is large, the
#'conversion to long format may exceed memory limits. In these cases the data
#'can be split into marker groups that are converted separately and each saved
#'to a file (or returned as list items, but if that is possible splitting into
#'groups would not be needed)
#'@param out the name of an output file (without extension). If a list of
#'markergroups is given, out must be a valid file name (without extension);
#'in that case multiple output files are created with filenames in which
#'the list element numbers are appended to out. If no markergroups are specified
#'out may also be set to "" or NA; in that case no file is created and the
#'converted data are only returned as function result.\cr
#'If out is not "" or NA, then also a file <out>_meanR.dat is saved with for
#'all samples their mean R value and  number of missing data (over all markers)
#'@param filetype either "dat" or "RData" (default): the former produces
#'tab-separated text files, the latter saves RData files with the converted
#'data in a data frame with name "dat"
#'@details The wide-format input is converted to a long-format form with
#'columns MarkerName, SampleName, X, Y, R (= X + Y) and ratio (= Y / R).
#'The X and Y signal intensities are obtained from the <marker>-A and <marker>-B
#'rows in the input data, R and ratio are calculated from these values.
#'@return If no markergroups are specified, a data.frame is returned with
#'columns MarkerName, SampleName, X, Y, R (= X + Y), ratio (= Y / (X+Y) ).\cr
#'If a list of markergroups is specified the function result is NULL
#'and the converted data.frames are only saved as files.\cr
#'If the saved files are RData files, they all contain one data.frame named
#'"dat".
#'@export
readAxiomSummary <- function(AXdata,
                             markergroups=list(),
                             out,
                             filetype=c("dat", "RData")[2]) {
  #test input:
  if (!is.list(markergroups)) {
    stop("markergroups is not list")
  }
  if (missing(out)) stop("out must be specified")
  if (length(out) != 1) stop("out must be one filename or NA")
  if (!is.na(out)) {
    if (!is.character(out)) {
      stop("out must be one filename")
    }
    if (is.na(out) || out == "") {
      out <- NA
    } else {
      if (!check.filename(paste(out,"714x996y8.xxx",sep="_"))) {
        stop("out is not valid")
      }
    }
  }
  filetype <- tolower(filetype)
  if (!is.na(out) && filetype != "dat" && filetype != "rdata") {
    stop("error in readAxiomSummary: filetype must be 'dat' or 'RData'")
  }
  if (is.character(AXdata)) {
    #filename: read file, trim ".CEL" from sample names
    #find the caption line and read sample names
    #(dashes and other unallowed characters in names are converted to dots in
    #read.table)
    con <- file(AXdata, "r")
    repeat {
      s <- readLines(con, n=1)
      if (is.null(s) || length(s)==0) {
        stop("could not read AXdata file, or end reached without finding caption line")
      }
      s <- strsplit(s, split="\t")
      if (length(s) > 0) {
        s <- gsub("(^ +)|( +$)", "", s[[1]], fixed=TRUE)
        #    a.o. remove leading & trailing whitespace
        if (nchar(s[1]) > 0 && substr(s[1], 1, 1) != "#") break;
      }
    }
    close(con)
    if (s[1] != "probeset_id") {
      stop("caption line doesn't start with probeset_id")
    }
    #remove the .CEL suffixes from the sample names if they are there:
    samplenames <- s[2:length(s)]
    nc <- nchar(samplenames)
    cel <- substr(samplenames,nc-3, nc) == ".CEL"
    samplenames[cel] <- substr(samplenames[cel], 1, nc-4)

    #read the total file, skipping the initial comment lines:
    #knowing markercount does not help because of the presence of very many
    #control markers (DQC) after the last real marker
    dat_wide <- read.table(AXdata, comment.char="#", sep="\t", header=TRUE,
                           nrows=-1)
    dat_wide[,1] <- as.character(dat_wide[,1])
    dat_wide <- dat_wide[dat_wide[,1]!="",]
    nc <- nchar(dat_wide$probeset_id)
    if (anyNA(nc) || any(nc < 3))
      stop("some data lines have a probeset_id shorter than 3 characters")
    lastchar <- substring(dat_wide$probeset_id, nc)
    # old version:
    #  # exclude the DQC lines (the block of lines with probeset_id's all ending
    #  # with "-A"):
    #  dif <- lastchar[1:(length(lastchar)-1)] != lastchar[2:length(lastchar)]
    #  # at the bottom: if ending in -A it is a DQC, else the B-line of a real mrk:
    #  dif[length(lastchar)] <- lastchar[length(lastchar)] == "B"
    #  # when a block of DQC follows a block of real markers: ok
    #  # when a block of real markers follows a block of DQC: also ok
    #  dat_wide <- subset(dat_wide, substr(probeset_id, 1, 4) != "DQC-")
    #  dat_wide <- dat_wide[dif,]; lastchar <- lastchar[dif]#
    # new version 20190221: don't assume that the DQC lines are all at the
    # bottom;
    # instead, look for all probeset_id with -B that have the same probeset_id
    # with -A on the preceding line:
    Blines <- setdiff(which(lastchar == "B"), 1)
    Alines <- Blines - 1
    Alines <- Alines[lastchar[Alines] == "A"]
    Blines <- Alines + 1
    probnames <- substr(dat_wide$probeset_id, 1, nc-2) # without -A or -B suffix
    if (!all(probnames[Alines] == probnames[Blines]))
      stop("some A/B line pairs have a different probeset name")
    dat_wide <- dat_wide[sort(c(Alines, Blines)),]
    dat_wide <- dat_wide[substr(dat_wide$probeset_id, 1, 4) != "DQC-",]
    rm(lastchar, probnames, Alines, Blines)
  } else if (is.data.frame(AXdata)) {
    if (names(AXdata)[1] != "probeset_id")
      stop("invalid AXdata data frame")
    #have (only) the correct rows been read?
    ok <- FALSE
    if (nrow(AXdata) %% 2 == 0) {
      nm <- as.character(AXdata[,1]); nc <- nchar(nm)
      arows <- seq(1, nrow(AXdata), 2)
      ok <- all(substr(nm[arows],nc[arows], nc[arows]) == "A") &&
        all(substr(nm[arows+1], nc[arows+1], nc[arows+1]) == "B")
      rm(nm, arows)
    }
    if (!ok) stop("AXdata should only contain alternating A and B rows")
    dat_wide <- AXdata
    samplenames <- names(dat_wide)[-1]
  } else stop("readAxiomSummary: AXdata must be file name or data frame")

  #test the markernames:
  if (nrow(dat_wide) %% 2 != 0) {
    stop("readAxiomSummary: odd number of marker lines")
  }
  nmrk <- nrow(dat_wide) / 2
  anames <- as.character(dat_wide[,1][rep(c(TRUE, FALSE), times=nmrk)])
  nch <- nchar(anames)
  err <- which(substr(anames, nch-1, nch) != "-A")
  if (length(err) != 0) stop("readAxiomSummary: not all odd probeset-id end with '-A'")
  anames <- substr(anames, 1, nch-2) #strip off the "-A"
  bnames <- as.character(dat_wide[,1][rep(c(FALSE, TRUE), times=nmrk)])
  nch <- nchar(bnames)
  err <- which(substr(bnames, nch-1, nch) != "-B")
  if (length(err) != 0) stop("readAxiomSummary: not all even probeset-id end with '-B'")
  bnames <- substr(anames, 1, nch-2) #strip off the "-B"
  err <- which (anames != bnames)
  if (length(err) != 0) stop("readAxiomSummary: not all even probeset-id's are paired")
  markernames <- anames; rm(nch, err, anames, bnames)

  Rfile <- paste(out,"meanR.dat", sep="_")
  na.count <- function(x) { sum(is.na(x)) }

  #convert to long format:
  #samplenames <- names(dat_wide[2:length(dat_wide)])
  if (length(markergroups) == 0) {
    #no groups, all markers in one go:
    result <- wide2longAxiom(AXdata=dat_wide, markernames, samplenames,
                             out, rdata=(filetype=="rdata"))
    if (!is.na(out)) {
      sumsR <- tapply(result$R, INDEX=result$SampleName, FUN=sum, na.rm=TRUE)
      nasR <- tapply(result$R, INDEX=result$SampleName, FUN=na.count)
      meanR <- data.frame(SampleName=names(sumsR),
                          meanR=sumsR/(length(markernames)-nasR),
                          missing=nasR)
      writeDatfile(meanR, Rfile)
    }
    return(invisible(result))
  } else {
    #convert per marker group:
    sumsR <- NULL
    result <- list()
    for (mg in seq_along(markergroups)) {
      if (is.numeric(markergroups[[mg]])) {
        #markergroups[[mg]] has the row numbers of this group
        mrknrs <- sort(c(2 * markergroups[[mg]] - 1,
                         2 * markergroups[[mg]]))
        subdat <- dat_wide[mrknrs,]
        submarkernames <- markernames[sort(markergroups[[mg]])]
      } else {
        #markergroups[[mg]] has the marker names of this group
        mrknrs <- which(markernames %in% markergroups[[mg]])
        submarkernames <- markernames[mrknrs] #now in order of markernames
        mrknrs <- sort(c(2 * mrknrs - 1, 2 * mrknrs))
        subdat <- dat_wide[mrknrs,]
      }
      fname <- paste(out, padded(mg, length(markergroups)), sep="_")
      dat <- wide2longAxiom(AXdata=subdat, submarkernames, samplenames,
                            fname, rdata=(filetype=="rdata"))
      if (is.null(sumsR)) {
        sumsR <- tapply(dat$R, INDEX=dat$SampleName, FUN=sum, na.rm=TRUE)
        nasR <- tapply(dat$R, INDEX=dat$SampleName, FUN=na.count)
      } else {
        sumsR <- rowSums(cbind(sumsR,
                               tapply(dat$R, INDEX=dat$SampleName, FUN=sum, na.rm=TRUE)),
                         na.rm=TRUE)
        nasR <- nasR + tapply(dat$R, INDEX=dat$SampleName, FUN=na.count)
      }
    }
    meanR <- data.frame(SampleName=names(sumsR),
                        meanR=sumsR/(length(unlist(markergroups))-nasR),
                        missing=nasR)
    writeDatfile(meanR, Rfile)
    return(invisible(NULL))
  }
} #readAxiomSummary


wide2longAxiom <- function(AXdata, markernames, samplenames,
                           outputfilename, rdata) {
  #helper function for readAxiomSummary, converts one wide data frame to
  #one long data frame.
  #AXdata: data frame or matrix in wide format. If a data frame, then in format
  #      of an AxiomGT1.summary file, containing probeset_id in first column
  #      and samples in the other columns; for each probe only the markers
  #      in this group with X and Y signal intensities (2 rows per
  #      SNP, probeset_id-A or -B in column 1 (not used, use markernames instead)
  #      If matrix: same format but without the probeset_id column
  #samplenames: all sample names in order of columns of AXdata
  #outputfilename: the complete output file name, or NA
  #rdata: TRUE or FALSE
  #Function result: the same data as a long data frame, with columns R and ratio

  #we want:
  #MarkerName SampleName   X   Y
  #M1         S1         111 112
  #M1         S2         121 122
  #M2         S1         211 212
  #M2         S2         221 222
  #M3         S1         311 312
  #M3         S2         321 322

  #from Axiom we have:
  #probeset_id  S1   S2
  #M1-A        111  121
  #M1-B        112  122
  #M2-A        211  221
  #M2-B        212  222
  #M3-A        311  321
  #M3-B        312  322

  #so we take 2 rows at a time, t(ranspose) them and put them under each other
  #(Note that the result of wide2longAxiom is sorted first by MarkerName
  # and within marker by SampleName, in the result of wide2longGS
  # it is the other way round)

  if (is.data.frame(AXdata))
    AXdata <- as.matrix(AXdata[,-1])
  nmrk <- nrow(AXdata) / 2
  nsamp <- ncol(AXdata)
  if (length(markernames) != nmrk) stop("wide2longAxiom: markernames don't match AXdata")
  if (length(samplenames) != nsamp) stop("wide2longAxiom: samplenames don't match AXdata")
  dat <- matrix(0.0, nrow=nmrk * nsamp, ncol=2)
  colnames(dat) <- c("X", "Y")
  if (nmrk > 0 && nsamp > 0) {
    for (r in 1:nmrk) {
      dat[(1 + (r-1) * nsamp) : (r * nsamp),] <-
        t(AXdata[(1 + (r-1) * 2) : (r * 2), ])
    }
  }
  dat <- data.frame(
    MarkerName=rep(markernames, each=nsamp),
    SampleName=rep(samplenames, times=nmrk),
    dat)
  dat$R <- dat$X + dat$Y
  dat$ratio <- dat$Y / dat$R
  #write file:
  if (!is.na(outputfilename)) {
    if (rdata) {
      save(dat, file=paste(outputfilename, ".RData", sep=""))
    } else {
      writeDatfile(dat, file=paste(outputfilename, ".dat", sep=""))
    }
  }
  dat
} #wide2longAxiom

removeNArows <- function(x, column, blanks=TRUE) {
  #x is a data.frame, column is one column number
  #return value: the same data.frame with the rows removed where the
  #column has NA or (if blanks is TRUE) only whitespace.
  #If this column is a factor or character, it will be returned as
  #a character with no starting or trailing whitespace
  if (blanks) {
    if (is.factor(x[,column])) x[,column] <- as.character(x[,column])
    if (is.character(x[,column])) {
      x[,column] <- gsub("(^ +)|( +$)", "", x[,column], fixed=TRUE)
      x[x[,column] == "", column] <- NA
    }
  }
  x[!is.na(x[,column]),]
}

check.sampletable <- function(sampletable,
                              SampleID, CustomerID, Ploidy,
                              SampleID_levels) {
  #used by splitNrenameSamples
  #sampletable: a data frame with at least the columns for SampleID and
  #             CustomerID, and for Ploidy if this is not NULL
  #SampleID, CustomerID, Ploidy: the actual names of these columns in
  #          sampletable, or NULL (only allowed for Ploidy)
  #SampleID_levels: the levels of SampleID that occur in the marker data;
  #                 all of these must also occur in the SampleID column of
  #                 sampletable
  #return value: a string (an error message) if there are problems,
  #              else a vector with the column numbers of SampleID, CustomerID,
  #              Ploidy in sampletable
  sampidcol <- which(names(sampletable) == SampleID)
  if (length(sampidcol) != 1)
    return("sampletable should have one SampleID column")
  custidcol <- which(names(sampletable) == CustomerID)
  if (length(custidcol) != 1)
    return("sampletable should have one CustomerID column")
  #do all sample names in marker data also occur in SampleID column of sampletable?
  SampleID_levels <- as.character(SampleID_levels)
  sampletable[,sampidcol] <- as.character(sampletable[,sampidcol])
  if (anyNA(match(SampleID_levels, sampletable[, sampidcol])))
    return("not all SampleNames in dat occur in sampletable column SampleID")
  if (is.null(Ploidy)) {
    ploidycol <- NA
  } else {
    ploidycol <- which(names(sampletable) == Ploidy)
    if (length(ploidycol) != 1)
      return("sampletable should have one Ploidy column")
    sampletable <- removeNArows(sampletable, ploidycol)
  }
  if (sum(is.na(sampletable[, c(sampidcol, custidcol)])) > 0)
    return("missing data in SampleID or CustomerID columns")
  #are all SampleID and CustomerID for sampletable unique?
  sampletable[, custidcol] <- as.character(sampletable[, custidcol])
  if (length(unique(sampletable[, sampidcol])) != nrow(sampletable))
    return("duplicate names in SampleID column")
  if (length(unique(sampletable[, custidcol])) != nrow(sampletable))
    return("duplicate names in CustomerID column")
  result <- c(sampidcol, custidcol, ploidycol)
  names(result) <- c("sampidcol", "custidcol", "ploidycol")
  result
} #check.sampletable

#'@title Rename samples from array codes to user codes, and split diploid
#'from polyploid samples
#'@description After conversion of array data to fitPoly format, this function
#'replaces the array codes for the samples to user codes, and it splits the
#'data into a part for diploid and one for polyploid samples if both are present.
#'@usage splitNrenameSamples(dat, sampletable, SampleID, CustomerID,
#'Ploidy=NULL, out, filetype=c("dat","RData")[2])
#'@param dat a data frame in "long" format as returned by readFullDataTable or
#'readAxiomSummary
#'@param sampletable a data frame with at least the columns for SampleID,
#'CustomerID and Ploidy (these columns may have any name, specified by the
#'next parameters)
#'@param SampleID the title of the column in sampletable containing the codes
#'for the samples in the array file - there must be a one-to-one relation
#'between the array codes and the user codes!
#'@param CustomerID the title of the column in sampletable containing the
#'user codes for the samples - there must be a one-to-one relation between the
#'array codes and the user codes!
#'@param Ploidy the title of the column in sampletable containing the ploidy
#'of the samples; default NULL means no split based on ploidy must be done. If not
#'NULL, for each different ploidy a separate data.frame will be created and
#'samples with ploidy missing or "" will be omitted from all files.
#'@param out the base filename of the output files; to this will be appended the
#'the ploidy level(s) if Ploidy is not NULL, and an extension .dat or .RData,
#'depending on filetype). Setting out to NA or "" results in no output file(s)
#'being created
#'@param filetype either "dat" or "RData" (default): the former produces tab-
#'separated text files (one for each element of the return value),
#'the latter saves one RData file containing a list named dat, identical
#'to the return value
#'@return if Ploidy not specified, the original dat data.frame with substituted
#'SampleNames; if Ploidy specified a list with one
#'element for each different value in the Ploidy column; the names of the
#'elements are then these ploidy values. Each element is a subset of the
#'original dat data.frame containing only the rows with samples of that ploidy,
#'and with substituted SampleNames
#'@export
splitNrenameSamples <- function(dat, sampletable,
                                SampleID, CustomerID, Ploidy=NULL,
                                out, filetype=c("dat","RData")[2]) {
  if (!(leftstr(tolower(filetype), 3) %in% c("dat", "rda")))
    stop("invalid filetype")
  samplecols <-
    check.sampletable(sampletable, SampleID, CustomerID, Ploidy,
                      SampleID_levels=as.character(unique(dat$SampleName)))
  if (is.character(samplecols))
    stop(samplecols)
  sampcol <- samplecols[1]; customercol <- samplecols[2]; ploidycol <- samplecols[3]
  if (missing(out)) stop("out must be specified")
  if (length(out) != 1) stop("out must be one filename or NA")
  if (!is.na(out)) {
    if (!is.character(out)) {
      stop("out must be one filename")
    }
    if (is.na(out) || out == "") {
      out <- NA
    } else {
      if (!check.filename(paste(out,"714x996y8.xxx",sep="_"))) {
        stop("out is not valid")
      }
    }
  }

  if (!is.na(ploidycol)) {
    sampletable <- removeNArows(sampletable, ploidycol)
    ploi <- sort(unique(sampletable[, ploidycol]))
    nas <- anyNA(ploi)
    ploi <- ploi[!is.na(ploi)]
    cat(paste("sampletable contains ploidy levels", paste(ploi, collapse=" ")))
    if (nas) cat(" and also contains missing ploidy values")
    cat("\n")
    cat("the samples will be split per ploidy value")
    if (nas) cat(" and samples without ploidy will be omitted")
    cat("\n")
    # reply <- readline(prompt="Proceed (Y/N) ?")
    # reply <- tolower((reply))
    # if (!(reply %in% c("y", "yes"))) stop("splitNrenameSamples: user abort")
  }
  #checks against NA or duplicates in SampleID and CustomerID are already done in
  #check.sampletable, and also that all dat$SampleName values are in SampleID

  #rename the dat$SampleName from SampleID to CustomerID:
  dat$SampleName <- factor(dat$SampleName)
  sampnames <- levels(dat$SampleName)
  levels(dat$SampleName) <-
    sampletable[match(sampnames, sampletable[, sampcol]), customercol]

  #split different ploidy levels:
  if (is.null(Ploidy)) {
    #no split
    result <- dat
  } else {
    result <- list()
    for (p in seq_along(ploi)) {
      samp <- sampletable[sampletable[,samplecols[3]] == ploi[p], samplecols[2]]
      result[[p]] <- dat[dat$SampleName %in% samp,]
      result[[p]]$SampleName <- factor(result[[p]]$SampleName) #drop unused levels
      names(result)[p] <- ploi[p]
    }
  }

  dat <- result
  #save file(s):
  if (!is.na(out)) {
    if (leftstr(tolower(filetype), 3) == "rda") {
      save(dat, file=paste(out, "RData", sep="."))
    } else {
      for (p in seq_along(dat)) {
        fname <- paste(out, "_", names(result)[p], ".dat", sep="")
        writeDatfile(dat[[p]], fname)
      }
    }
  }
  invisible(dat)
} #splitNrenameSamples

# check distribution of R per SNP *********************************************

#'@title Calculate statistics of R per marker
#'@description Calculate the min, max, mean and 50% and 95% quantiles of R
#'for each marker
#'@usage calcRstats(datpoly, out)
#'@param datpoly a data frame in long format (input format for fitPoly)
#'with at least columns MarkerName, SampleName, R
#'@param out name of output file; if NA no file is written. File contains
#'the same data as the return value
#'@details The data.frame returned by this function is used as input for
#'function selMarkers_byR
#'@return a data frame with columns MarkerName, mean, min, q50, q95, max
#'which are all statistics of the R values per marker
#'@export
calcRstats <- function(datpoly, out) {
  stats <- function(x) { #function stats nested within calcRstats
    y <- x[!is.na(x)]
    result <- c(mean(y), quantile(y, probs=c(0, 0.50, 0.95, 1), names=FALSE))
  }

  Rdist <- tapply(datpoly$R, INDEX=datpoly$MarkerName, FUN=stats)
  Rdist <- do.call(rbind, Rdist)
  colnames(Rdist) <- c("mean", "min", "median", "q95", "max")
  Rdist <- data.frame(MarkerName=rownames(Rdist), Rdist)
  rownames(Rdist) <- NULL
  if (!is.na(out) && out!="")
    writeDatfile(Rdist, out)
  invisible(Rdist)
} #calcRstats

#'@title Select markers at specified R levels
#'@description Select markers based on their R statistics, for studying
#'the relation between R level and marker quality
#'@usage selMarkers_byR(Rstats, Rlevels, mrkperlevel=1, stat="q95")
#'@param Rstats a data frame as returned by calcRstats
#'@param Rlevels a vector of R levels in increasing order (preferably from
#'the minimum to the maximum of stat over all markers in Rstats)
#'@param mrkperlevel number, default 1: the number of markers to select at
#'each of the levels in Rlevels
#'@param stat the name of one of the statistics columns in Rstats
#'@details The return value of this function is intended to be used for
#'studying the relation between the marker quality and the value of the
#'chosen R statistic, e.g. by drawing XY-plots of each of the selected
#'markers. By finding a suitable threshold for the R statistic bas markers
#'could be excluded from evaluation by fitPoly, saving time because bad
#'markers take the longest to be scored and are then often rejected anyway.
#'@return a selection from Rstats, in ascending order of column stat.
#'The first <mrkperlevel> markers with stat above each Rlevel are returned;
#'duplicated selections are removed (so the number of returned markers may be
#'less than length(Rlevels) * mrkperlevel)
#'@export
selMarkers_byR <- function(Rstats, Rlevels, mrkperlevel=1, stat="q95") {
  #Rstats: a data frame as returned by calcRstats
  #Rlevels: a numeric vector with the levels of R at which markers should be
  #selected
  #mrkperlevel: the number of markers to select per level
  #stat: the column name of the R statistic to use
  #return value: a selection from Rstats, in ascending order of column stat.
  #              The first <mrkperlevel> markers with stat above each Rlevel
  #              are returned; duplicated selections are removed (so the
  #              number of returned markers may be less than
  #              length(Rlevels) * mrkperlevel)
  statcol <- which(names(Rstats) == stat)
  if (length(statcol) != 1)
    stop(paste("column ", stat, "missing"))
  Rstats <- Rstats[order(Rstats[,statcol]),]
  Rlevels <- sort(Rlevels)
  firstsel <- integer(0)
  for (i in 1:length(Rlevels)) {
    seli <- which(Rstats[,statcol] >= Rlevels[i])
    if (length(seli) == 0) firstsel[i] <- NA else
    firstsel[i] <- min(seli)
  }
  sel <- integer(0)
  for (i in 1:mrkperlevel) sel <- c(sel, firstsel + i - 1)
  sel <- sort(sel[!is.na(sel)])
  sel <- sel[sel <= nrow(Rstats)]
  sel <- sel[!duplicated(sel)]
  Rstats[sel,]
} #selMarkers_byR

# XY-plotting *****************************************************************

get.genocol.old <- function() {
  #return value: a set of 6 pastel colors to represent dosages,
  #from light red (nulliplex) to light blue (quadruplex) and grey for missing
  c(rgb(red=255,green=170,blue=170,maxColorValue=255), #1: light red
    rgb(red=255,green=211,blue=155,maxColorValue=255), #2: "burlywood1", lightred-orange
    rgb(red=250,green=187,blue=255,maxColorValue=255), #3: light bluish magenta
    rgb(red=152,green=245,blue=255,maxColorValue=255), #4: "cadetblue1", light bluish cyan
    rgb(red=180,green=180,blue=255,maxColorValue=255), #5: light blue
    "grey")                                            #6: NA
} #get.genocol.old

#'@title Generate a set of genotype (dosage) colors for XY-plots
#'@description For each of the dosages 0 to ploidy a color is generated,
#'and an additional one for samples with no dosage assigned.
#'@usage get.genocol(ploidy, pastel=TRUE)
#'@param ploidy any ploidy level >= 2
#'@param pastel if TRUE (default) light (pastel) colors are generated, else more
#'intense colors
#'@details the colors range from red (first item, dosage 0) via blue
#'(dosage ploidy/2) to green (dosage ploidy); the color for missing dosage is
#'grey
#'@return A vector of <ploidy + 2> color values; items 1 .. ploidy+1 for
#'dosage 0 .. ploidy and item ploidy+2 for missing dosages
#'@export
get.genocol <- function(ploidy, pastel=TRUE) {
  #return value: a set of (ploidy+1) colors to represent dosages,
  #from red (item 1, nulliplex) via blue (dosage ploidy/2) to green (item ploidy+1, dosage=ploidy), and grey (item ploidy+1) for
  #missing values
  #This order of colors was chosen because the transition colors between
  #red and green, which are not very clear, are avoided
  minint <- ifelse(pastel, 160, 0) #minimum intensity per color for pastels
  oddploidy <- ploidy %% 2 == 1
  if (oddploidy) ploidy <- 2 * ploidy
  RGB <- matrix(0, nrow=ploidy+2, ncol=3)
  colnames(RGB) <- c("red", "green", "blue")
  #RGB[1,] <-          c(   255  , minint,   minint) #red, dosage=0
  #RGB[ploidy/2+1,] <- c(minint/2,    255, minint/2) #green, dosage=ploidy/2
  #RGB[ploidy+1,] <-   c(minint  , minint,      255) #blue, dosage=ploidy
  RGB[1,] <-          c(255,      minint,        minint)              #red, dosage=0
  RGB[ploidy/2+1,] <- c(minint  , minint,        255)                 #blue, dosage=ploidy/2
  RGB[ploidy+1,] <-   c(minint/2, ifelse(pastel, 255, 200), minint/2) #green, dosage=ploidy
  RGB[ploidy+2,] <- col2rgb("grey")                                   #grey, dosage NA
  if (ploidy > 2) for (p in 2:(ploidy/2)) {
    #between nulliplex (red) and ploidy/2 (green): orange-brownish
    fact <- (p-1)/(ploidy/2)
    RGB[p,] <- RGB[1,] + round(fact * (RGB[ploidy/2+1,] - RGB[1,]))
  }
  if (ploidy > 2) for (p in (ploidy/2+2):(ploidy)) {
    #between ploidy/2 (green) and ploidy (blue): cyan-ish
    fact <- (p-(ploidy/2+1))/(ploidy/2)
    RGB[p,] <- RGB[ploidy/2+1,] + round(fact * (RGB[ploidy+1,] - RGB[ploidy/2+1,]))
  }
  if (oddploidy) RGB <- RGB[c(seq(1, ploidy+1, by=2), nrow(RGB)),]
  result <- apply(RGB, 1, function(x) rgb(red=x[1], green=x[2], blue=x[3],
                                          maxColorValue=255))
} #get.genocol

calc.R.threshold <- function(Rquantile, Rfactor=0.5, lowR=0) {
  #Rquantile: a vector of R quantile values per marker (e.g. 95% quantiles)
  #R threshold is Rfactor * Rquantile, but at least lowR
  #This function is used by makeFitPolyFiles and drawXYplots
  result <- Rfactor * Rquantile
  result[result<lowR] <- lowR
  result
} #calc.R.threshold

#'@title Draws an XY-plot showing allele signals and assigned dosages
#'@description Draws an XY-plot for one markers showing the X and Y signals
#'of each sample and their assigned dosages
#'@usage XY_plot(title="", XYdat, shift=0, ploidy=NULL,
#'genocol="grey", pch=1, cex=1,
#'sel.samples=as.character(unique(XYdat$SampleName)), omit.pch=".",
#'omit.col=NULL, sample.groups=list(), groups.col="black", groups.pch=1,
#'groups.cex=1, groups.rnd=FALSE, R.thresholds=NA, R.col="black", R.lty=1)
#'@param title the main title above the plot
#'@param XYdat a data frame with at least columns SampleName, X and Y;
#'column geno (if present) is also used. Contains data for all samples, one SNP.
#'geno is the dosage (0 .. <ploidy>), with NA or <ploidy+1> for missing dosage
#'info
#'@param shift a single integer, default 0: by how much should the geno
#'be shifted?
#'@param ploidy a single integer specifying the ploidy; default NA, only
#'needed if there is a column geno in XYdat
#'@param genocol a vector of color values to be used for plotting
#'the sel.samples according on their geno (dosage) value; if only one value is
#'given (default) all samples are plotted in that color
#'@param pch the plot character to plot the sel.samples; default 1 is an
#'open circle
#'@param cex the relative size of the sample symbols
#'@param sel.samples character vector (not a factor) with the names of the
#'important samples: those that must be plotted in colors genocol and symbol pch
#'@param omit.pch the plot character to use for the other samples, default a dot
#'@param omit.col vector of two colors to use for the other samples; the first
#'color is used for sample with a geno (dosage) value, the second color for
#'unscored samples; recycled if needed, with NA (default) the non-selected
#'samples are not plotted
#'@param sample.groups a list specifying samples to be highlighted in a
#'different color and/or symbol and/or size. For each group of samples the list
#'has one vector of sample names; the list may also be empty
#'@param groups.col a vector or color values, one for each item (vector of
#'sample names) in sample.groups; recycled if shorter than sample.groups
#'@param groups.pch a vector of plot symbols, one for each item in
#'sample.groups; recycled if shorter than sample.groups
#'@param groups.cex a vector of relative symbol sizes, one for each item in
#'sample.groups; recycled if shorter than sample.groups
#'@param groups.rnd FALSE (default) or TRUE. If FALSE, all samples in
#'sample.groups are drawn in the order in which they appear in XYdat; if TRUE
#'they are drawn in a random order. Note that the samples are never drawns in
#'group order (except is the samples are already in group order in XYdat and
#'groups.rnd=FALSE).
#'@param R.thresholds a vector of thresholds for R to plot; is NA (default)
#'no R thresholds are plotted
#'@param R.col a vector of color values for drawing R thresholds, one for
#'each value of R.thresholds; recycled if needed
#'@param R.lty a vector of line types for drawing the R thresholds, one for
#'each value of R.thresholds; recycled if needed
#'@return The function produces an XY-plot and returns NULL
#'@export
XY_plot <- function(title="", #the main title above the plot
                    XYdat, #the data, with or without geno column, for 1 SNP,
                    shift=0,
                    ploidy=NULL,
                    genocol="grey", #if a vector with colors then it should have
                    #               colors for geno 0 .. ploidy and one color for
                    #               missing geno values; i.e. its length should
                    #               be either 1 or ploidy+2
                    pch=1, #point character, 1 is open circle
                    cex=1, #relative size, 1 is default
                    sel.samples=as.character(unique(XYdat$SampleName)), #the important samples
                    omit.pch=".", #the pch for the other samples
                    omit.col=NULL, #the colors for the other samples
                    sample.groups=list(), #groups of samples to highlight
                    groups.col="black", #a color for each group
                    groups.pch=1, #a symbol for each group
                    groups.cex=1, #relative size of symbols for each group
                    groups.rnd=FALSE, #randomize order in which group samples are drawn
                    R.thresholds=NA, #thresholds for R (default must ne NA, not NULL!)
                    R.col="black",   #colors of the R thresholds
                    R.lty=1) {  # line type of R thresholds(1=solid, 2=dashed, 3=dotted)
  #XYdat: a data frame with at least columns SampleName, X and Y; column geno
  #       (if present) is also used. Contains data for all samples, one SNP.
  #       geno is the dosage (0 is nulliplex .. <ploidy> is <ploidy>plex),
  #       with NA or <ploidy+1> for missing dosage info
  #Produces a single XY plot of signal intensities, for one marker.
  #First all samples are plotted either all in one color (grey by default) or
  #with different colors for each dosage (if a column geno is present in XYdat)
  #On top of this, several groups of samples can be plotted (again), with one
  #color per group; e.g. to highlight parents, subpopulations etc)
  #Also the symbol and the size of the symbol to print for the backgound
  #samples and the samples in each group can be specified.
  #Finally different R threshold values can be indicated by lines, of which the
  #color and line type can be specified.
  max.xy <- max(c(XYdat$X, XYdat$Y), na.rm=TRUE)
  if (is.na(max.xy) || max.xy <= 0) {
    plot(0~0, col="white", main=title, xlim=c(0,1), ylim=c(0,1))
  } else {
    if ("geno" %in% names(XYdat)) {
      if (is.null(ploidy)) stop("XY_plot: ploidy must be specified")
      mvgeno <- ploidy+1
      if (!all(XYdat$geno %in% c(0:mvgeno, NA)))
        stop("XY_plot: geno contains invalid values")
      XYdat$geno[XYdat$geno == mvgeno] <- NA
      XYdat$geno <- shiftdosages(XYdat$geno, shift, ploidy)
      XYdat$geno[is.na(XYdat$geno)] <- mvgeno
      genocol <- rep(genocol, length.out=ploidy+2)
    } else {
      #no geno column
      mvgeno <- length(genocol) - 1 #the last value in genocol represents missing genotype
      XYdat$geno <- rep(mvgeno, nrow(XYdat)) #all unknown
    }
    #at this point there are no samples with geno NA left,
    #all those that were NA have been assigned mvgeno
    if (is.null(omit.col)) {
      #then we make all genotyped omitted samples black and the
      #non-genotyped omitted samples grey:
      omit.col <- c(rep("black", length(genocol)-1), "grey")
    } else omit.col <- rep(omit.col, length.out=length(genocol))
    if (is.factor(XYdat$SampleName)) {
      omit.samples <- setdiff(levels(XYdat$SampleName), sel.samples)
    } else omit.samples <- setdiff(unique(XYdat$SampleName), sel.samples)
    # first make the plot with all omitted samples in omit.col:
    # (those with missing geno first, then those with nonmissing geno)
    XY.nogeno <- XYdat[XYdat$geno == mvgeno,, drop=FALSE]
    XY.geno <- XYdat[XYdat$geno < mvgeno,, drop=FALSE]
    XYsub <- XY.nogeno[XY.nogeno$SampleName %in% omit.samples,, drop=FALSE]
    plot(XYsub$Y ~ XYsub$X,
         col=omit.col[XYsub$geno + 1], pch=omit.pch, cex=cex,
         main=title,
         xlim=c(0, max.xy), ylim=c(0, max.xy), xlab="X", ylab="Y")
    XYsub <- XY.geno[XY.geno$SampleName %in% omit.samples,, drop=FALSE]
    points(XYsub$Y ~ XYsub$X,
           col=omit.col[XYsub$geno + 1], pch=omit.pch, cex=cex)
    # next make the plot with all non-omitted samples in genocol:
    # (those with missing geno first, then those with nonmissing geno)
    XYsub <- XY.nogeno[XY.nogeno$SampleName %in% sel.samples,, drop=FALSE]
    points(XYsub$Y ~ XYsub$X,
           col=genocol[XYsub$geno + 1], pch=pch, cex=cex)
    XYsub <- XY.geno[XY.geno$SampleName %in% sel.samples,, drop=FALSE]
    points(XYsub$Y ~ XYsub$X,
           col=genocol[XYsub$geno + 1], pch=pch, cex=cex)
    #next, draw the groups of samples
    if (is.list(sample.groups) &&
        length(sample.groups) > 0) {
      groups.col <- rep(groups.col, length.out=length(sample.groups))
      groups.pch <- rep(groups.pch, length.out=length(sample.groups))
      groups.cex <- rep(groups.cex, length.out=length(sample.groups))
      # old version: draw groups successively, last group on top
      #for (grp in seq_along(sample.groups)) {
        #XYsub <- XYdat[XYdat$SampleName %in% sample.groups[[grp]],]
        #points(XYsub$Y ~ XYsub$X,
        #       col=groups.col[grp], pch=groups.pch[grp], cex=groups.cex[grp])
      #}
      # new version 20180219: draw all at the same time
      # (order of points is order in the XYdat data frame,
      # there is no group order with one obscuring the other)
      XYsub <- XYdat[XYdat$SampleName %in% unlist(sample.groups),, drop=FALSE]
      samp.col <- samp.pch <- samp.cex <- numeric(nrow(XYsub))
      for (grp in seq_along(sample.groups)) {
        grpsamp <- XYsub$SampleName %in% sample.groups[[grp]]
        samp.col[grpsamp] <- groups.col[grp]
        samp.pch[grpsamp] <- groups.pch[grp]
        samp.cex[grpsamp] <- groups.cex[grp]
      }
      samporder <- seq_len(nrow(XYsub))
      if (groups.rnd) {
        samporder <- sample(samporder)
      }
      points(XYsub$Y[samporder] ~ XYsub$X[samporder], col=samp.col[samporder],
             pch=samp.pch[samporder], cex=samp.cex[samporder])
    }
    #next, draw the R.threshold lines at X + Y = R.threshold:
    R.col <- rep(R.col, length.out=length(R.thresholds))
    R.lty <- rep(R.lty, length.out=length(R.thresholds))
    for (i in 1:length(R.thresholds)) {
      lines(c(0, R.thresholds[i]) ~ c(R.thresholds[i], 0),
            col=R.col[i], lty=R.lty[i]) #no line if R.threshold[i] is NA
    }
  }
  invisible(NULL)
} #XY_plot

#'@title Draws a series of pages, each with 6 XY-plots showing allele signals
#'and assigned dosages
#'@description Draws 6 XY-plots per page for a series of markers; each XY-plot is
#'drawn by function XY_plot
#'@usage drawXYplots(dat, markers=NA, out, genocol="grey", pch=1, cex=1,
#'sel.samples=as.character(unique(XYdat$SampleName)), omit.pch=".",
#'omit.col=c(rep("black", length(genocol)-1), "grey"), sample.groups=list(),
#'groups.col="black", groups.pch=1, groups.cex=1, groups.rnd=FALSE,
#'R.col="black", R.lty=1, drawRthresholds=FALSE,
#'Rthreshold.param=c(0.95, 0.5, 0), ploidy)
#'@param dat a data.frame with at least columns MarkerName, SampleName, X and Y;
#'column geno (if present) is also used
#'@param markers either a vector with names of markers to plot or a
#'data.frame with at least column MarkerName containing the names of the markers
#'to plot; default NA means all markers in dat. If markers is a data.frame that
#'also has a column shift, the geno values in dat will be shifted accordingly.
#'This allows to use (a selection from) the output of checkF1 (with or without
#'parameter shiftmarkers) as input for drawXYplots.
#'@param out base for filenames of output, will be extended with
#'_pagenumber.png; may include a path, but the directory where the plot files
#'are to be saved must already exist
#'@param genocol a vector of color values to be used for plotting
#'the sel.samples according on their geno (dosage) value; if only one value is
#'given (default) all samples are plotted in that color
#'@param pch the plot character to plot the sel.samples; default 1 is an
#'open circle
#'@param cex the relative size of the sample symbols
#'@param sel.samples character vector (not a factor) with the names of the
#'important samples: those that must be plotted in colors genocol and symbol pch
#'@param omit.pch the plot character to use for the other samples, default a dot
#'@param omit.col vector of two colors to use for the other samples; the first
#'color is used for sample with a geno (dosage) value, the second color for
#'unscored samples
#'@param sample.groups a list specifying samples to be highlighted in a
#'different color and/or symbol and/or size. For each group of sample the list
#'has one vector of sample names; the list may also be empty
#'@param groups.col a vector or color values, one for each item (vector of
#'sample names) in sample.groups; recycled if shorter than sample.groups
#'@param groups.pch a vector of plot symbols, one for each item in
#'sample.groups; recycled if shorter than sample.groups
#'@param groups.cex a vector of relative symbol sizes, one for each item in
#'sample.groups; recycled if shorter than sample.groups
#'@param groups.rnd FALSE (default) or TRUE. If FALSE, all samples in
#'sample.groups are drawn in the order in which they appear in XYdat; if TRUE
#'they are drawn in a random order. Note that the samples are never drawns in
#'group order (except if the samples are already in group order in XYdat and
#'groups.rnd=FALSE).
#'@param R.col a vector of color values for drawing R thresholds, one for
#'each value of Rthreshold.param; recycled if needed
#'@param R.lty a vector of line types for drawing the R thresholds, one for
#'each value of Rthreshold.param; recycled if needed
#'@param drawRthresholds FALSE (default) or TRUE: whether R thresholds should be
#'drawn
#'@param Rthreshold.param either a list of vectors each of length 3 (for
#'multiple R thresholds) or one vector of length 3 (for one R threshold). Each
#'vector defines one R threshold and is based on a specified quantile of the
#'distribution of R values for the current marker. The first number in each
#'vector is the R quantile, the second is a number to multiply that R quantile
#'with, and the third is the minimum value of the result. The default of
#'c(0.95, 0.5, 0) means that 0.5 * the 95% quantile of R is used (which is
#'always higher than 0, the minimum result). This is often a good cut-off
#'value to discard samples, or to signal markers with many samples below that
#'value.
#'@param ploidy a single integer specifying the ploidy, only needed if
#'dat contains a column geno
#'@return The function produces a series of pages with plots and returns NULL
#'@export
drawXYplots <- function(dat, markers=NA, out,
                        genocol="grey", #colors for dosage 0-ploidy, NA=ploidy+1
                        pch=1, #point character, 1 is open circle
                        cex=1, #relative size, 1 is default
                        sel.samples=as.character(unique(XYdat$SampleName)),
                        omit.pch=".", #point character for omitted samples
                        omit.col=c(rep("black",length(genocol)-1),"grey"),
                        sample.groups=list(), #groups of samples to highlight
                        groups.col="black", #a color for each group
                        groups.pch=1, #a symbol for each group
                        groups.cex=1, #relative size of symbols for each group
                        groups.rnd=FALSE, #randomize order in which group samples are drawn
                        R.col="black",   #colors of the R thresholds
                        R.lty=1,
                        drawRthresholds=FALSE,
                        Rthreshold.param=c(0.95, 0.5, 0),
                        ploidy) {
  #Draws a series of pages with 6 XYplots each, one XYplot per marker in
  #markers. The individual plots are drawn by function XY_plot.
  #dat: a data frame with at least columns MarkerName, SampleName, X and Y;
  #     column geno (if present) is also used.
  #markers: a vector with names of markers to plot, or marker numbers;
  #         default NA means all
  #out: base for filenames of output, will be extended with _pagenumber.png
  #drawRthresholds: if TRUE, R threshold lines are drawn according to:
  #Rthreshold.param: either a list of vectors each of length 3 (for multiple
  #                  R thresholds) or one vector of length 3 (for one R threshold).
  #                  the first element of the vector is the R quantile P-value,
  #                  the 2nd and 3rd are Rfactor and lowR as in
  #                  calc.R.threshold.
  #The other parameters are as in XY_plot

  #close devices - extra strong:
  allDevOff()
  allmarkers <- as.character(unique(dat$MarkerName))
  if (length(allmarkers) == 0) stop("no markers in dat")
  if ("geno" %in% names(dat) && missing(ploidy))
    stop("ploidy must be specified")
  if (is.data.frame(markers)) {
    #markernames must be in a column MarkerName, and if there is a column
    #shift, this shift is retained
    #This new (20170113) alternative allows to supply a selection of rows from
    #a checkF1 output to specify markers and possibly shifts
    if (!("MarkerName" %in% names(markers)))
      stop("no column 'MarkerName' in markers data.frame")
    if (!is.character(markers$MarkerName) && !is.factor(markers$MarkerName))
      stop("column Markername of markers must contain marker names")
    notNA <- !is.na(markers$MarkerName)
    if (sum(notNA) == 0) {
      #no markers specified, use all:
      mrkdf <- data.frame(MarkerName=allmarkers, shift=0, stringsAsFactors=FALSE)
    } else {
      if ("shift" %in% names(markers)) {
        if (!("geno" %in% names(dat))) stop("markers has a column 'shift' but dat has no column 'geno'")
        if (anyNA(markers$shift[notNA])) stop("no missing values allowed in column 'shift' of markers data.frame")
        if (!is.numeric(markers$shift) || any(as.integer(markers$shift) != markers$shift))
          stop("column 'shift' of markers data.frame must contain integer shift values")
        if (!is.null(ploidy) && !all(markers$shift %in% (-ploidy):ploidy))
          stop("invalid values in column 'shift' of markers data.frame")
        mrkdf <- data.frame(MarkerName=trim_shf(markers$MarkerName[notNA]),
                            shift=markers$shift[notNA], stringsAsFactors=FALSE)
      } else mrkdf <- data.frame(MarkerName=trim_shf(markers$MarkerName[notNA]),
                                 shift=0, stringsAsFactors=FALSE)
    }
  } else {
    #markers not a data.frame
    if (is.list(markers) || !is.vector(markers))
      stop("markers must be a character vector of marker names, or a data.frame with a column MarkerName")
    #now markers is a vector
    notNA <- !is.na(markers)
    if (sum(notNA) == 0) {
      #no markers specified, use all:
      mrkdf <- data.frame(MarkerName=allmarkers, shift=0, stringsAsFactors=FALSE)
    } else {
      if (!is.character(markers) && !is.factor(markers))
        stop("markers must contain marker names")
      mrkdf <- data.frame(MarkerName=trim_shf(markers[notNA]), shift=0, stringsAsFactors=FALSE)
    }
  }
  #now the mrkdf data.frame has been created; exclude markers not in allmarkers (but retain duplicated):
  pres <- mrkdf$MarkerName %in% allmarkers
  if (sum(pres) < nrow(mrkdf)) {
    if (sum(pres) == 0) stop("none of the specified markers occurs in dat")
    mrkdf <- mrkdf[pres,]
  }

  #create small subset of dat to select markers from:
  if (nrow(mrkdf) < 0.5 * length(allmarkers)) {
    dat <- dat[dat$MarkerName %in% mrkdf$MarkerName,]
  }
  maxpage <- ceiling(nrow(mrkdf) / 6)
  m.start <- 1
  m.end <- nrow(mrkdf)
  nr_on_page <- 0
  for (m in m.start:m.end) {
    mrknr <- padded(m, m.end)
    nr_on_page <- nr_on_page+1
    if (nr_on_page == 7) nr_on_page <- 1
    if (nr_on_page == 1) {
      pagenr <- padded((m-1) %/% 6 + 1, maxpage)
      png(paste(out, "_", pagenr, ".png", sep=""),
          width=7.2, height=11.7, units="in", res=96)
      layout(matrix(1:6, byrow=TRUE, ncol=2)) # 3*2 plots on page
    }
    XYdat <- dat[dat$MarkerName == mrkdf$MarkerName[m],]
    if ("R" %in% names(XYdat)) XYR <- XYdat$R else XYR <- XYdat$X + XYdat$Y
    if (drawRthresholds) {
      if (!is.list(Rthreshold.param)) Rthreshold.param <- list(Rthreshold.param)
      R.thresholds <- rep(NA, length(Rthreshold.param))
      for (i in seq_along(Rthreshold.param)) {
        Rquantile <- quantile(XYR, Rthreshold.param[[i]][1], na.rm=TRUE)
        R.thresholds[i] <- calc.R.threshold(Rquantile,
                                            Rthreshold.param[[i]][2],
                                            Rthreshold.param[[i]][3])
      }
    }
    else R.thresholds <- NA
    XY_plot(title=paste(mrknr, mrkdf$MarkerName[m]), XYdat=XYdat,
            shift=mrkdf$shift[m], ploidy=ploidy,
            genocol=genocol, pch=pch, cex=cex,
            sel.samples=sel.samples, omit.pch=omit.pch, omit.col=omit.col,
            sample.groups=sample.groups, groups.col=groups.col,
            groups.pch=groups.pch, groups.cex=groups.cex, groups.rnd=groups.rnd,
            R.thresholds=R.thresholds, R.col=R.col, R.lty=R.lty)

    #if last plot on page, close device:
    if (nr_on_page == 6 || m >= m.end) {
      #close devices - extra strong:
      allDevOff()
    }
  } # for m
  invisible(NULL)
} #drawXYplots

allDevOff <- function() {
  #close all graphics devices - extra strong
  #dvof <- 2
  #while (!is.null(dvof) && dvof > 1) dvof <- dev.off()
  while (!is.null(dev.list())) dev.off()
}

# fitPoly input and output files **********************************************

#'@title Make input files for fitPoly containing only selected rows and columns
#'@description Remove entire markers and ratios for samples within markers
#'from fitPoly inputfiles if their R levels are below some thresholds;
#'also keep only the MarkerName, SampleName, X, Y, R and ratio columns
#'@usage makeFitPolyFiles(datpoly, datdiplo=NA, out,
#'filetype=c("dat", "RData")[2], Rquantile=0.95, marker.threshold,
#'Rthreshold.param=c(0.95, 0.5, 0))
#'@param datpoly data.frame as produced by readFullDataTable, readAxiomSummary
#'or splitNrenameSamples, with at least columns MarkerName, SampleName, R and
#'ratio
#'@param datdiplo data.frame as produced by splitNrenameSamples, with at least
#'columns MarkerName, SampleName, R and ratio, containing the data for the
#'diploid samples; or NA (default)
#'@param out output files will be named out + _poly.RData and _diplo.RData
#'(or .dat if filetype is set to "dat"). If NA no files are written
#'@param filetype "RData" (default) or "dat": the format to save the data in
#'@param Rquantile a value between 0 and 1: the R (= X + Y) quantile on which
#'selection of markers is based
#'@param marker.threshold the minimum value of the Rquantile for a marker to be
#'selected
#'@param Rthreshold.param a vector of length 3, defining an R threshold for
#'selecting samples within each marker.
#'The first number is the R quantile, the second is a number to multiply that
#'R quantile with, and the third is the minimum value of the result.
#'The default of c(0.95, 0.5, 0) means that 0.5 * the 95% quantile of R is used
#'(which is always higher than 0, the minimum result). This is often a good
#'cut-off value to discard samples within a marker.
#'@details All rows for markers not meeting the marker.threshold for the
#'R quantile are removed from the original data.frames. For samples within
#'the remaining markers where the R value is below the marker-specific threshold
#'the ratio is set to NA, the row for that sample is not removed.
#'The same thresholds are applied to datpoly and datdiplo.
#'@return a list with two elements: datpoly is a filtered version of
#'parameter datpoly, and if param datdiplo was specified the second element
#'of the returned list is datdiplo: a filtered version of parameter datdiplo.
#'Either or both may have 0 rows left after filtering. If no datdiplo was
#'specified the second element of the list is NA.\cr
#'If parameter out is specified the element(s) of this list are also saved
#'as files.
#'@export
makeFitPolyFiles <- function(datpoly, datdiplo=NA,
                             out, filetype=c("dat", "RData")[2],
                             Rquantile=0.95, marker.threshold,
                             Rthreshold.param=c(0.95, 0.5, 0)) {

  cols <- match(c("MarkerName", "SampleName", "R", "ratio"),
                names(datpoly))
  if (anyNA(cols))
    stop("required columns missing from datpoly")
  savediplo <- is.data.frame(datdiplo)
  if (savediplo) {
    cols <- match(c("MarkerName", "SampleName", "R", "ratio"),
                  names(datdiplo))
    if (anyNA(cols))
      stop("required columns missing from datdiplo")
  }
  out <- out[1]
  if (!is.na(out)) {
    testfilename <- ifelse(filetype=="dat",
                           paste(out,"poly.dat", sep="_"),
                           paste(out,"poly.RData", sep="_"))
    if (!check.filename(testfilename))
      stop("out is not valid")
  }

  quant <- function(x, Rquantile) { #function nested within make.fitPoly.files
    quantile(x, probs=Rquantile, na.rm=TRUE, names=FALSE)
  }

  Rq <- tapply(datpoly$R, INDEX=datpoly$MarkerName, FUN=quant, Rquantile)

  #remove all markers where Rq is below threshold,
  #and retain only relevant columns:
  #(works well even if marker names are not valid identifiers)
  selmarkers <- names(Rq)[!is.na(Rq) & Rq >= marker.threshold]
  #remove these markers from datpoly:
  datpoly <- datpoly[datpoly$MarkerName %in% selmarkers,]
  datpoly$MarkerName <- factor(datpoly$MarkerName) #drop unused levels
  #remove same markers from Rq:
  Rq <- Rq[!is.na(Rq) & Rq >= marker.threshold]
  #calculate R threshold for samples within each remaining marker:
  if (Rthreshold.param[1] == Rquantile) Sq <- Rq else {
    Sq <- tapply(datpoly$R, INDEX=datpoly$MarkerName, FUN=quant,
                 Rquantile=Rthreshold.param[1])
  }
  Sq <- calc.R.threshold(Sq, Rthreshold.param[2], Rthreshold.param[3])
  Tq <- Sq[match(datpoly$MarkerName, names(Sq))]
  datpoly$ratio[datpoly$R < Tq] <- NA
  result <- list(datpoly=datpoly)
  if (savediplo) {
    datdiplo <- datdiplo[datdiplo$MarkerName %in% selmarkers,]
    datdiplo$MarkerName <- factor(datdiplo$MarkerName) #drop unused levels
    Tq <- Sq[match(datdiplo$MarkerName, names(Sq))]
    datdiplo$ratio[datdiplo$R < Tq] <- NA
    result$datdiplo <- datdiplo
  } else result$datdiplo <- NA

  if (!is.na(out)) {
    if (filetype=="dat") {
      writeDatfile(datpoly, paste(out,"poly.dat", sep="_"))
      if (savediplo) {
        writeDatfile(datdiplo, paste(out,"diplo.dat", sep="_"))
      }
    } else {
      #write RData files:
      save(datpoly, file=paste(out,"poly.RData", sep="_"))
      if (savediplo) {
        save(datdiplo, file=paste(out,"diplo.RData", sep="_"))
      }
    }
  }
  invisible(result)
} #makeFitPolyFiles

# some general functions ******************************************************

check.filename <- function(filename, overwrite=TRUE) {
  #Check if a filename is valid for writing by trying to create the file
  #If the filename contains a path, result will only be TRUE if the whole
  #path already exists
  #overwrite: if TRUE (default) an existing file of that name will be deleted
  #           (if it is not protected by being locked, read-only, owned by
  #           another user etc)
  filename <- filename[1] # we test only the first filename
  if (!overwrite && file.exists(filename)) return(FALSE)
  tryCatch(
    { suppressWarnings(
      if (file.create(filename)) {
        file.remove(filename)
        TRUE
      } else FALSE)
    }, error = function(e) { FALSE }
  )
} #check.filename

#'@title User-friendly wrapper for read.table
#'@description  A wrapper for read.table that has default parameter values for
#'reading tab-separated files as used in packages fitPoly and fitPolyTools
#'@usage readDatfile(file, header=TRUE, sep="\t", check.names=FALSE, ...)
#'@param file the name of the file which the data are to be read from
#'@param header a logical value indicating whether the file contains the names
#'of the data.frame columns as its first line
#'@param sep the field separator character
#'@param check.names logical. If FALSE (default), column names are not checked.
#'This is important if column names are the names of samples, markers etc that
#'may not be syntactically valid variable names.\cr
#'If TRUE then the names of the variables in the data frame are checked to
#'ensure that they are syntactically valid variable names. If necessary they
#'are adjusted (by make.names) so that they are, and also to ensure that there
#'are no duplicates
#'@param ... Further arguments to be passed to read.table
#'@return A data.frame containing a representation of the data in the file
#'@export
readDatfile <- function(file, header=TRUE, sep="\t", check.names=FALSE, ...) {
  #calls read.table with changed defaults, for reading a tab-separated
  #text file, and without modifying column names
  #The only compulsory argument is the file name
  return(read.table(file=file, header=header, sep=sep,
                    check.names=check.names, ...))
} #readDatfile

#'@title User-friendly wrapper for write.table
#'@description A wrapper for write.table that has default parameter values for
#'writing tab-separated files as used in packages fitPoly and fitPolyTools
#'@usage writeDatfile(x, file, quote=FALSE, sep="\t", na="",
#'row.names=FALSE, col.names=TRUE, logical_01=FALSE, ...)
#'@param x the object to be written, preferably a matrix or data frame.
#'If not, it is attempted to coerce x to a data frame.
#'@param file either a character string naming a file or a connection open for
#'writing. "" indicates output to the console.
#'@param quote a logical value (TRUE or FALSE) or a numeric vector. If TRUE,
#'any character or factor columns will be surrounded by double quotes.
#'If a numeric vector, its elements are taken as the indices of columns to
#'quote. In both cases, row and column names are quoted if they are written.
#'If FALSE (default), nothing is quoted.
#'@param sep the field separator string. Values within each row of x are
#'separated by this string.
#'@param na the string to use for missing values in the data
#'@param row.names either a logical value indicating whether the row names of x
#'are to be written along with x, or a character vector of row names to be
#'written.
#'@param col.names either a logical value indicating whether the column names
#'of x are to be written along with x, or a character vector of column names
#'to be written.
#'@param logical_01 FALSE (default) or TRUE; if TRUE, logical columns of x
#'are converted to 0/1/NA numeric values before writing the file
#'@param ... Further arguments to be passed to write.table
#'@return NULL
#'@export
writeDatfile <- function(x, file, quote=FALSE, sep="\t", na="",
                         row.names=FALSE, col.names=TRUE,
                         logical_01=FALSE, ...) {
  #calls write.table with changed defaults, for writing a tab-separated
  #text file
  #The only compulsory arguments are x and the file name
  if (logical_01) {
    #convert logicals to 0/1/NA for writing text file
    if (is.logical(x)) x[x] <- 1 else if (is.data.frame(x)) {
      for (col in seq_len(ncol(x))) {
        if ("logical" %in% class(x[, col])) {
          xc <- x[,col]
          xc[xc] <- 1 #with NAs present we cannot do x[x[,col],col] <- 1
          x[,col] <- xc
        }
      }
    }
  }
  write.table(x=x, file=file, quote=quote, sep=sep, na=na,
              row.names=row.names, col.names=col.names, ...)
} #writeDatfile

#padded returns a string representing positive integer x left-padded with 0's
#to the length of integer maxx or of max(x)
padded <- function(x, maxx=0) {
  formatC(x, width=nchar(max(x, maxx, na.rm=TRUE)), flag="0")
}



#*************************************************************************

gcd <- function(x,y) {
  #return the greatest common divisor of two integers
  #(vectorized: x and y may be vectors)
  r <- x %% y;
  return(ifelse(r, gcd(y, r), y))
} #gcd

gcd_all <- function(x) {
  #return the greatest common divisor of all elements of vector x
  if (length(x) == 1) x else {
    if (length(x) == 2) gcd(x[1], x[2]) else
      gcd_all(c(gcd(x[1], x[2]), x[3:length(x)]))
  }
} #gcd_all

polygamfrq <- function(ploidy, dosage) {
  #get the integer ratios of gametes with dosage 0:(ploidy/2)
  #from an autopolyploid parent (polysomic inheritance)
  #with given ploidy and dosage
  #the result can be converted to fractions by y <- x/(sum(x))
  gp <- ploidy/2 #gamete ploidy

  #using hypergeometric distribution:
  #dhyper(0:gp, dosage, ploidy-dosage, gp)

  #using n-over-k: function choose, in order to return integer ratios:
  A <- choose(dosage, 0:gp)
  B <- choose(ploidy-dosage, gp-(0:gp))
  #C <- choose(ploidy, gp)
  #A * B / C this would give the answer as fractions,
  # like dhyper(0:gp, dosage, ploidy-dosage, gp)
  D <- round(A*B)
  d <- gcd_all(D[D>0])
  round(D/d)
}

digamfrq <- function(ploidy, dosage) {
  #get the integer ratios of gametes with dosage 0:(ploidy/2)
  #from an allopolyploid parent (disomic inheritance)
  #with given ploidy and dosage
  gp <- ploidy/2 #gamete ploidy
  #first step: calculate all possible compositions of diploid parental genomes
  #i.e. the numbers of subgenomes with nulliplex, simplex and duplex dosages
  maxdup <- dosage %/% 2 #max nr of subgenomes with duplex dosage
  mindup <- max(dosage-gp, 0) #min nr of subgenomes with duplex dosage
  genomes <- matrix(integer((maxdup-mindup+1)*3), ncol=3)
  colnames(genomes) <- c("nulli", "sim", "du")
  for (i in 1:nrow(genomes)) {
    genomes[i,3] <- mindup + i - 1
    genomes[i,2] <- dosage - 2*genomes[i,3]
    genomes[i,1] <- gp - sum(genomes[i, 2:3])
  }
  #next step: per genomic composition calculate the gamete distributions:
  gam <- matrix(numeric(nrow(genomes)*(gp+1)), nrow=nrow(genomes)) #all 0
  colnames(gam) <- 0:gp
  for (i in 1:nrow(genomes)) {
    mingamdos <- genomes[i,3] #each duplex genome contributes 1
    maxgamdos <- gp-genomes[i,1] #each nulliplex genome contributes 0
    bincoeff <- choose(genomes[i,2], 0:genomes[i,2])

    # to return fractions:
    #gam[i, (mingamdos+1):(maxgamdos+1)] <- bincoeff / sum(bincoeff)

    #to return integer ratios:
    gam[i, (mingamdos+1):(maxgamdos+1)] <- bincoeff
    #the smallest bincoeff is always 1 so division by the gcd is not needed
  }
  #list(genomes, gam)
  gam
} #digamfrq

makeProgeny <- function(gametes1, gametes2) {
  #gametes1, gametes2: vectors of the integer ratios of the gametes
  #                    produced by the two parents
  #returns a vector with the integer ratios of the F1 progeny
  gp1 <- length(gametes1)-1 #gametes1 ploidy
  gp2 <- length(gametes2)-1 #gametes2 ploidy
  m <- matrix(0, nrow=gp1+gp2+1, ncol=gp2+1)
  for (i in 1:(gp2+1))
    m[i:(i+gp1), i] <- gametes1 * gametes2[i]
  D <- round(rowSums(m)) #the integer ratios of the F1 generation
  names(D) <- 0:(length(D)-1)
  d <- gcd_all(D[D>0])
  round(D/d) #the integer ratios of the F1 generation simplified
} #makeProgeny

# calcSegtypeInfo ***********************************************************
#'@title Build a list of segregation types
#'@description For each possible segregation type in an F1 progeny with given
#'parental ploidy (and ploidy2, if parent2 has a different ploidy than parent1)
#'information is given on the segregation ratios, parental dosages and whether
#'the segregation is expected under polysomic, disomic and/or mixed inheritance.
#'
#'@usage calcSegtypeInfo(ploidy, ploidy2=NULL)
#'
#'@param ploidy The ploidy of parent 1 (must be even, 2 (diploid) or larger).
#'@param ploidy2 The ploidy of parent 2. If omitted (default=NULL) it is
#'assumed to be equal to ploidy.
#'@details The names of the segregation types consist of a short sequence of
#'digits (and sometimes letters), an underscore and a final number. This is
#'interpreted as follows, for example segtype 121_0: 121 means that there
#'are three consecutive dosages in the F1 population with frequency ratios 1:2:1,
#'and the 0 after the underscore means that the lowest of these dosages is
#'nulliplex. So 121_0 means a segregation of 1 nulliplex : 2 simplex : 1 duplex.
#'A monomorphic F1 (one single dosage) is indicated as e.g. 1_4 (only one
#'dosage, the 4 after the underscore means that this is monomorphic quadruplex).
#'If UPPERCASE letters occur in the first part of the name these are interpreted
#'as additional digits with values of A=10 to Z=35, e.g. 18I81_0 means a
#'segregation of 1:8:18:8:1 (using the I as 18), with the lowest dosage being
#'nulliplex.\cr
#'With higher ploidy levels higher numbers (above 35) may be required.
#'In that case each unique ratio number above 35 is assigned a lowercase letter.
#'E.g. one segregation type in octaploids is 9bcb9_2: a 9:48:82:48:9
#'segregation where the lowest dosage is duplex.\cr
#'Segregation types with more than 5 dosage classes are considered "complex"
#'and get codes like c7e_1 (again in octoploids): this means a complex type
#'(the first c) with 7 dosage classes; the e means that this is the fifth
#'type with 7 classes. Again the _1 means that the lowest dosage is simplex.
#'It is always possible (and for all segtype names with lowercase letters it is
#'necessary) to look up the actual segregation ratios in the intratio item
#'of the segtype. For octoploid segtype c7e_1 this shows 0:1:18:69:104:69:18:1:0
#'(the two 0's mean that nulli- and octoplexes do not occur).
#'
#'@return A list with for each different segregation type (segtype) one item.
#'The names of the items are the names of the segtypes.
#'Each item is itself a list with components:
# (alternatives for itemize: enumerate and describe,
# see https://cran.r-project.org/web/packages/roxygen2/vignettes/formatting.html)
#'\itemize{
#'\item{freq: a vector of the ploidy+1 fractions of the dosages in the F1}
#'\item{intratios: an integer vector with the ratios as the simplest integers}
#'\item{expgeno: a vector with the dosages present in this segtype}
#'\item{allfrq: the allele frequency of the dosage allele in the F1}
#'\item{polysomic: boolean: does this segtype occur with polysomic inheritance?}
#'\item{disomic: boolean: does this segtype occur with disomic inheritance?}
#'\item{mixed: boolean: does this segtype occur with mixed inheritance (i.e. with
#'polysomic inheritance in one parent and disomic inheritance in the other)?}
#'\item{pardosage: integer matrix with 2 columns and as many rows as there
#'are parental dosage combinations for this segtype;
#'each row has one possible combination of dosages for
#'parent 1 (1st column) and parent 2 (2nd column)}
#'\item{parmode: logical matrix with 3 columns and the same number of rows as
#'pardosage. The 3 columns are named polysomic, disomic and mixed and
#'tell if this parental dosage combination will generate this
#'segtype under polysomic, disomic and mixed inheritance}
#'}
#'@examples
#'si4 <- calcSegtypeInfo(ploidy=4) # two 4x parents: a 4x F1 progeny
#'print(si4[["11_0"]])
#'
#'si3 <- calcSegtypeInfo(ploidy=4, ploidy2=2) # a 4x and a diplo parent: a 3x progeny
#'print(si3[["11_0"]])
#'@export
calcSegtypeInfo <- function(ploidy=4, ploidy2=NULL) {
  if (ploidy %% 2 != 0) stop("calcSegtypeInfo: odd ploidy not allowed")
  if (is.null(ploidy2)) ploidy2 <- ploidy else
    if (ploidy2 %% 2 != 0) stop("calcSegtypeInfo: odd ploidy2 not allowed")
  ploidyF1 <- (ploidy + ploidy2) / 2

  result <- list()
  #we check all parental combinations and enumerate the segtypes:
  for (p1 in 0:ploidy) {
    if (p1 %in% c(0, 1, ploidy-1, ploidy)) {
      p1ir <- matrix(polygamfrq(ploidy, p1), nrow=1)
      p1mode <- 0 #0=don't care
    } else {
      p1ir <- rbind(polygamfrq(ploidy, p1), digamfrq(ploidy, p1))
      p1mode <- c(1, rep(2, nrow(p1ir)-1)) #1=polysomic, 2=disomic
    }
    if (ploidy2 == ploidy) maxp2 <- p1 else maxp2 <- ploidy2
    for (p2 in 0:maxp2) {
      if (p2 %in% c(0, 1, ploidy2-1, ploidy2)) {
        p2ir <- matrix(polygamfrq(ploidy2, p2), nrow=1)
        p2mode <- 0 #0=don't care
      } else {
        p2ir <- rbind(polygamfrq(ploidy2, p2), digamfrq(ploidy2, p2))
        p2mode <- c(1, rep(2, nrow(p2ir)-1)) #1=polysomic, 2=disomic
      }
      for (i1 in 1:nrow(p1ir)) for (i2 in 1:nrow(p2ir)) {
        ir <- makeProgeny(p1ir[i1,], p2ir[i2,])
        #check if this segtype already exists in result:
        r <- length(result)
        while (r > 0 && sum(ir != result[[r]]$intratios) > 0) r <- r-1
        if (r == 0) {
          #new segtype:
          r <- length(result) + 1
          result[[r]] <- list()
          result[[r]]$freq <- ir/sum(ir)
          result[[r]]$intratios <- ir
          result[[r]]$expgeno <- which(ir > 0) - 1; names(result[[r]]$expgeno) <- NULL
          result[[r]]$allfrq <- sum(result[[r]]$freq * 0:ploidyF1) / ploidyF1
          result[[r]]$polysomic <- FALSE
          result[[r]]$disomic <- FALSE
          result[[r]]$mixed <- FALSE
          result[[r]]$pardosage <- matrix(c(p1, p2), nrow=1)
          colnames(result[[r]]$pardosage) <- c("P1", "P2")
          result[[r]]$parmode <- matrix(FALSE, nrow=1, ncol=3)
          colnames(result[[r]]$parmode) <- c("polysomic", "disomic", "mixed")
          p <- 1
        } else {
          #segtype already existed, see if parental combination
          #already existed:
          p <- nrow(result[[r]]$pardosage)
          while (p > 0 &&
                 sum(c(p1,p2) != result[[r]]$pardosage[p,]) > 0) p <- p-1
          if (p == 0) {
            #this parental dosage combination is new
            p <- nrow(result[[r]]$pardosage) + 1
            result[[r]]$pardosage <- rbind(result[[r]]$pardosage, c(p1, p2))
            result[[r]]$parmode <- rbind(result[[r]]$parmode,
                                         rep(FALSE, 3))
          }
        }
        result[[r]]$parmode[p, 1] <- result[[r]]$parmode[p, 1] ||
          (p1mode[i1] != 2 && p2mode[i2] != 2) #is polysomic?
        result[[r]]$parmode[p, 2] <- result[[r]]$parmode[p, 2] ||
          (p1mode[i1] != 1 && p2mode[i2] != 1) #is disomic?
        result[[r]]$parmode[p, 3] <- result[[r]]$parmode[p, 3] ||
          (p1mode[i1] == 0 || p2mode[i2] == 0 || p1mode[i1] != p2mode[i2]) #mixed?
      } #for i1 for i2
    } #for p2
  } #for p1
  #next we check if each segtype occurs with poly- or disomic or
  #mixed inheritance,
  #we add all reciprocal parental dosages,
  #we order all pardosage and parmode in order of increasing p1,
  #we check which intratios occur in segtypes with less than complex classes,
  #and we compile the numdosages and firstdosage statistics:
  numdosages <- firstdosage <- maxratio <- integer(length(result))
  complex <- 6 #the number of dosage classes from which we consider the segtype
  #             complex, i.e. we don't show the full segregation in the name
  simpleratios <- integer(0)
  for (r in seq_along(result)) {
    #calculate poly- and disomic and mixed:
    result[[r]]$polysomic <- sum(result[[r]]$parmode[, 1]) > 0
    result[[r]]$disomic <- sum(result[[r]]$parmode[, 2]) > 0
    result[[r]]$mixed <- sum(result[[r]]$parmode[, 3]) > 0
    #add the reverse of parents with unequal dosage:
    if (ploidy2 == ploidy) {
      #if ploidy2 not null, all parental combinations already done
      uneq <- result[[r]]$pardosage[,1] != result[[r]]$pardosage[,2]
      if (sum(uneq) > 0) {
        #there are rows with unequal parental dosage
        result[[r]]$pardosage <- rbind(result[[r]]$pardosage,
                                       result[[r]]$pardosage[uneq, 2:1])
        result[[r]]$parmode <- rbind(result[[r]]$parmode,
                                     result[[r]]$parmode[uneq,])
      }
    }
    #sort pardosage and parmode in order of increasing P1 dosage:
    or <- order(result[[r]]$pardosage[,1])
    result[[r]]$pardosage <- result[[r]]$pardosage[or,, drop=FALSE]
    result[[r]]$parmode <- result[[r]]$parmode[or,, drop=FALSE]
    #add the intratios in not complex:
    if (length(result[[r]]$expgeno) < complex)
      simpleratios <- union(simpleratios, result[[r]]$intratios)
    #calculate statistics for later ordering of result list:
    numdosages[r] <- length(result[[r]]$expgeno)
    firstdosage[r] <- result[[r]]$expgeno[1]
    maxratio[r] <- max(result[[r]]$intratios)
  }
  #next we reorder result from simple to complex segtypes
  or <- order(numdosages, maxratio, firstdosage)
  result <- result[or]
  #finally we assign names according to numdosages and firstdosage:
  numdosages <- numdosages[or]
  firstdosage <- firstdosage[or]
  maxratio <- maxratio[or]
  simple <- max(which(numdosages < complex))
  alldigits <- c("0","1","2","3","4","5","6","7","8","9", LETTERS)
  #we translate the intratios of non-complex segtypes to a sequence of digits
  #and uppercase letters (for ratios 10:35); if any ratios > 35 occur (as in
  #octo- and dodecaploids) we assign one of the lowercase letters for each of
  #these dosages (and assume there are less than 27 unique ratios above 35).
  #Just using the lowercase letter for ratios 36:61 does not work as already
  #with octoploids some intratios of non-complex segtypes are above 61.
  simpleratios <- sort(simpleratios[simpleratios > 35])
  for (r in 1:simple) {
    chr <- character(complex-1)
    name <- ""
    ir <- result[[r]]$intratios
    for (i in 1:(ploidyF1+1)) {
      if (ir[i] > 0) {
        if (ir[i] <= 35) {
          name <- paste(name, alldigits[ir[i] + 1], sep="")
        } else name <- paste(name, letters[which(simpleratios == ir[i])], sep="")
      }
    }
    names(result)[r] <- paste(name, firstdosage[r], sep="_")
  }
  if (complex <= max(numdosages)) for (n in complex:max(numdosages)) {
    #we want to assign codes like c6a_1, where:
    #c just means complex
    #6 is the number of different dosages in the segtype
    #c6a, c6b etc are segtypes with different ratio combinations, all with 6 dosages
    #_1 means that the first dosage is simplex
    selnum <- numdosages == n
    maxrat <- sort(unique(maxratio[selnum]))
    for (m in 1:length(maxrat)) {
      seltype <- selnum & maxratio == maxrat[m]
      if (length(maxrat) == 1) name <- paste("c", n, sep="") else
        name <- paste("c", n, letters[m], sep="")
      firstrange <- sort(unique(firstdosage[selnum]))
      for (f in firstrange) {
        sel <- seltype & firstdosage == f
        names(result)[sel] <- paste(name, f, sep="_")
      }
    }
  }
  result
} #calcSegtypeInfo

#'@title conversion of segtype code to F1 segregation ratios
#'@description Produce a matrix with the F1 segregation ratios (as integers)
#'for all segregation types for the given ploidy
#'@usage listSegtypes(ploidy, ploidy2=NULL)
#'@param ploidy the ploidy of parent 1 of the F1, or of the F1 itself if
#'ploidy2 is NULL
#'@param ploidy2 the ploidy of parent 2 of the F1
#'@details If ploidy2 is not NULL, ploidy and ploidy2 are the ploidy levels of
#'the two parents, and both must be even. If ploidy2 is NULL, ploidy is
#'the ploidy of the F1; if even, both parents are assumed to have the same
#'ploidy; if odd, parent 1 and parent 2 are asumed to have ploidy-1 and ploidy+1.
#'This function calls calcSegtypeInfo and is a convenience function; it is not
#'used by any other functions. For more information, including parental
#'dosages for each segregation, use calcSegtypeInfo and segtypeInfoSummary.
#'@return a matrix with one row for each segregation type and one column for
#'each possible F1 dosage, with the integer ratios of the dosages for each
#'segregation type.
#'@export
listSegtypes <- function(ploidy, ploidy2=NULL) {
  if (is.null(ploidy2)) {
    if (ploidy %% 2 != 0) {
      #odd ploidy
      ploidy2 <- ploidy +1
      ploidy <- ploidy - 1
    } else ploidy2 <- ploidy
  } else {
    if (ploidy %% 2 != 0 || ploidy2 %%2 != 0)
      stop("listSegtypes: odd parental ploidy not allowed")
  }
  ploidyF1 <- (ploidy+ploidy2)/2
  seginfo <- calcSegtypeInfo(ploidy=ploidy, ploidy2=ploidy2)
  result <- matrix(nrow=length(seginfo), ncol=ploidyF1 + 1,
                   dimnames=list(names(seginfo), 0:ploidyF1))
  for (i in 1:length(seginfo))
    result[i,] <- seginfo[[i]]$intratios
  result
} #listSegtypes

# selSegtypeInfo ***********************************************************
#'@title Restrict a list of segregation types to specified inheritance modes
#'@description From a list of segregation types as produced by calcSegtypeInfo,
#'this function selects only those segtypes that occur with polysomic,
#'disomic and/or mixed inheritance if the corresponding parameters are set to
#'TRUE, and from those segtypes only the parental dosages with the same
#'restrictions are retained.
#'@usage selSegtypeInfo(segtypeInfo, polysomic, disomic, mixed,
#'selfing=FALSE)
#'@param segtypeInfo a list as returned by calcSegtypeInfo
#'@param polysomic If TRUE all segtypes with poly TRUE are retained, and from
#'those segtypes all parental dosage combinations with parmode[,1] TRUE
#'@param disomic If TRUE all segtypes with di TRUE are retained, and from
#'those segtypes all parental dosage combinations with parmode[,2] TRUE
#'@param mixed If TRUE all segtypes with mixed TRUE are retained, and from
#'those segtypes all parental dosage combinations with parmode[,3] TRUE
#'@param selfing if TRUE only segtypes are retained that can be the result of
#'a selfing, i.e. both parental dosages equal (although it allows disomic
#'segtypes where the parental distribution of alleles over the subgenomes is
#'different, e.g. aa/bb x ab/ab -> 121_1); default FALSE
#'@return A list like segtypeInfo, modified as specified by parameters polysomic,
#'disomic and mixed
#'@export
selSegtypeInfo <- function(segtypeInfo, polysomic, disomic, mixed,
                           selfing=FALSE) {
  result <- list()
  selected <- integer(0)
  for (s in seq_along(segtypeInfo)) {
    if ((polysomic && segtypeInfo[[s]]$polysomic) ||
        (disomic && segtypeInfo[[s]]$disomic) ||
        (mixed && segtypeInfo[[s]]$mixed)) {
      stype <- segtypeInfo[[s]]
      selpar <- (polysomic & stype$parmode[,1]) |
        (disomic & stype$parmode[,2]) |
        (mixed & stype$parmode[,3])
      stype$pardosage <- stype$pardosage[selpar,, drop=FALSE]
      stype$parmode <- stype$parmode[selpar,, drop=FALSE]
      if (selfing) {
        selpar <- stype$pardosage[,1] == stype$pardosage[,2]
        stype$pardosage <- stype$pardosage[selpar,, drop=FALSE]
        stype$parmode <- stype$parmode[selpar,, drop=FALSE]
      }
      if (nrow(stype$pardosage) > 0) {
        selected <- c(selected, s)
        r <- length(result) + 1
        result[[r]] <- stype
      }
    }
  }
  names(result) <- names(segtypeInfo)[selected]
  result
} #selSegtypeInfo

# segtypeInfoSummary *********************************************************
#'@title Summarize the segtypeInfo list
#'@description From a list of segregation types as produced by calcSegtypeInfo
#'or selSegtypeInfo, produce a data frame that only lists the parental
#'dosage combinations for each segtype and whether these produce the
#'segtype under polysomic, disomic and/or mixed inheritance.
#'Useful to quickly look up which segtypes match a given parental dosage
#'combination.
#'@usage segtypeInfoSummary(segtypeInfo)
#'
#'@param segtypeInfo a list as returned by calcSegtypeInfo or selSegtypeInfo
#'@return A data frame summarizing the segtypeInfo list, with columns:
#'\itemize{
#'\item{segtype: the name of the segregation type (see details of
#'calcSegtypeInfo)}
#'\item{segtypenr: the sequential number of the segtype in parameter segtypeInfo}
#'\item{parent1, parent2: dosages of the two parents}
#'\item{par.poly, par.di, par.mixed: whether these parental dosages produce this
#'segtype under polysomic, disomic and/or mixed inheritance}
#\item{segtype.poly, segtype.di, segtype.mixed: whether this segtype does occur
#under polysomic, disomic and/or mixed inheritance}
#'}
#'@export
segtypeInfoSummary <- function(segtypeInfo) {
  totrows <- 0
  for (i in 1:length(segtypeInfo))
    totrows <- totrows + nrow(segtypeInfo[[i]]$pardosage)
  result <- data.frame (segtype=character(totrows),
                        segtypenr=integer(totrows),
                        parent1=integer(totrows),
                        parent2=integer(totrows),
                        par.poly=integer(totrows),
                        par.di=integer(totrows),
                        par.mixed=integer(totrows),
                        #segtype.poly=integer(totrows),
                        #segtype.di=integer(totrows),
                        #segtype.mixed=integer(totrows),
                        stringsAsFactors=FALSE)
  r <- 1
  for (i in 1:length(segtypeInfo)) {
    for (p in 1:nrow(segtypeInfo[[i]]$pardosage)) {
      result$segtype[r] = names(segtypeInfo)[i]
      result$segtypenr[r] = i
      result$parent1[r] = segtypeInfo[[i]]$pardosage[p,1]
      result$parent2[r] = segtypeInfo[[i]]$pardosage[p,2]
      result$par.poly[r] = segtypeInfo[[i]]$parmode[p,1]
      result$par.di[r] = segtypeInfo[[i]]$parmode[p,2]
      result$par.mixed[r] = segtypeInfo[[i]]$parmode[p,3]
      #result$segtype.poly[r] = 0 + segtypeInfo[[i]]$polysomic
      #result$segtype.di[r] = 0 + segtypeInfo[[i]]$disomic
      #result$segtype.mixed[r] = 0 + segtypeInfo[[i]]$mixed
      r <- r + 1
    }
  }
  result
} #segtypeInfoSummary

getMatchParents <- function(parGeno, seginfoItem) {
  #parGeno: integer vector of length 2 with the two parental (consensus)
  #         dosages (can each be NA)
  #seginfoItem: one of the items (segtypes) of a list as returned by
  #          calcSegtypeInfo
  #does the specified segtype match the parental genotypes?
  p <- which(!is.na(parGeno))
  if (length(p) == 0) {
    #both parental genotypes unknown
    return("Unknown")
  } else if (length(p) == 1) {
    #only one of the parental genotypes known
    if (sum(seginfoItem$pardosage[, p] == parGeno[p]) > 0)
      return("OneOK") else return("No")
  } else {
    #both parental genotypes known
    i <- which(seginfoItem$pardosage[, 1] == parGeno[1])
    if (length(i) == 1 && parGeno[2] == seginfoItem$pardosage[i, 2])
      return("Yes") else return("No")
  }
} #getMatchParents

getConsensusGeno <- function(geno, maxNAfrac=0.499,
                             lowconf.NAfrac=0.751) {
  #geno: a vector with scores (0..ploidy or NA) for all samples of one parent.
  #maxNAfrac: the maximum fraction of missing scores allowed to assign a
  #        high-confidence consensus geno score; if more are
  #        missing a missing geno score is returned (the default requires
  #        MORE than half of the samples to be scored, so one missing out of
  #        2 samples already causes a missing geno)
  #lowconf.NAfrac: if the fraction missing scores is more than maxNAfrac but
  #        less than lowconf.NAfrac, or if there is only one sample,
  #        a lowconf.geno score is assigned
  #returns a list with 3 components:
  # $geno: the consensus of the dosages in vector geno. NA if geno is empty,
  #        if all elements of geno are NA, if there are different non-NA
  #        values in geno, or if the fraction of NA values in geno larger than
  #        maxNAfrac
  # $lowconf.geno: if there is no conflict and the fraction of missing scores
  #        is between maxNAfrac and lowconf.NAfrac, the consensus is assigned
  #        as a "low-confidence" geno - this needs confirmation from the
  #        F1 segregation to be accepted
  # $conflict: TRUE if there are different non-NA values in geno, else FALSE
  # $NAfrac: the fraction NA in geno (0.5 if length(geno)==0)
  nwgeno <- NA; nwlogeno <- NA; conflict <- FALSE; NAfrac <- 0
  if (length(geno) == 0) NAfrac <- 0.5 else {   #we need a NAfrac value for q2
    NAfrac <- sum(is.na(geno) / length(geno))
    if (NAfrac < 1.0) {
      if (length(geno) == 1) nwlogeno <- geno else {
        if (max(geno, na.rm=TRUE) != min(geno, na.rm=TRUE)) {
          conflict <- TRUE
        } else  if (NAfrac <= maxNAfrac) {
          nwgeno <- min(geno, na.rm=TRUE)
        } else if (NAfrac <= lowconf.NAfrac) {
          nwlogeno <- min(geno, na.rm=TRUE)
        }
      }
    }
  }
  list(geno=nwgeno, lowconf.geno=nwlogeno, conflict=conflict, NAfrac=NAfrac)
} #getConsensusGeno

shiftdosages <- function(dosages, shift, ploidy) {
  #dosages: vector of integer dosages, shift: one shift value
  dosages <- dosages + shift
  below <- !is.na(dosages) & dosages < 0
  dosages[below] <- 0
  above <- !is.na(dosages) & dosages > ploidy
  dosages[above] <- ploidy
  dosages
} #shiftdosages

checkSelfing <- function(parent1, parent2) {
  parent1 <- sort(as.character(parent1[!is.na(parent1)]))
  parent2 <- sort(as.character(parent2[!is.na(parent2)]))
  length(parent1) == length(parent2) && length(parent1) > 0 &&
    sum(parent1 != parent2) == 0
} #checkSelfing

linterpol <- function(x, pnt1, pnt2) {
  #linear interpolation: returns the y values matching the x values (vector)
  #on the line through points pnt1 and pnt2 (both are vectors of length 2)
  #possible error if both X-coordinates equal not checked or caught
  a <- (pnt2[2] - pnt1[2]) / (pnt2[1] - pnt1[1])
  b <- pnt1[2] - a * pnt1[1]
  a * x + b
} #linterpol

calc_q1 <- function(Pvalue, fracInvalid, bestfit, bestParentfit,
                    Pvalue_threshold, fracInvalid_threshold) {
  #q1 is a measure of the fit of selfit. It is based on Pvalue and the fraction
  #of valid F1 scores. If either is below their threshold the q1 value is 0.
  #Also, if bestParentfit is different from bestfit this indicates a segregation
  #distortion.
  #x compares bestfit and bestParentfit:
  if (bestfit == bestParentfit) {
    x <- 1
  } else {
    x <- 0.5
  }
  #Version 20150220: nonzero but low y values also below P=0.001 depending
  #on threshold:
  if (Pvalue < Pvalue_threshold) y <- 0 else {
    if (Pvalue < 0.01) {
      if (Pvalue < 0.001) y <- linterpol(Pvalue, c(0,0), c(0.001, 0.05)) else
        y <- linterpol(Pvalue, c(0.001, 0.05), c(0.01, 0.6))
    } else if (Pvalue < 0.15) {
      if (Pvalue < 0.05) y <- linterpol(Pvalue, c(0.01, 0.6), c(0.05, 0.8)) else
        y <- linterpol(Pvalue, c(0.05, 0.8), c(0.15, 1))
    } else y <- 1
  }
  #z is determined by fraction F1 scores in valid categories for the selfit
  #segtype. z is equal to 1-fracInvalid, but below 1-frqinvalid.threshold z=0.
  z <- ifelse(fracInvalid > fracInvalid_threshold, 0, 1 - fracInvalid)
  x * y * z
} #calc_q1

calc_q2 <- function(par.conflicts, par.NAfrac, matchParents,
                    parentsScoredWithF1) {
  #q2 is about how well the parents are scored and fit with the selfit segtype
  if (parentsScoredWithF1) {
    #q2 is determined not by matchParents but by the amount
    #of conflicts and missing values in the parents:
    q2 <- max(0, 1 - 0.5 * sum(par.conflicts) - 0.5 * sum(par.NAfrac))
    #In the version of 20150220 matchParents cannot be "No": selfit is always
    #bestParentfit (sometimes after promoting lowconf parental scores).
    #Therefore the remaining matchParents codes Unknown, OneOK and Yes are
    #correlated with the q2 score, no point to use the matchParents
  } else {
    # !parentsScoredWithF1
    # In this case par.conflicts and par.NAfrac are not used and the information
    # on the match between parents and F1 segtype has to come from matchParents.
    # Note that matchParents=="No" cannot occur in the versions since 20150220.
    q2 <- match(matchParents,
                c("No", "Unknown", "OneOK", "Yes")) #1:4 or NA
    #q2 <-  c(0, 0.5, 0.9, 1)[q2] #translate into 0..1 values
    q2 <-  c(0, 0.65, 0.9, 1)[q2] #translate into 0..1 values
    #version of 20160324: the penalty for Unknown is less if the parents
    #are not scored with the F1: there is still the lack of information, but
    #the lack of scores does not mean that the (F1) genotyping is less reliable.
  }
  q2
} #calc_q2

calc_q3 <- function(F1.NAfrac, fracNA_threshold) {
  #q3 is about the fraction missing F1 scores.
  #If that is larger than NA.threshold, q3=0;
  #else q3 depends on the fraction scored F1's
  if (F1.NAfrac > fracNA_threshold) {
    q3 <- 0
  } else if (fracNA_threshold > 0.2) {
    if (F1.NAfrac >= 0.2) {
      # 0.2<=F1.NAfrac<threshold
      q3 <- linterpol(F1.NAfrac, c(0.2, 0.5), c(fracNA_threshold, 0))
    } else {
      # 0<=F1.NAfrac<0.2
      q3 <- linterpol(F1.NAfrac, c(0, 1), c(0.2, 0.5))
    }
  } else {
    # fracNA_threshold <= 0.2, 0<=F1.NAfrac<threshold
    #: let q3 go from 1 at frac=0 to 0.5 at frac=threshold:
    q3 <- linterpol(F1.NAfrac, c(0, 1), c(fracNA_threshold, 0.5))
  }
  q3
} #calc_q3

calc_qall <- function(Pvalue_threshold, fracInvalid_threshold,fracNA_threshold,
                      Pvalue, fracInvalid, F1.NAfrac,
                      matchParents,
                      bestfit, bestParentfit,
                      par.conflicts, par.NAfrac,
                      critweight,
                      parentsScoredWithF1) {
  # 3 thresholds: can be disabled by setting to 0, 1 and 1
  # Pvalue, fracInvalid and matchParents are properties of the selfit segtype
  # F1.NAfrac is the fraction missing scores among F1 samples,
  # bestfit and bestParentfit are two segtype, selfit is one of these
  # par.conflicts is a logical vector: is there a conflict for P1, P2?
  # par.NA is the number of NA scores for the samples of P1, P2
  # critweight is a numeric vector of length 3 (3 quality criteria)
  #            with their weights, or NA (in that case qall = q1*q2*q3, not a
  #            weighted average)
  # return value: a vector of length 4 or 5. The last (two) components of the
  #               vector are qall_mult (obtained by multiplication of q1..q3),
  #               and if critweights is supplied, qall_weights (a weighted
  #               average of q1..q3)
  q <- numeric(ifelse (length(critweight) == 3, 5, 4)) #last (two) are qall
  q[1] <- calc_q1(Pvalue, fracInvalid, bestfit, bestParentfit,
                  Pvalue_threshold, fracInvalid_threshold)
  q[2] <- calc_q2(par.conflicts, par.NAfrac, matchParents,
                  parentsScoredWithF1)
  q[3] <- calc_q3(F1.NAfrac, fracNA_threshold)
  q[4] <- prod(q[1:3]) #q[4] is qall_mult
  if (length(critweight) == 3) {
    q[5] <- sum(q[1:3] * critweight) / sum(critweight) #q[5] is qall_weights
  }
  q
} #calc_qall

# checkF1 *******************************************************************
#'@title Identify the best-fitting F1 segregation types
#'@description For a given set of F1 and parental samples, this function
#'finds the best-fitting segregation type. It can perform a dosage shift prior
#'to selecting the segregation type.
#'
#'@usage checkF1(scores, parent1, parent2, F1, ancestors=character(0),
#'polysomic, disomic, mixed, ploidy, ploidy2, outfile,
#'critweight=c(1.0, 0.4, 0.4),
#'scores_long=TRUE, Pvalue_threshold=0.0001, fracInvalid_threshold=0.05,
#'fracNA_threshold=0.25, shiftmarkers, parentsScoredWithF1,
#'shiftParents=parentsScoredWithF1, showAll=FALSE, append_shf=FALSE)
#'
#'@param scores A data frame as read from the scores file produced by function
#'fitMarkers of package fitPoly (or a subset with at least columns
#'MarkerName, SampleName and geno), or a data frame as returned by function
#'scores2wide. In the first case (default) parameter scores_long must be TRUE,
#'in the second case it must be FALSE.
#'@param parent1 character vector with the sample names of parent 1
#'@param parent2 character vector with the sample names of parent 2
#'@param F1 character vector with the sample names of the F1 individuals
#'@param ancestors character vector with the sample names of any other
#'ancestors or other samples of interest. The dosages of these samples will
#'be shown in the output (shifted if shiftParents TRUE) but they are not used
#'in the selection of the segregation type.
#'@param polysomic if TRUE at least all polysomic segtypes are considered;
#'if FALSE these are not specifically selected (but if e.g. disomic is TRUE,
#'any polysomic segtypes that are also disomic will still be considered)
#'@param disomic if TRUE at least all disomic segtypes are considered (see
#'param polysomic)
#'@param mixed if TRUE at least all mixed segtypes are considered (see
#'param polysomic). A mixed segtype occurs when inheritance in one parent is
#'polysomic (random chromosome pairing) and in the other parent disomic (fully
#'preferential chromosome pairing)
#'@param ploidy The ploidy of parent 1 (must be even, 2 (diploid) or larger).
#'@param ploidy2 The ploidy of parent 2. If omitted it is
#'assumed to be equal to ploidy.
#'@param outfile the tab-separated text file to write the output to; if NA a
#'temporary file checkF1.tmp is created in the current working directory
#'and deleted at end
#'@param critweight NA or a numeric vector containing the weights of three quality
#'criteria; do not need to sum to 1. If NA, the output will not contain a
#'column qall_weights. Else the weights specify how qall_weights will be
#'calculated from quality parameters q1, q2 and q3.
#'@param scores_long TRUE if scores is in "long format", FALSE if it is in
#'"wide format" (see parameter scores)
#'@param Pvalue_threshold a minimum threshold value for the Pvalue of the
#'bestParentfit segtype (with a smaller Pvalue the q1 quality parameter will
#'be set to 0)
#'@param fracInvalid_threshold a maximum threshold for the fracInvalid of the
#'bestParentfit segtype (with a larger fraction of invalid dosages in the F1
#'the q1 quality parameter will be set to 0)
#'@param fracNA_threshold a maximum threshold for the fraction of unscored F1
#'samples (with a larger fraction of unscored samples in the F1
#'the q3 quality parameter will be set to 0)
#'@param shiftmarkers if specified, shiftmarkers must be a data frame with
#'columns MarkerName and shift; for the markernames that match exactly
#'(upper/lowercase etc) those in scores, the dosages are increased by the
#'amount specified in column shift,
#'e.g. if shift is -1, dosages 2..ploidy are converted to 1..(ploidy-1)
#'and dosage 0 is a combination of old dosages 0 and 1, for all samples.
#'The segregation check
#'is then performed with the shifted dosages.
#'A shift=NA is allowed, these markers will not be shifted.
#'The sets of markers in scores and shiftmarkers
#'may be different, but markers may occur only once in shiftmarkers.
#'A column shift is added at the end of the returned data frame.\cr
#'If parameter shiftParents is TRUE, the parental and ancestor scores are
#'shifted as the F1 scores, if FALSE they are not shifted.
#'@param parentsScoredWithF1 TRUE means parents are scored in the same experiment
#'and the same fitPoly run as the F1, else FALSE.
#'If TRUE, the fraction missing scores and conflicts in the parents
#'tell something about the quality of the scoring. If FALSE
#'(e.g. when the F1 is triploid and the parents are diploid and tetraploid) the
#'quality of the F1 scores can be independent of that of the parents.\cr
#'If not specified, TRUE is assumed if ploidy2 == ploidy and FALSE if
#'ploidy2 != ploidy
#'@param shiftParents only used if parameter shiftmarkers is specified. If TRUE,
#'apply the shifts also to the parental and ancestor scores.
#'By default TRUE if parentsScoredWithF1 is TRUE, else FALSE
#'@param showAll (default FALSE) if TRUE, for each segtype 3 columns
#'are added to the returned data frame with the frqInvalid, Pvalue and
#'matchParents values for these segtype (see the description of the return value)
#'@param append_shf if TRUE and parameter shiftmarkers is specified, _shf is
#'appended to all marker names where shift is not 0. This is not required for
#'any of the functions in this package but may prevent duplicated marker names
#'when using other software.

#'@details For each marker is tested how well the different segregation types
#'fit with the observed parental and F1 dosages. The results are summarized
#'by columns bestParentfit (which is the best fitting segregation type,
#'taking into account the F1 and parental dosages) and columns qall_mult
#'and/or qall_weights (how good is the fit of the bestParentfit segtype: 0=bad,
#'1=good).\cr
#'Column bestfit in the results gives the segtype best fitting the F1
#'segregation without taking account of the parents. This bestfit segtype is
#'used by function correctDosages, which tests for possible "shifts" in
#'the marker models. Both bestfit and bestParentfit are restricted by the
#'parameters polysomic, disomic and mixed. Further they are restricted to
#'segtypes that can only occur when the parental dosages are equal, if
#'parent1 and parent2 list the same samples (but not if both are empty).\cr
#'In case the parents are not scored together with the F1 (e.g. if the F1 is
#'triploid and the parents are diploid and tetraploid) the scores data frame
#'should be edited to contain the parental as well as the F1 scores.
#'In case the diploid and tetraploid parent are scored in the same run of
#'function fitMarkers (package fitPoly)
#'the diploid is initially scored as nulliplex-duplex-quadruplex (dosage 0, 2
#'or 4); that must be converted to the true diploid dosage scores (0, 1 or 2).
#'Similar corrections are needed with other combinations, such as a diploid
#'parent scored together with a hexaploid population etc.
#'
#'@return A data frame with one row per markers, with the following columns:
#'\itemize{
#'\item{m: the sequential number of the marker (as assigned by fitPoly)}
#'\item{MarkerName: the name of the marker, with _shf appended if the marker
#'is shifted and append_shf is TRUE}
#'\item{parent1: consensus dosage score of the samples of parent 1}
#'\item{parent2: consensus dosage score of the samples of parent 2}
#'\item{F1_0 ...	F1_<ploidy>: the number of F1 samples with dosage scores
#'0 ... <ploidy>}
#'\item{F1_NA: the number of F1 samples with a missing dosage score}
#'\item{sample names of parents and ancestors: the dosage scores for those
#'samples}
#'\item{bestfit: the best fitting segtype, considering only the F1 samples}
#'\item{frqInvalid_bestfit: for the bestfit segtype, the frequency of F1 samples
#'with a dosage score that is invalid (that should not occur). The frequency is
#'calculated as the number of invalid samples divided by the number of non-NA
#'samples}
#'\item{Pvalue_bestfit: the chisquare test P-value for the observed
#'distribution of dosage scores vs the expected fractions. For segtypes
#'where only one dosage is expected (1_0, 1_1 etc) the binomial probability of
#'the number of invalid scores is given, assuming an error
#'rate of seg_invalidrate (hard-coded as 0.03)}
#'\item{matchParent_bestfit: indication how the bestfit segtype matches the
#'consensus dosages of parent 1 and 2: "Unknown"=both parental
#'dosages unknown; "No"=one or both parental dosages known
#'and conflicting with the segtype; "OneOK"= only one parental
#'dosage known, not conflicting with the segtype; "Yes"=both
#'parental dosages known and combination matching with
#'the segtype. This score is initially assigned based on
#'only high-confidence parental consensus scores; if
#'low-confidence dosages are confirmed by the F1, the
#'matchParent for (only) the selected segtype is
#'updated, as are the parental consensus scores.}
#'\item{bestParentfit: the best fitting segtype that does not conflict with
#'the parental consensus scores}
#'\item{frqInvalid_bestParentfit, Pvalue_bestParentfit,
#'matchParent_bestParentfit: same as the corresponding columns for bestfit.
#'Note that matchParent_bestParentfit cannot be "No".}
#'\item{q1_segtypefit: a value from 0 (bad) to 1 (good), a measure of the fit of
#'the bestParentfit segtype based on Pvalue, invalidP and whether bestfit is
#'equal to bestParentfit}
#'\item{q2_parents: a value from 0 (bad) to 1 (good), based either on the
#'quality of the parental scores (the number of missing scores and of
#'conflicting scores, if parentsScoredWithF1 is TRUE) or on matchParents
#'(No=0, Unknown=0.65, OneOK=0.9, Yes=1, if parentsScoredWithF1 is FALSE)}
#'\item{q3_fracscored: a value from 0 (bad) to 1 (good), based on the fraction
#'of F1 samples that have a non-missing dosage score}
#'\item{qall_mult: a value from 0 (bad) to 1 (good), a summary quality score
#'equal to the product q1*q2*q3. Equal to 0 if any of these is 0, hence
#'sensitive to thresholds; a natural selection criterion would be to accept
#'all markers with qall_mult > 0}
#'\item{qall_weights: a value from 0 (bad) to 1 (good), a weighted average of
#'q1, q2 and q3, with weights as specified in parameter critweight. This column is
#'present only if critweight is specified. In this case there is no "natural"
#'threshold; a threshold for selection of markers must be obtained by inspecting
#'XY-plots of markers over a range of qall_weights values}
#'\item{shift: if shiftmarkers is specified a column shift is added with
#'for all markers the applied shift (for the unshifted markers the shift value
#'is 0)}
#'}
#'qall_mult and/or qall_weights can be used to compare the quality
#'of the SNPs within one analysis and one F1 population but not between analyses
#'or between different F1 populations.\cr
#'If parameter showAll is TRUE there are 3 additional columns for each
#'segtype with names frqInvalid_<segtype>, Pvalue_<segtype> and
#'matchParent_<segtype>; see the corresponding columns for bestfit for an
#'explanation. These extra columns are inserted directly before the bestfit
#'column.
#'@export
checkF1 <- function(scores,
                    parent1,
                    parent2,
                    F1,
                    ancestors=character(0),
                    polysomic, disomic, mixed,
                    ploidy, ploidy2,
                    outfile,
                    critweight=c(1.0, 0.4, 0.4),
                    scores_long=TRUE,
                    Pvalue_threshold=0.0001,
                    fracInvalid_threshold=0.05,
                    fracNA_threshold=0.25,
                    shiftmarkers,
                    parentsScoredWithF1,
                    shiftParents=parentsScoredWithF1,
                    showAll=FALSE,
                    append_shf=FALSE) {
  if (ploidy %% 2 != 0) stop("checkF1: odd ploidy not allowed")
  if (missing(ploidy2) || is.na(ploidy2)) ploidy2 <- ploidy else
    if (ploidy2 %% 2 != 0) stop("checkF1: odd ploidy2 not allowed")
  ploidyF1 <- (ploidy + ploidy2) / 2

  if (missing(parentsScoredWithF1))
    parentsScoredWithF1 <- ploidy2 == ploidy

  #check critweight
  if (is.null(critweight) || is.na(critweight[1])) critweight <- NA else {
    if (!is.numeric(critweight) || length(critweight) != 3 ||
        sum(is.na(critweight)) > 0 || sum(critweight) == 0) {
      stop("invalid critweight")
    }
  }
  #check shiftmarkers:
  if (missing(shiftmarkers)) shiftmarkers <- NA
  if (is.data.frame(shiftmarkers)) {
    if (sum(is.na(match(c("MarkerName", "shift"), names(shiftmarkers)))) > 0)
      stop("checkF1: shiftmarkers must have columns MarkerName and shift")
    shiftmarkers$shift[is.na(shiftmarkers$shift)] <- 0
    if (nrow(shiftmarkers) == 0) shiftmarkers <- NA else {
      if (sum(is.na(shiftmarkers$shift)) > 0 ||
          sum(!(shiftmarkers$shift %in% ((-ploidyF1):ploidyF1))) > 0)
        stop("checkF1: shiftmarkers contains invalid shift values")
      shiftmarkers$MarkerName <- as.character(shiftmarkers$MarkerName)
      if (length(unique(shiftmarkers$MarkerName)) < nrow(shiftmarkers))
        stop("checkF1: some markernames occur more than once in shiftmarkers")
    }
  }
  seg_invalidrate <- 0.03 #assumed percentage of scoring errors leading to invalid dosages
  file.del <- is.na(outfile) || outfile == ""
  if (file.del) outfile <- "checkF1.tmp"
  if (!checkFilename(outfile))
    stop(paste("checkF1: cannot write file", outfile))
  #fill in missing vectors and/or clean up:
  if (missing(parent1) || is.logical(parent1)) parent1 <- character(0)
  if (missing(parent2) || is.logical(parent2)) parent2 <- character(0)
  if (missing(ancestors) || is.logical(ancestors)) ancestors <- character(0)
  parent1 <- sort(as.character(parent1[!is.na(parent1)]))
  parent2 <- sort(as.character(parent2[!is.na(parent2)]))
  ancestors <- as.character(ancestors[!is.na(ancestors)])
  #select the relevant segtypes:
  if (!polysomic && !disomic && !mixed)
    stop("checkF1: at least one of polysomic, disomic and mixed must be TRUE")
  seginfo <- calcSegtypeInfo(ploidy, ploidy2)
  allsegtypenames <- names(seginfo)
  seginfo <- selSegtypeInfo(seginfo, polysomic, disomic, mixed,
                            selfing=checkSelfing(parent1, parent2))
  seginfoSummary <- segtypeInfoSummary(seginfo)

  # names of quality parameters:
  qnames <- c("q1_segtypefit"," q2_parents","q3_fracscored","qall_mult")
  if (length(critweight) == 3) qnames[5] <- "qall_weights"

  # check scores:
  if (scores_long) {
    if (anyNA(match(c("marker", "MarkerName", "SampleName", "geno"),
                    names(scores))))
      stop("checkF1: some required columns missing from scores")
    if (anyNA(scores$marker) || anyNA(scores$MarkerName) ||
        anyNA(scores$SampleName))
      stop("checkF1: no missing values allowed in columns marker, MarkerName or SampleName of scores")
    mrknames <- sort(as.character(unique(scores$MarkerName)))
  } else {
    if (length(names(scores)) < 2 ||
        !identical(c("marker", "MarkerName"), names(scores)[1:2]))
      stop("checkF1: first two columns of scores should be marker and MarkerName")
    if (anyNA(scores$marker) || anyNA(scores$MarkerName))
      stop("checkF1: no missing values allowed in columns marker and MarkerName of scores")
    mrknames <- as.character(scores$MarkerName)
  }

  createResultsdf <- function(mrkcount) {
    #function within checkF1
    saf <- getOption("stringsAsFactors")
    options(stringsAsFactors = FALSE)
    #create the results data frame for this batch:
    mat <- matrix(
      integer((2+ploidyF1+2+length(parent1)+length(parent2)+length(ancestors)) *
                mrkcount), nrow=mrkcount)
    colnames(mat) <- c("parent1", "parent2", paste("F1", 0:ploidyF1, sep="_"),
                       "F1_NA", parent1, parent2, ancestors)
    #segdf is the template for all individual segtype data frames
    #(sets of 3 columns, and one row for each marker):
    segdf <- data.frame(frqInvalid=character(mrkcount), #printed as formatted strings
                        Pvalue=character(mrkcount),
                        matchParent=factor(rep(NA,mrkcount),
                                           levels=c("No","OneOK","Unknown","Yes")))
    #bres is the data frame that will hold the results (one row per marker)
    bres <- data.frame(
      m=integer(mrkcount),
      MarkerName=character(mrkcount),
      mat
    )
    if (showAll) for (sg in 1:length(seginfo)) {
      df <- segdf
      names(df) <- paste(names(df), names(seginfo)[sg], sep="_")
      bres <- data.frame(bres, df)
    }
    segdf <- data.frame(
      fit = factor(rep(NA, mrkcount), levels=names(seginfo)),
      segdf
    )
    for (fi in c("bestfit", "bestParentfit")) {
      df <- segdf
      names(df) <- c(fi,
                     paste(names(df)[2:4], fi, sep="_"))
      bres <- data.frame(bres, df)
    }
    bres <- data.frame (
      bres,
      q1_segtypefit=character(mrkcount),
      q2_parents=character(mrkcount),
      q3_fracscored=character(mrkcount),
      qall_mult=character(mrkcount)
    )
    if (length(critweight) == 3)
      bres <- data.frame(
        bres,
        qall_weights=character(mrkcount)
      )
    if (is.data.frame(shiftmarkers))
      bres <- data.frame(
        bres,
        shift=integer(mrkcount))
    options(stringsAsFactors = saf)
    bres
  } #createResultsdf within checkF1

  segtypeBestSelcrit <- function(candidates) {
    #function within checkF1
    #candidates is a vector indexing the segtypes in results from which
    #           the one with the best selcrit must be selected
    #return value: index to that segtype, or 0 if none
    if (length(candidates) == 0) return(0)
    candSelcrit <- selcrit[candidates]
    candidates[which.max(candSelcrit)]
  } # segtypeBestSelcrit within checkF1

  compareFit <- function(newsegtype, oldsegtype) {
    #function within checkF1
    #newsegtype, oldsegtype: two indices into results
    #return value: TRUE if newsegtype is selected,
    #              FALSE if oldsegtype is selected
    (results$fracInvalid[newsegtype] <=
       max(0.05, 1.5 * results$fracInvalid[oldsegtype])) &&
      (results$Pvalue[newsegtype] >=
         min(0.01, 0.1 * results$Pvalue[oldsegtype]))
  } #compareFit within checkF1

  #subsetting from a big scores data.frame takes a lot of time
  #(approx 2 sec for each marker, in a file of ~26000 markers and 480 samples)
  #Therefore we subdivide it in batches of 100 markers
  #if not scores_long this probably is less relevant but doesn't do any harm
  batchsize <- 100
  batchnr <- 1
  while (batchsize * (batchnr-1) < length(mrknames)) {
    minmrk <- batchsize*(batchnr-1) + 1
    maxmrk <- min(length(mrknames), batchsize*batchnr)
    if (scores_long) {
      #batchscores <- subset(scores, subset=markername %in% mrknames[minmrk:maxmrk]) #only 260 * 2 sec
      batchscores <- scores[scores$MarkerName %in% mrknames[minmrk:maxmrk],] #only 260 * 2 sec
    } else {
      batchscores <- scores[scores$MarkerName %in% mrknames[minmrk:maxmrk],,
                            drop=FALSE]
    }
    bres <- createResultsdf(maxmrk - minmrk + 1) #bres: batchresults
    for (mrk in minmrk:maxmrk) {
      if (scores_long) {
        sc <- batchscores[batchscores$MarkerName == mrknames[mrk],] #very fast now
        if ("marker" %in% names(sc)) {
          mrknr <- sc$marker[1] #marker number in fitPoly input
        } else mrknr <- mrk
      } else {
        mrknr <- which(batchscores$MarkerName == mrknames[mrk])
        sc <- data.frame(SampleName=names(batchscores)[3:length(batchscores)],
                         geno=as.integer(batchscores[mrknr, 3:length(batchscores)]))
        mrknr <- batchscores$marker[mrknr] #marker number in fitPoly input
      }
      #get all the parent and ancestor genotypes in the order of the given vectors:
      parent.geno <- list()
      parent.geno[[1]] <- sc$geno[match(parent1, sc$SampleName)]
      parent.geno[[2]] <- sc$geno[match(parent2, sc$SampleName)]
      ancestors.geno <- sc$geno[match(ancestors, sc$SampleName)]
      F1.geno <- sc$geno[sc$SampleName %in% F1] #F1 does not have to be ordered

      shift <- 0
      if (is.data.frame(shiftmarkers)) {
        whichshift <- which(shiftmarkers$MarkerName == mrknames[mrk])
        if (length(whichshift) == 1) {
          shift <- shiftmarkers$shift[whichshift]
          if (shift != 0) {
            #changed: now all shifted dosages outside 0..ploidyF1 are set
            #to 0 or ploidyF1 (i.e. dosages that were separated are now merged),
            #before 20161130 they were set to NA
            #F1.geno <- ancestors.geno + shift
            #F1.geno[!(F1.geno %in% 0:ploidyF1)] <- NA
            F1.geno <- shiftdosages(F1.geno, shift, ploidyF1)
            if (shiftParents) {
              #also here new shiftdosages is used:
              parent.geno[[1]] <- shiftdosages(parent.geno[[1]], shift, ploidyF1)
              parent.geno[[2]] <- shiftdosages(parent.geno[[2]], shift, ploidyF1)
              ancestors.geno <- shiftdosages(ancestors.geno, shift, ploidyF1)
            }
          }
        }
      }

      #get the consensus genotype and conflict status of each of the two parents:
      par.geno <- c(0, 0) #integer
      par.lowconf.geno <- c(0, 0) #integer
      par.conflicts <- c(FALSE, FALSE) #logical
      par.NAfrac <- c(0.5, 0.5) #floatingpoint
      for (parent in 1:2) {
        #if (strict.parents) maxNAfrac <- 0.499 else maxNAfrac <- 1
        parresult <- getConsensusGeno(geno=parent.geno[[parent]],
                                      maxNAfrac=0.499,
                                      lowconf.NAfrac=0.751)
        par.geno[parent] <- parresult$geno
        par.lowconf.geno[parent] <- parresult$lowconf.geno
        par.conflicts[parent] <- parresult$conflict
        par.NAfrac[parent] <- parresult$NAfrac
      }

      #get the freq. distribution of the F1 samples:
      F1.naCount <- sum(is.na(F1.geno))
      F1.nobs <- length(F1.geno) - F1.naCount #the number of non-NA F1 samples
      #F1.counts <- hist(F1.geno, breaks=seq(-0.5, ploidyF1+0.5, by=1), plot=FALSE)$counts
      F1.counts <- tabulate(bin=F1.geno + 1, nbins=ploidyF1 + 1)
      #note that F1.counts[1] has the number of F1 samples with geno==0 etc !
      bestfit <- NA; bestParentfit <- NA;
      q <- rep(NA, length(qnames))
      results <- data.frame(segtype=names(seginfo),
                            fracInvalid=rep(1.0, length(seginfo)),
                            invalidP=rep(0.0, length(seginfo)),
                            Pvalue=rep(0.0, length(seginfo)),
                            matchParents=I(as.character(rep(NA,length(seginfo)))))
      if (F1.nobs > 10) {
        for (s in 1:length(seginfo)) {
          #does the current segtype match the parental genotypes?
          results$matchParents[s] <-
            getMatchParents(parGeno=par.geno, seginfoItem=seginfo[[s]])
          #calculate fraction invalid F1 scores given the current segtype:
          exp.geno <- seginfo[[s]]$expgeno #the genotypes that can occur given the segtype
          F1.invalid <-
            length(F1.geno[!(F1.geno %in% exp.geno)]) - F1.naCount #number of invalid scores
          results$fracInvalid[s] <- F1.invalid / F1.nobs
          #calculate chi-square P-value given the current segtype and invalidP value:
          if (F1.nobs - F1.invalid > 0) {
            #at least some F1 samples in expected peaks
            #(else Pvalue and invalidP remain 0)
            results$invalidP[s] <-
              pbinom(q=F1.nobs - F1.invalid,   #nr of valid scores
                     size=F1.nobs,             #total nr of scores
                     prob=1 - seg_invalidrate) #expected fraction of valid scores
            if (length(exp.geno) == 1) {
              results$Pvalue[s] <- 1.0
            } else {
              #more than one genotype expected, chisquare test
              if (sum(F1.counts[exp.geno + 1]) == 0) {
                #all observations are in invalid classes
                results$Pvalue[s] <- 0.0
                results$invalidP[s] <- 0.0
              } else {
                suppressWarnings(
                  results$Pvalue[s] <-
                    chisq.test(F1.counts[exp.geno+1],
                               p=seginfo[[s]]$freq[exp.geno+1])$p.value)
                #warnings probably generated by class counts < 5;
                #these do not worry us
                #as those P-values are likely to be very significant anyway
                #also the reported warnings are not informative
              }
            }
          }
        } # for (s in 1:length(seginfo))
        #select best fitting segtype based on fracInvalid and Pvalue
        #selcrit: selection criterion
        selcrit <- results$invalidP * results$Pvalue
        bestfit <- which.max(selcrit) #discards NA's
        if (bestfit == 0)
          stop(paste("Error in checkF1: bestfit is 0 at marker", mrknames[mrk]))

        #select which segtype best fits the parental genotype(s)
        ParentFit <- which(results$matchParents %in% c("Yes","OneOK","Unknown"))
        bestParentfit <- segtypeBestSelcrit(ParentFit)
        if (bestParentfit == 0)
          stop(paste("Error in checkF1: bestParentfit is 0 at marker", mrknames[mrk]))
        #bestParentfit is 0 if all segtypes have matchParents=="No" (should never occur)

        #Can we improve on bestParentfit by using low-confidence parental scores?
        lowParentFit <- NA
        lowc <- which(!is.na(par.lowconf.geno))
        if (length(lowc) > 0) {
          if (length(lowc) == 1) {
            if (is.na(par.geno[3 - lowc])) {
              #one parent missing, other lowconf
              low.segtypes <- seginfoSummary$segtypenr[
                seginfoSummary[, 2+lowc] == par.lowconf.geno[lowc]]
            } else {
              #one parent high conf, other lowconf score
              low.segtypes <- seginfoSummary$segtypenr[
                (seginfoSummary[, 2+lowc] == par.lowconf.geno[lowc]) &
                  (seginfoSummary[, 5-lowc] == par.geno[3-lowc])]
            }
            if (length(low.segtypes) > 0) {
              lowParentfit <- segtypeBestSelcrit(low.segtypes)
              #now we must check if lowParentfit is equal to or not
              #too much worse than bestParentfit: if so, we promote the
              #par.lowconf.geno to a true geno and lowParentfit to
              #bestParentfit
              if (compareFit(lowParentfit, bestParentfit)) {
                par.geno[lowc] <- par.lowconf.geno[lowc]
                bestParentfit <- lowParentfit
                results$matchParents[bestParentfit] <-
                  getMatchParents(parGeno=par.geno,
                                  seginfoItem=seginfo[[bestParentfit]])
              }
            }
          } else {
            #both parents have a low-conf score:
            #we should check for both scores if they are acceptable
            #(i.e. equal to, or not too much worse than bestfit)
            #If one of them is ok and not the other, we promote it and its lowParentfit;
            #if both pass and their combination is ok we promote both
            #and their combined lowParentfit;
            #if both pass but the combination does not pass we don't
            #promote either and keep bestfit as bestParentfit
            #First the combination:
            low.segtypes <- seginfoSummary$segtypenr[
              (seginfoSummary[, 3] == par.lowconf.geno[1]) &
                (seginfoSummary[, 4] == par.lowconf.geno[2])]
            lowParentfit <- segtypeBestSelcrit(low.segtypes)
            if (compareFit(lowParentfit, bestParentfit)) {
              #the combination of both par.lowconf.geno is acceptable
              par.geno <- par.lowconf.geno
              bestParentfit <- lowParentfit
              results$matchParents[bestParentfit] <-
                getMatchParents(parGeno=par.geno,
                                seginfoItem=seginfo[[bestParentfit]])
              #always "Yes"?
            } else {
              #the combination of both par.lowconf.geno is not acceptable,
              #check the two parents separately:
              lowParentfit <- c(0,0)
              for (p in 1:2) {
                low.segtypes <- seginfoSummary$segtypenr[
                  seginfoSummary[, 2+p] == par.lowconf.geno[p]]
                lowParentfit[p] <- segtypeBestSelcrit(low.segtypes)
                if (!compareFit(lowParentfit[p], bestParentfit))
                  lowParentfit[p] <- 0
              }
              p <- which(lowParentfit != 0)
              if (length(p) == 1) {
                #just one parent gives an acceptable lowParentfit, use that one:
                par.geno[p] <- par.lowconf.geno[p]
                bestParentfit <- lowParentfit[p]
                results$matchParents[bestParentfit] <-
                  getMatchParents(parGeno=par.geno,
                                  seginfoItem=seginfo[[bestParentfit]])
                #always "OneOK"?
              }
            } #two par.lowconf.geno, the combination was not acceptable
          } #two par.lowconf.geno
        } #at least one par.lowconf.geno

        #Note that we don't attempt here to resolve mismatches between the final
        #par.geno and bestParentfit, nor to fill in missing par.geno if the
        #bestParentfit segtype would allow only one solution.
        #That is done in compareProbes (and therefore also in writeDosagefile).
        #A further expansion to generate markers with all possible par.geno
        #allowed by the segtype and the known par.geno is done in
        #expandUnknownParents

        # generate a set of quality parameters of the marker in general and
        # the fit of bestParentfit:
        q <- calc_qall(Pvalue_threshold, fracInvalid_threshold, fracNA_threshold,
                       Pvalue=results$Pvalue[bestParentfit],
                       fracInvalid=results$fracInvalid[bestParentfit],
                       F1.NAfrac=F1.naCount / length(F1.geno),
                       matchParents=results$matchParents[bestParentfit],
                       bestfit=bestfit, bestParentfit=bestParentfit,
                       par.conflicts=par.conflicts,
                       par.NAfrac=par.NAfrac,
                       critweight=critweight,
                       parentsScoredWithF1=parentsScoredWithF1)
      } # F1.nobs>10

      #get the output line:
      bix <- mrk - minmrk + 1 #index to row in bres
      bres$m[bix] <- mrknr
      if (shift==0 || !append_shf) bres$MarkerName[bix] <- mrknames[mrk] else
        bres$MarkerName[bix] <- paste(mrknames[mrk], "shf", sep="_")
      bres[bix, 3:4] <- par.geno
      bres[bix, 5:(5+ploidyF1)] <- F1.counts
      bres$F1_NA[bix] <- F1.naCount
      startcol <- ploidyF1 + 7
      if (length(parent1) > 0) {
        bres[bix, startcol:(startcol-1+length(parent1))] <- parent.geno[[1]]
        startcol <- startcol + length(parent1)
      }
      if (length(parent2) > 0) {
        bres[bix, startcol:(startcol-1+length(parent2))] <- parent.geno[[2]]
        startcol <- startcol + length(parent2)
      }
      if (length(ancestors) > 0) {
        bres[bix, startcol:(startcol-1+length(ancestors))] <- ancestors.geno[[1]]
        startcol <- startcol + length(ancestors)
      }
      if (showAll) {
        bres[bix, startcol+seq(0, by=3, length.out=length(seginfo))] <-
          sprintf("%.4f", results$fracInvalid)
        bres[bix, startcol+seq(1, by=3, length.out=length(seginfo))] <-
          sprintf("%.4f", results$Pvalue)
        bres[bix, startcol+seq(2, by=3, length.out=length(seginfo))] <-
          results$matchParents
        startcol <- startcol + 3 * length(seginfo)
      }
      if (is.na(bestfit)) {
        bres[bix, startcol:(startcol+3)] <- rep(NA, 4)
      } else {
        bres[bix, startcol] <- results$segtype[bestfit]
        bres[bix, startcol+1] <- sprintf("%.4f", results$fracInvalid[bestfit])
        bres[bix, startcol+2] <- sprintf("%.4f", results$Pvalue[bestfit])
        bres[bix, startcol+3] <- results$matchParents[bestfit]
      }
      startcol <- startcol + 4
      if (is.na(bestParentfit)) {
        bres[bix, startcol:(startcol+3)] <- rep(NA, 4)
      } else {
        bres[bix, startcol] <- results$segtype[bestParentfit]
        bres[bix, startcol+1] <- sprintf("%.4f", results$fracInvalid[bestParentfit])
        bres[bix, startcol+2] <- sprintf("%.4f", results$Pvalue[bestParentfit])
        bres[bix, startcol+3] <- results$matchParents[bestParentfit]
      }
      startcol <- startcol + 4
      if (is.na(q[1])) {
        bres[bix, startcol:(startcol-1+length(q))] <- rep(NA, length(q))
      } else {
        bres[bix, startcol:(startcol-1+length(q))] <- sprintf("%.4f", q)
      }
      startcol <- startcol + length(q)
      if (is.data.frame(shiftmarkers)) bres[bix, startcol] <- shift
    } # for mrk
    if (batchnr == 1) {
      writeDatfile(bres, file=outfile)
    } else {
      writeDatfile(bres, file=outfile, append=TRUE, col.names=FALSE)
    }
    batchnr <- batchnr + 1
  } #while batch
  output <- readDatfile(outfile)
  output <- chk2integer(output)
  if (file.del) file.remove(outfile)
  invisible(output)
} #checkF1

#'@title Get substrings from the lefthand side
#'@description Get substrings from the lefthand side
#'@usage leftstr(x, n)
#'
#'@param x a character vector (or something having an as.character method)
#'@param n a single  number: if n>=0, the leftmost n characters of each element
#'of x are selected, if n<0 the (-n) rightmost characters are omitted
#'@return character vector with leftside substrings of x
#'@export
leftstr <- function(x, n) {
  #vectorized for x, not n
  #n>=0: take the leftmost n characters
  #n<0: take all but the rightmost (-n) characters
  if (n >= 0) {
    substr(x, 1, n) #automatically converts x to character if needed
  } else {
    #n < 0
    if (!is.character(x)) x <- as.character(x)
    len <- nchar(x)
    substr(x, 1, len + n)
  }
}

#'@title Get substrings from the righthand side
#'@description Get substrings from the righthand side
#'@usage rightstr(x, n)
#'
#'@param x a character vector (or something having an as.character method)
#'@param n a single  number: if n>=0, the rightmost n characters of each element
#'of x are selected, if n<0 the (-n) leftmost characters are omitted
#'@return character vector with rightside substrings of x
#'@export
rightstr <- function(x, n) {
  #vectorized for x, not n
  #n>=0: take the rightmost n characters
  #n<0: take all but the leftmost (-n) characters
  if (!is.character(x)) x <- as.character(x)
  len <- nchar(x)
  if (n >= 0) {
    start <- len - n + 1
    substr(x, start, len)
  } else {
    #n < 0
    substr(x, -n + 1, len)
  }
}

chk2integer <- function(chk) {
  # chk:    a data frame as returned by checkF1
  # result: the same data frame, but with all columns from Parent1 to
  #         last ancestor converted to integer (as these are sometimes factors
  #         with levels not equal to values after reading checkF1 from file)

  #find out the ploidy:
  F1cap <- names(chk)[which(names(chk) == "F1_NA") - 1]
  ploidyF1 <- as.integer(sub("F1_", "", F1cap))
  #find the column range to convert to integers:
  firstcol <- which(names(chk) == "parent1")
  #determine which column has the last parent or ancestor sample:
  #if 3 columns for each segtype are present (showAll TRUE), then
  #the first column after the last sample has name frqInvalid_<segtype> (where
  #<segtype> depends on the ploidy).
  #else (showAll FALSE) the next column has name "bestfit", followed
  #by "frqInvalid_bestfit"
  lastcol <- which(leftstr(names(chk), 10) == "frqInvalid")[1]
  if (rightstr(names(chk)[lastcol], 7) == "bestfit") {
    lastcol <- lastcol - 2
  } else {
    lastcol <- lastcol - 1
  }
  if (length(firstcol) != 1 || is.na(lastcol) ||
      lastcol - firstcol < 3 + ploidyF1) {
    stop("chk2integer: column names of chk incorrect")
  }
  for (i in firstcol:lastcol) {
    suppressWarnings(chk[[i]] <- as.integer(as.character(chk[[i]])))
    #warnings caused by NAs
  }
  chk
} #chk2integer

getPargeno <- function(P1consensus, P2consensus,
                       segtype, matchParents=NULL, seginfo,
                       allparentscores=FALSE) {
  #P1 consensus, P2consensus: the consensus scores of the two parents
  #segtype: the name of the selected segtype (selfit or (in the new polyploid
  #         version) bestParentfit from checkF1;
  #         should occur in names(seginfo); OR
  #         the number of the segtype in seginfo
  #matchParents: "Yes" if both consensus parental dosages match segtype
  #              (bestParentfit.matchParents from checkF1())
  #seginfo: a list as returned by calcSegtypeInfo(), with the par.dosages
  #         limited to those fitting the auto, allo and mixed parameters used to
  #         calculate checkF1
  #allparentscores: if FALSE (default) and more than one combination would be
  #         possible, c(NA, NA) is returned. If TRUE then a matrix with
  #         one row for each possibility and two columns for the 2 parental
  #         dosages is returned.
  #return value: an integer vector of two elements: the inferred dosages
  #         of parent 1 and 2; or (if allparentscores is TRUE and there are
  #         multiple combinations) a 2-column matrix with one row per
  #         parental combination

  # we try to fill in the expected parental dosage
  # based on the segtype and the consensus of the observed parental scores:
  parent.sc <- c(P1consensus, P2consensus)
  if (is.character(segtype)) {
    segtype <- which(names(seginfo) == segtype)
    if (length(segtype) != 1) stop("internal error in getPargeno")
  } #else segtype is already the number of the segtype
  exp.par <- seginfo[[segtype]]$pardosage #already selected to match
  #                                           auto/allo/mixed
  # now get the final parental dosage from consensus and segtype:
  if (nrow(exp.par) == 1) {
    # both parents must have the same genotype so we can fill these in,
    # irrespective of the observed parental genotype and the matchParent status:
    parent.sc <- exp.par[1,]
  } else {
    if (is.null(matchParents)) {
      matchParents <- getMatchParents(parGeno=parent.sc,
                                      seginfoItem=seginfo[[segtype]])
    }
    if (matchParents != "Yes") {
      # there is more than one possible parental combination.
      # if matchParents=="Yes" we don't need to do anything;
      # else (matchParents= Unknown, OneOK or No) we can only fill in the parents
      # if one parent matches one combination and the other matches none
      pmatch <- rep(NA, 2)
      for (p in 1:2) pmatch[p] <- match(parent.sc[p], exp.par[, p])
      if (sum(!is.na(pmatch)) == 1) {
        #one parent matches one expected parental combination and the other
        #doesn't match any
        r <- min(pmatch, na.rm=TRUE)
        parent.sc <- exp.par[r,]
      }
      else {
        # two possibilities:
        # - both parents each match a different row of exp.par
        # - none of the parents matches a row of exp.par
        # (if both parents would match the same row, matchParents would be Yes)
        # In both cases we cannot assign the parental genotypes:
        if (allparentscores) {
          parent.sc <- exp.par #a 2-column matrix with 2 or more rows
        } else parent.sc <- rep(NA, 2)
      }
    } #matchParents!="Yes"
  } #nrow(exp.par) == 1 else
  parent.sc
} #getPargeno

trim_shf <- function(mrknames) {
  #trim _shf from the marker names:
  if (is.factor(mrknames)) mrknames <- as.character(mrknames)
  if (is.character(mrknames)) {
    shf <- rightstr(mrknames, 4) == "_shf"
    mrknames[shf] <- leftstr(mrknames[shf], -4)
  }
  mrknames
} #trim_shf

makeCombscoresDf <- function(parent1, parent2, F1, ancestors=character(0),
                             other=character(0), nrow){
  #create an empty data frame with the correct size and column names and types
  #parent1 and parent2: character vectors with sampleIDs for parents 1 and 2
  #        (0, 1 or multiple samples per parent allowed, 0 samples are
  #        specified as character(0))
  #F1: character vector with sampleIDs for F1 individuals (one sample per
  #        F1 individual)
  #ancestors: character vector of other ancestor samples listed in the
  #        checkF1 output; character(0) if none
  #other: character vector with sampleIDs of any other samples; character(0) if none
  #nrow: the number of rows to create

  ncol <- length(parent1) + length(parent2) + length(ancestors) + 2 +
    length(F1) + length(other)
  scores <- matrix(rep(NA_integer_, nrow * ncol), nrow=nrow)
  df <- data.frame(
    MarkerName=rep("", nrow),
    segtype=rep("", nrow),
    scores,
    stringsAsFactors=FALSE
  )
  names(df) <- c("MarkerName", "segtype", parent1, parent2, ancestors,
                 "parent1", "parent2", F1, other)
  df
} #makeCombscoresDf

# writeDosagefile ***********************************************************
#'@title Write a file with segregation types and dosage scores
#'@description Write a file with for each marker the segregation type and the
#'dosage scores of the parental and ancestor samples, the parental consensus
#'dosages and the F1 samples.
#'@usage writeDosagefile(chk, scores, parent1, parent2, F1,
#'ancestors=character(0), other=character(0),
#'polysomic=TRUE, disomic=FALSE, mixed=FALSE,
#'ploidy, ploidy2, shiftParents, scorefile)
#'
#'@param chk data frame as returned by checkF1
#'@param scores data frame as read from the scores file produced by function
#'fitMarkers of package fitPoly (or a subset with at least columns
#'MarkerName, SampleName and geno)
#'@param parent1 character vector with the sample names of parent 1
#'@param parent2 character vector with the sample names of parent 2
#'@param F1 character vector with the sample names of the F1 individuals
#'@param ancestors character vector with the sample names of any other
#'ancestors
#'@param other other samples that should be treated like the F1
#'@param polysomic TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param disomic TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param mixed TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param ploidy the ploidy of parent 1 (must be even, 2 (diploid) or larger)
#'and the same as used by checkF1 to calculate the chk data frame
#'@param ploidy2 the ploidy of parent 2. If omitted it is assumed to be equal
#'to ploidy. Should be the same as used by checkF1 to calculate the chk
#'data frame
#'@param shiftParents if there is a column shift in chk the F1 dosages will be
#'shifted. If shiftParents is TRUE the parents and ancestors will be shifted
#'together with the F1, if FALSE only the F1 will be shifted in that case.\cr
#'If shiftParents is missing
#'or NA it will be set to TRUE except if ploidy2 != ploidy:
#'in that case this will result in an error (because it
#'may be that the parents are not genotyped or scored together with the
#'F1, the user should specify explicitly what to do)
#'@param scorefile filename for tab-separated text file with the dosages,;
#'if NA no file is written. For details of the contents see the return value
#'
#'@return A data frame with columns:
#'\itemize{
#'\item{MarkerName: the name of the marker}
#'\item{segtype: the segregation type}
#'\item{parental and ancestor samples: the dosages of those samples}
#'\item{parent1: the consensus dosage for parent1 as determined by checkF1}
#'\item{parent2: the consensus dosage for parent2 as determined by checkF1}
#'\item{F1 samples: the dosages for those samples}
#'\item{other samples: the dosages for those samples}
#'}
#'@export
writeDosagefile <- function(chk,
                            scores,
                            parent1,
                            parent2,
                            F1,
                            ancestors=character(0),
                            other=character(0),
                            polysomic=TRUE, disomic=FALSE, mixed=FALSE,
                            ploidy, ploidy2,
                            shiftParents,
                            scorefile) {
  #writeScoresfile uses compareProbes to just write the scores and segtypes
  #in the same format as compareProbes. It treats each marker separately
  #without comparing or combining two probes for the same SNP, but it takes
  #into account any shifts (from column shift in chk, if present).
  #It does so by calling compareProbes with probe.suffix=NA.
  if (missing(ploidy2)) ploidy2 <- ploidy
  if (missing(shiftParents)) shiftParents <- NA
  result <- compareProbes(
    chk=chk,
    scores=scores,
    probe.suffix=NA,
    fracdiff.threshold=0.04,
    parent1=parent1,
    parent2=parent2,
    F1=F1,
    ancestors=ancestors,
    other=other,
    polysomic=polysomic, disomic=disomic, mixed=mixed,
    ploidy=ploidy, ploidy2=ploidy2,
    shiftParents=shiftParents,
    qall_flavor="qall_mult",
    compfile=NA, combscorefile=scorefile)
  invisible(result$combscores)
} #writeDosagefile

# compareProbes ***************************************************************
#'@title Compare and combine results from two probes for the same SNP
#'@description On Affymetrix Axiom arrays it is possible to have two probes
#'interrogating the same SNP position. This function compares the dosage scores
#'and checkF1 results of the two probes; if they are sufficiently similar
#'a new marker is generated combining the results of the two probes. A dosage
#'file with the data for the separate probes as well as the combined markers
#'is written with the same format as writeDosagefile, and also a file
#'summarizing the comparison results.
#'
#'@usage compareProbes(chk, scores,
#'probe.suffix=c("P","Q","R"), fracdiff.threshold=0.04,
#'parent1, parent2, F1, ancestors=character(0), other=character(0),
#'polysomic=TRUE, disomic=FALSE, mixed=FALSE,
#'ploidy, ploidy2, qall_flavor="qall_mult", shiftParents,
#'compfile, combscorefile)
#'
#'@param chk data frame as returned by checkF1, or a subset with at least
#'columns markername, parent1, parent2 (the consensus parental genotypes),
#'the columns for the samples specified by parameters parent1, parent2 and
#'ancestors, and bestParentfit, and containing only rows with selected markers.
#'If a column  with a name as specified by qall_flavor (see below) is present
#'this will be written to file compfile, but it is not used: any selection of
#'marker based on qall (or other) must have been made beforehand, and the
#'rows for the unwanted markers must have been deleted from the chk data frame.\cr
#'For each marker*probe combination there may be an unshifted version
#'(shift==0), a shifted one (shift!=0), both, or neither.
#'\cr
#'If a column shift is present it will be used to shift the dosages
#'(and their P-values with them).
#'If some markernames end in "_shf" this part will be ignored, but the
#'P and Q suffixes (or alternatives as specified by probe.suffix) are
#'required to distinguish the two probes.
#'@param scores data frame as read from the scores file produced by function
#'fitMarkers of package fitPoly, with at least columns MarkerName,
#'SampleName, P0 .. P<ploidyF1> and geno (where <ploidyF1> is the ploidy of the
#'F1, i.e. the average of parental ploidy and ploidy2).\cr
#'If the F1 parents are scored separately, their rows should be added to the
#'scores data.frame for the F1 samples. If their ploidy is different from the
#'F1, the number of their P columns must be adjusted. The P data of the parents
#'are not used, they may all be set to NA.
#'@param probe.suffix a 3-item character vector specifying the suffixes of the
#'marker names that distinguish the two probes. The first two items identify
#'the two probes; the third item is used to indicate a new marker combining
#'the data from both probes. The three items must be different and have the
#'same number of characters default is c("P","Q","R")
#'@param fracdiff.threshold if more than this fraction of F1 scores differs
#'between probes, don't combine
#'@param parent1 character vector with the sample names of parent 1
#'@param parent2 character vector with the sample names of parent 2
#'@param F1 character vector with the sample names of the F1 individuals
#'@param ancestors character vector with the sample names of any other
#'ancestors
#'@param other other samples that should be treated like the F1
#'@param polysomic TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param disomic TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param mixed TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param ploidy the ploidy of parent 1 (must be even, 2 (diploid) or larger),
#'and the same as used by checkF1 to calculate the chk data frame
#'@param ploidy2 the ploidy of parent 2. If omitted it is assumed to be equal
#'to ploidy. Should be the same as used by checkF1 to calculate the chk
#'data frame
#'@param qall_flavor which quality parameter column must be shown in compfile,
#'default "qall_mult". If no quality data are wanted, specify "".
#'@param shiftParents if there is a column shift in chk the F1 dosages will be
#'shifted. If shiftParents is TRUE the parents and ancestors will be shifted
#'together with the F1, if FALSE only the F1 will be shifted in that case.\cr
#'If shiftParents is missing
#'or NA it will be set to TRUE except if ploidy2 != ploidy:
#'in that case this will result in an error (because it
#'may be that the parents are not genotyped or scored together with the
#'F1, the user should specify explicitly what to do)
#'@param compfile filename for tab-separated text file summarizing the
#'comparison results; if NA no file is written. For details of the contents see
#'the return value, component compstat
#'@param combscorefile filename for tab-separated text file with the dosages;
#'if NA no file is written. For details of the contents see the return value,
#'component combscores
#'
#'@details A combined marker is made in each case that a version of each of the
#'two probe markers is present and they are sufficiently similar. This means
#'that they have been assigned the same bestParentfit segregation type by
#'checkF1, and that the frequency of conflicting scores over all samples is
#'not more than fracdiff.threshold. The combined marker will have NA scores for
#'individuals where both probe markers are missing, the one available score if
#'it is scored for only one of the two probe markers or both scores are equal,
#'and the score with the highest P-value if the scores for both probe markers
#'are unequal.\cr
#'Any single-probe markers in chk that do not have a bestParentfit segregation
#'type are ignored and will not affect or appear in the output.
#'
#'@return A list with two components, compstat and combscores.\cr
#'compstat is a data frame with columns:
#'\itemize{
#'\item{MarkerName: name of the SNP marker. If a column shift is present in
#'data.frame chk, unshifted and shifted markers will get a "n" or "s" suffixed
#'to the MarkerName}
#'\item{segtypeP and segtypeQ: the segtype assigned by checkF1 to the first and
#'second probe}
#'\item{qallP and qallQ: the quality scores specified by parameter qall_flavor,
#'assigned by checkF1 to the two probes}
#'\item{countP and countQ: the number of versions of each of the probes
#'(0, 1, or 2, depending on whether a shifted, unshifted or both versions were
#'present)}
#'\item{countR: the number of combinations made of versions of the two probe
#'markers (one for each combination of a version of each of the two probe
#'markers, if they match well enough - see details)}
#'}
#'If the chk data frame contains a column shift, there are separate columns for
#'the non-shifted and shifted P and Q probe markers (suffix Pn, Ps, Qn, Qs),
#'and four columns for the R markers (suffix Rnn, Rns, Rsn, Rss where the first
#'n/s indicates if the P was non-shifted or shifted and the second n/s for the
#'Q probe.
#'combscores is a data frame with columns:
#'\itemize{
#'\item{MarkerName: the name of the marker. If the chk data frame contains a
#'column shift, the P and Q marker names are suffixed with n or s, and the R
#'marker names with nn, ns, sn, ss as described above}
#'\item{segtype: the segregation type}
#'\item{parental and ancestor samples: the dosages of those samples}
#'\item{parent1: the consensus dosage for parent1 as determined by checkF1}
#'\item{parent2: the consensus dosage for parent2 as determined by checkF1}
#'\item{F1 samples: the dosages for those samples}
#'\item{other samples: the dosages for those samples}
#'}
#'@export
compareProbes <- function(chk,
                          scores,
                          probe.suffix=c("P","Q","R"),
                          fracdiff.threshold=0.04,
                          parent1,
                          parent2,
                          F1,
                          ancestors=character(0),
                          other=character(0),
                          polysomic=TRUE, disomic=FALSE, mixed=FALSE,
                          ploidy, ploidy2,
                          qall_flavor="qall_mult",
                          shiftParents,
                          compfile, combscorefile) {
  #If probe.suffix is not NA but a set of 3 strings,
  #this function compares the results of the two probes for the same SNP (from
  #an Axiom array).
  #If probe.suffix is NA, this function only writes the scores from the
  #scores data frame and the segtypes from chk to combscorefile.
  #This option is used by function writeScoresfile, to avoid duplication a
  #lot of code and to obtain a scores file of the same format.


  #first some checks:
  noprobes <- length(probe.suffix) == 1 && is.na(probe.suffix)
  if (noprobes) funcname <- "writeScoresfile" else funcname <- "compareProbes"
  if (!noprobes &&
      (length(probe.suffix) != 3 || sum(is.na(probe.suffix)) > 0 ||
       max(nchar(probe.suffix)) != min(nchar(probe.suffix)) ||
       sum(abs(match(probe.suffix,probe.suffix) - c(1,2,3))) > 0))
    stop("compareProbes: probe.suffix invalid")
  if (ploidy %% 2 != 0)
    stop(paste(funcname, ": odd parental ploidy not allowed", sep=""))
  if (missing(ploidy2) || is.na(ploidy2)) ploidy2 <- ploidy else
    if (ploidy2 %% 2 != 0)
      stop(paste(funcname, ": odd parental ploidy2 not allowed", sep=""))
  ploidyF1 <- (ploidy + ploidy2) / 2
  segtypecol <- which(tolower(names(chk)) == "segtype")
  if (length(segtypecol) != 1)
    segtypecol <- which(tolower(names(chk)) == "selfit")
  if (length(segtypecol) != 1)
    segtypecol <- which(tolower(names(chk)) == "bestparentfit")
  if (length(segtypecol) != 1)
    stop(paste(funcname,
               ": chk should have one column segtype or bestParentfit",
               sep=""))
  if (!is.character(chk[, segtypecol]) )
    chk[, segtypecol] <- as.character(chk[, segtypecol])
  if (!is.character(chk$MarkerName))
    chk$MarkerName <- as.character(chk$MarkerName)
  #delete all rows without a segtype (mostly those that are not fitted by fitPoly):
  chk <- chk[!is.na(chk[, segtypecol]) & chk[, segtypecol] != "",]
  qallcol <- which(tolower(names(chk)) == qall_flavor)
  if (length(qallcol) > 1) qallcol <- qallcol[1]
  shiftspresent <- FALSE
  shiftcol <- which(tolower(names(chk)) == "shift")
  if (length(shiftcol) > 1) {#none is also allowed
    stop(paste(funcname, ": chk should have at most one column 'shift'"), sep="")
  } else if (length(shiftcol) == 0) {
    chk$shift <- rep(0, nrow(chk))
    shiftcol <- which(tolower(names(chk)) == "shift")
    shiftParents <- TRUE #if no shifts, shiftParents doesn't make a difference
  } else {
    #one existing column shift
    if (missing(shiftParents) || is.na(shiftParents)) {
      if (ploidy == ploidy2) shiftParents <- TRUE else
        stop(paste(funcname,
                   ": if ploidy2 != ploidy, shiftParents must be specified (FALSE/TRUE)",
                   sep=""))
      #note that this is not required if there is no existing column shift
    }
    if (anyNA(chk[, shiftcol]))
      stop(paste(funcname,
                 ": column 'shift' in chk may not contain missing values",
                 sep=""))
    shiftspresent <- TRUE
  }
  #check if all samples are present in scores and if their numbers match with chk:
  progenitors <- c(parent1, parent2, ancestors)
  progenitors <- progenitors[!is.na(progenitors)]
  if (length(progenitors) == 0) shiftParents <- FALSE # no progenitors to shift
  allsamp <- c(progenitors, F1, other)
  if (shiftParents) shfsamp <- rep(TRUE, length(allsamp)) else
    shfsamp <- c(rep(FALSE, length(progenitors)),
                 rep(TRUE, length(c(F1, other))))
  missamp <- allsamp[!(allsamp %in% unique(as.character(scores$SampleName)))]
  if (length(missamp) > 0)
    stop(paste(funcname, ": some samples not in scores:",
               paste(missamp, collapse=" ")))
  missamp <- progenitors[!(progenitors %in% names(chk))]
  if (length(missamp) > 0)
    stop(paste(funcname, ": some parent or ancestor samples not in chk:",
               paste(missamp, collapse=" ")))
  if (length(F1) != sum(chk[1, 5:(6+ploidyF1)]))
    stop(paste(funcname, ": F1 has", length(F1), "samples but in chk",
               sum(chk[1, 5:(6+ploidyF1)]), "values counted"))
  chk$SNPname <- trim_shf(chk$MarkerName)
  if (anyNA(match(chk$SNPname, unique(as.character(scores$MarkerName)))))
    stop(paste(funcname, ": not all markers from chk appear in scores", sep=""))

  if (noprobes) {
    #writeDosagefile
    chk$probe <- rep("", nrow(chk))
  } else {
    mnl <- nchar(chk$SNPname)
    chk$probe <- substr(chk$SNPname, mnl - nchar(probe.suffix[1]) + 1, mnl) #P or Q
    chk$SNPname <- substr(chk$SNPname, 1, mnl - nchar(probe.suffix[1])) #without P or Q
  }
  #now chk has some additional columns:
  #chk$shift: if it was not present, now has all 0's - the amount each marker is shifted
  #chk$probe: the probe suffix (P or Q by default)
  #probchk$SNPname: the original SNPname, without probe suffix or _shf
  #Note that the _shf suffix is not used, the shift value (0 vs not 0)
  #determines the shift status
  if (!noprobes && sum(!(chk$probe %in% probe.suffix[1:2])) > 0)
    stop("compareProbes: not all marker names have a specified probe suffix")
  if (!noprobes && sum(nchar(chk$SNPname) == 0) > 0)
    stop("compareProbes: not all marker names have a specified probe suffix")
  #make sure there is no locked file that will prevent writing the output:
  if (!is.na(compfile) && !checkFilename(compfile))
    stop(paste(funcname,": compfile not valid or not writable", sep=""))
  if (!is.na(combscorefile) && !checkFilename(combscorefile))
    stop(paste(funcname, ": combscorefile not valid or not writable", sep=""))

  if (is.factor(chk$parent1)) chk$parent1 <- as.integer(as.character(chk$parent1))
  if (is.factor(chk$parent2)) chk$parent2 <- as.integer(as.character(chk$parent2))
  if (is.factor(chk[,shiftcol]))
    chk[,shiftcol] <- as.integer(as.character(chk[,shiftcol]))
  if (is.factor(parent1)) parent1 <- as.character(parent1)
  if (is.factor(parent2)) parent2 <- as.character(parent2)
  if (is.factor(F1)) F1 <- as.character(F1)
  if (is.factor(ancestors)) ancestors <- as.character(ancestors)
  if (is.factor(other)) other <- as.character(other)

  snpnames <- sort(unique(chk$SNPname)) #SNPname is character

  #note that the two probes don't have the same set of SNPs, as some SNPs
  #may be rejected by fitPoly for one probe but not for the other,
  #and, if some markers are shifted, again this will not be the case for all.
  #we now make some arrays that for all SNPs
  #contain some data for each probe/shift combination, or NA where a probe was
  #rejected or the shifted or nonshifted situation did not apply
  segtype <- array(NA, dim=c(length(snpnames), 2, 2),
                   dimnames=list(
                     snpname=snpnames,
                     probe=c("probe1","probe2"),
                     shift=c("false","true")))
  chkshift <- segtype #for each marker/probe/shift the actual amount of shift
  qall <- segtype #for each marker/probe/shift the qall of the selected segtype (if any)
  cons.parent1 <- segtype #for each marker/probe/shift the consensus parent1 score (if any)
  cons.parent2 <- segtype #for each marker/probe/shift the consensus parent2 score (if any)
  Rsegtype <- segtype #for each comb of probe1 shifted/nonshifted and probe2 shifted/nonshifted
  #                    the combined segtype (R=P&Q) or NA
  #we fill 5 of these 6 3D arrays with data from chk, except (for the moment) Rsegtype:
  for (sh in 1:2) {
    if (sh == 1) sel.sh <- chk$shift == 0 else
      sel.sh <- chk$shift != 0
    if (noprobes)proberange <- 1 else proberange <- 1:2
    for (prb in proberange) {
      if (noprobes) {
        prbdat <- chk[sel.sh, ]
      } else {
        prbdat <- chk[chk$probe == probe.suffix[prb] & sel.sh, ]
      }
      rows <- match(prbdat$SNPname, snpnames)
      segtype[rows, prb, sh] <- prbdat[, segtypecol]
      chkshift[rows, prb, sh] <- prbdat[, shiftcol]
      if (length(qallcol) == 1) qall[rows, prb, sh] <- prbdat[, qallcol]
      cons.parent1[rows, prb, sh] <- prbdat$parent1
      cons.parent2[rows, prb, sh] <- prbdat$parent2
    }
  } #for sh
  seginfo <- selSegtypeInfo(calcSegtypeInfo(ploidy, ploidy2),
                            polysomic, disomic, mixed,
                            selfing=checkSelfing(parent1, parent2))
  if (sum(is.na(match(unique(chk[, segtypecol]),names(seginfo)))) > 0)
    stop("compareProbes: chk contains segtypes not expected with the settings\nof polysomic/disomic/mixed/ploidy/ploidy2")
  scores <- scores[scores$SampleName %in% allsamp,]
  #create an initial empty data frame to hold the combined scores;
  #this will be increased if needed
  combscores <- makeCombscoresDf(parent1=parent1, parent2=parent2, F1=F1,
                                 ancestors=ancestors, other=other,
                                 nrow=nrow(chk))
  combrow <- 0
  #subsetting from a big scores data.frame takes a lot of time
  #(approx 2 sec for each marker, in a file of ~26000 markers and 480 samples)
  #Therefore we subdivide it in batches of 200 SNPs
  batchsize <- 200
  batchcount <- (length(snpnames)-1) %/% batchsize + 1
  batchnr <- 1
  batchscores <- list()
  while (batchsize*(batchnr-1) < length(snpnames)) {
    cat(paste("batch", batchnr, "of", batchcount, "\n"))
    minmrk <- batchsize * (batchnr-1) + 1
    maxmrk <- min(length(snpnames), batchsize*batchnr)
    batchmarkers <- snpnames[minmrk:maxmrk]
    if (!noprobes) {
      batchmarkers <- c(paste(batchmarkers, probe.suffix[1], sep=""),
                        paste(batchmarkers, probe.suffix[2], sep=""))
      #note that for all snpnames both a P and a Q markername are created,
      #even if only one of the two in scores
    }
    batchscores <- scores[scores$MarkerName %in% batchmarkers, ]
    for (mrk in minmrk:maxmrk) {
      #for the current marker, list sc will hold the scores:
      #sc[[1]] and sc[[2]] are lists for probes P and Q (and sc[[3]] for R?)
      sc <- list()
      #the following 3D arrays are similar to the previously defined 6 3D arrays
      #except that here the 1st dim is samples while earlier it was markers
      progen.sc <- array(NA, dim=c(length(progenitors), 2, 2), #all progenitor samples scores
                         dimnames=list(
                           sample=progenitors,
                           probe=c("probe1","probe2"),
                           shift=c("FALSE","TRUE")))
      F1.sc <- array(NA, dim=c(length(F1)+length(other), 2, 2), #all F1 + other samples scores
                     dimnames=list(
                       sample=c(F1, other),
                       probe=c("probe1","probe2"),
                       shift=c("FALSE","TRUE")))
      parent.sc <- array(NA, dim=c(2, 2, 2), #both consensus parental scores
                         dimnames=list(
                           sample=c("parent1", "parent2"),
                           probe=c("probe1","probe2"),
                           shift=c("FALSE","TRUE")))
      P0col <- which(names(scores) == "P0")
      if (length(P0col) != 1)
        stop("compareProbes: scores should contain columns P0 .. P<ploidy>")
      for (prb in 1:2) if (any(!is.na(segtype[mrk, prb,]))) {
        #unshifted and/or shifted version present for this marker/probe
        if (noprobes) {
          mrkname_scores <- snpnames[mrk]
        } else {
          mrkname_scores <- paste(snpnames[mrk], probe.suffix[prb], sep="")
        }
        sc[[prb]] <- list()
        #sc[[prb]][[1]] gets the unshifted score data for the population samples
        #               i.e. all columns from the fitPoly output including
        #               P0..P<ploidyF1>  and geno,
        #               and all samples in parent1, parent2, ancestors, F1, other
        sc[[prb]][[1]] <- batchscores[batchscores$MarkerName == mrkname_scores &
                                        batchscores$SampleName %in% allsamp, ]
        sc[[prb]][[1]] <-
          sc[[prb]][[1]][match(allsamp, sc[[prb]][[1]]$SampleName),] #sorted
        #if this probe is shifted, sc[[prb]][[2]] gets the shifted score data:
        if (!is.na(segtype[mrk, prb, 2])) {
          sc[[prb]][[2]] <- sc[[prb]][[1]]
          #shift the geno values:
          sv <- chkshift[mrk, prb, 2] #shift value, positive or negative, not 0
          # The shift in the next lines must not be applied to progenitors
          # if shiftParents is FALSE; this is done with shfsamp
          sc[[prb]][[2]]$geno[shfsamp] <- sc[[prb]][[2]]$geno[shfsamp] + sv
          #old: geno shifted out of 0.. ploidyF1 made missing:
          #inval <- shfsamp & !(sc[[prb]][[2]]$geno %in% 0:ploidyF1)
          #sc[[prb]][[2]]$geno[inval] <- NA
          #new: geno shifted out of 0..ploidyF1 made equal to 0 or ploidyF1:
          inval <- shfsamp & sc[[prb]][[2]]$geno < 0
          sc[[prb]][[2]]$geno[inval] <- 0
          inval <- shfsamp & sc[[prb]][[2]]$geno > ploidyF1
          sc[[prb]][[2]]$geno[inval] <- ploidyF1

          #shift the P-value columns (only to adjust the Pmax)
          if (sv > 0) {
            #old: discard the P's of classes shifted beyond ploidyF1
            #sc[[prb]][[2]][shfsamp, P0col+(sv:ploidyF1)] <-
            #  sc[[prb]][[2]][shfsamp, P0col+(0:(ploidyF1-sv))]
            #new: sum all P's for classes that are merged into dosage==ploidy:
            sc[[prb]][[2]][shfsamp, P0col+ploidyF1] <-
              sum(sc[[prb]][[2]][shfsamp, P0col+((ploidyF1-sv):ploidyF1)])
            #and shift all other existing P values:
            if (sv < ploidy)
              sc[[prb]][[2]][shfsamp, P0col+(sv:(ploidyF1-1))] <-
                sc[[prb]][[2]][shfsamp, P0col+(0:(ploidyF1-1-sv))]
            #and set the P's for classes that are new to 0:
            sc[[prb]][[2]][shfsamp ,P0col+(0:(sv-1))] <- 0
          } else {
            #chkshift < 0
            #old: discard the P's of classes shifted beyond 0
            #sc[[prb]][[2]][shfsamp, P0col+(0:(ploidyF1+sv))] <-
            #  sc[[prb]][[2]][shfsamp, P0col+((-sv):ploidyF1)]
            #new: sum all P's for classes that are merged into dosage==0:
            sc[[prb]][[2]][shfsamp, P0col] <-
              sum(sc[[prb]][[2]][shfsamp, P0col+(0:(-sv))])
            #and shift all other existing P values:
            if ((-sv) < ploidyF1)
              sc[[prb]][[2]][shfsamp, P0col+(1:(ploidyF1+sv))] <-
                sc[[prb]][[2]][shfsamp, P0col+((1-sv):ploidyF1)]
            #and set the P's for classes that are new to 0:
            sc[[prb]][[2]][shfsamp, P0col+(ploidyF1+sv+1):ploidyF1] <- 0 #NA
          }
          suppressWarnings(
            #the warning occur when a sample is not genotyped for this marker
            #(its ratio was set to NA before the fitPoly run,
            #perhaps based on a too low signal intensity).
            #this results in sc[[prb]][[2]]$maxP being assigned -Inf;
            #that is not a problem later on
            sc[[prb]][[2]]$maxP <-
              apply(sc[[prb]][[2]][P0col+(0:ploidyF1)], 1, max, na.rm=TRUE)
          )
        }
        #Now we have a list sc of data frames for the scores per probe (P,Q=1,2),
        #without (1) or with (2) shift.
        #Note that only the P-value columns and the geno value are shifted,
        #not the maxgeno; and maxP is the max of the remaining P-values

        #next we extract geno values for subsets of samples
        #and write the markers:
        for (sh in 1:2) {
          if (!is.na(segtype[mrk, prb, sh])) {
            mrkname_out <- mrkname_scores
            if (shiftspresent) mrkname_out <-
                paste(mrkname_out, c("n", "s")[sh], sep="")
            #progen.sc[, prb, sh] <-
            #    sc[[prb]][[sh]]$geno[sc[[prb]][[sh]]$SampleName %in% progenitors]
            #adapted, as (for triploids) parents are not always scored
            progensc <- sc[[prb]][[sh]][sc[[prb]][[sh]]$SampleName %in% progenitors,]
            progen.sc[match(progensc$SampleName,progenitors), prb, sh] <-
              progensc$geno
            F1.sc[, prb, sh] <-
              sc[[prb]][[sh]]$geno[sc[[prb]][[sh]]$SampleName %in% c(F1, other)]
            parent.sc[, prb, sh] <- getPargeno(cons.parent1[mrk, prb, sh],
                                               cons.parent2[mrk, prb, sh],
                                               segtype=segtype[mrk, prb, sh],
                                               seginfo=seginfo)

            #now we have all data for this mrk / prb / shift combination,
            #we write it to the combscores data frame:
            combrow <- combrow+1
            if (combrow > nrow(combscores))
              combscores <- rbind(combscores,
                                  makeCombscoresDf(parent1=parent1, parent2=parent2,
                                                   F1=F1, ancestors=ancestors,
                                                   other=other, nrow=5000)
                                 )
            combscores$MarkerName[combrow] <- mrkname_out
            combscores$segtype[combrow] <- segtype[mrk, prb, sh]
            combscores[combrow, 3:length(combscores)] <-
              c(progen.sc[, prb, sh],
                parent.sc[, prb, sh],
                F1.sc[, prb, sh])
          }
        } #for sh
      } #for prb if !is.na(segtype[mrk, prb])

      #now we have written the scores per probe/sh to combscorefile for current marker mrk
      #finally we check for which markers we can produce combined scores for both probes:
      if (!noprobes) {
        numcomb <- 0
        for (sh1 in 1:2) for (sh2 in 1:2)
          if (!is.na(segtype[mrk, 1, sh1]) && !is.na(segtype[mrk, 2, sh2]) &&
              segtype[mrk, 1, sh1] == segtype[mrk, 2, sh2]) {
            bothscored <- sum(!is.na(sc[[1]][[sh1]]$geno) &
                                !is.na(sc[[2]][[sh2]]$geno))
            diff <- sc[[1]][[sh1]]$geno != sc[[2]][[sh2]]$geno
            Ndifferent <- sum(diff, na.rm=TRUE)
            fracdiff <- Ndifferent / bothscored
            #Note: this takes into account all samples if shiftParents TRUE,
            #and only F1 + other (i.e. non-progenitor) samples if shiftParents FALSE
            if (!is.na(fracdiff) && fracdiff <= fracdiff.threshold) {
              #combine the scores of both probes for this marker
              Rsegtype[mrk, sh1, sh2] <- segtype[mrk, 1, sh1]
              numcomb <- numcomb + 1 # it is possible that more than one combination
              #                        of shifts for the same SNP is possible
              #first fill in missing scores from sc[[2]]:
              combgeno <- sc[[1]][[sh1]]$geno
              w <- is.na(sc[[1]][[sh1]]$geno) & !is.na(sc[[2]][[sh2]]$geno)
              combgeno[w] <- sc[[2]][[sh2]]$geno[w]
              #next, for the different scores take the one with highest maxP:
              w <- !is.na(sc[[1]][[sh1]]$geno) &
                   !is.na(sc[[2]][[sh2]]$geno) &
                   sc[[2]][[sh2]]$geno != sc[[1]][[sh1]]$geno &
                   sc[[2]][[sh2]]$maxP > sc[[1]][[sh1]]$maxP
              combgeno[w] <- sc[[2]][[sh2]]$geno[w]
              #combine the parental consensus scores:
              parent.na <- matrix(NA, nrow=2, ncol=2)
              parent.na[1,] <- is.na(parent.sc[, 1, sh1])
              parent.na[2,] <- is.na(parent.sc[, 2, sh2])
              #parent.na has rows for probe 1 and 2, and columns for
              #parent 1 and 2, for the current set of sh1 and sh2
              combparent <- parent.sc[, 1, sh1] #both parental scores of probe 1
              #missing in probe 1 filled in from probe 2:
              w <- parent.na[1,] & !parent.na[2,]
              combparent[w] <- parent.sc[w, 2, sh2]
              #different in the two probes made missing (in contrast to sample
              #scores, for the consensus scores we don't have P-values to
              #select one):
              w <- !parent.na[1,] & !parent.na[2,] &
                   parent.sc[, 1, sh1] != parent.sc[, 2, sh2]
              combparent[w] <- NA
              #possible problem left: both probes have one (a different) parent
              #genotyped; in each is matchParent=OneOK, but the combination does
              #not match; then set both parents to NA:
              if (xor(parent.na[1,1], parent.na[1,2]) &&
                  xor(parent.na[2,1], parent.na[2,2]) &&
                  xor(parent.na[1,1], parent.na[2,1]) &&
                  getMatchParents(combparent, seginfo[[segtype[mrk, prb]]]) == "No") {
                #note that one can index a list (or vector) using the element names!
                combparent <- c(NA, NA)
              }
              if (shiftspresent) {
                mrkname_cmb <- paste(snpnames[mrk], "R",
                                     paste(c("n","s")[c(sh1, sh2)], collapse=""),
                                     sep="")
              } else {
                mrkname_cmb <- paste(snpnames[mrk], "R", sep="")
              }
              combrow <- combrow+1
              if (combrow > nrow(combscores))
                combscores <-
                  rbind(combscores,
                        makeCombscoresDf(parent1=parent1, parent2=parent2,
                                         F1=F1, ancestors=ancestors,
                                         other=other, nrow=5000)
                       )
              combscores$MarkerName[combrow] <- mrkname_cmb
              combscores$segtype[combrow] <- segtype[mrk, 1, sh1]
              combscores[combrow, 3:length(combscores)] <-
                c(combgeno[1:length(progenitors)],
                  combparent,
                  combgeno[(length(progenitors)+1):length(combgeno)])
            } #if fracdiff
          } #if segtypesame
      } #if !noprobes
    } #for mrk
    batchnr <- batchnr + 1
  } #while batch  probe.suffix[prb]
  if (combrow==0) combscores <- combscores[0,] else
    combscores <- combscores[1:combrow,]
  if (!is.na(combscorefile))
    writeDatfile(combscores, combscorefile)
  #finally we write an overview file:
  if (noprobes) {
    compstat <- NA
  } else {
    compstat <- data.frame(SNPname=snpnames)
    sufx <- matrix(c(paste(probe.suffix[1], "n", sep=""),
                     paste(probe.suffix[1], "s", sep=""),
                     paste(probe.suffix[2], "n", sep=""),
                     paste(probe.suffix[2], "s", sep="")), 2, 2, byrow=TRUE)
    max_sh <- 2
    Rsufx <- matrix(c("nn", "ns", "sn", "ss"), 2, 2, byrow=TRUE)
    if (!shiftspresent) {
      max_sh <- 1
      sufx <- substr(sufx, 1, 1)
      Rsufx <- matrix("")
    }
    for (prb in 1:2) for (sh in 1:max_sh) {
      compstat[[paste("segtype", sufx[prb, sh], sep="")]] <- segtype[, prb, sh]
      if (length(qallcol) == 1) {
        compstat[[paste("qall", sufx[prb, sh], sep="")]] <- qall[, prb, sh]
      }
    }
    for (sh1 in 1:max_sh) for (sh2 in 1:max_sh) {
      compstat[[paste("segtype", probe.suffix[3], Rsufx[sh1, sh2], sep="")]] <-
        Rsegtype[, sh1, sh2]
    }
    for (prb in 1:2) {
      compstat[[paste("count", probe.suffix[prb], sep="")]] <-
        rowSums(!is.na(segtype[, prb,]))
    }
    compstat[[paste("count", probe.suffix[3], sep="")]] <-
      apply(!is.na(Rsegtype), 1, sum)
    rownames(compstat) <- NULL
    if (!is.na(compfile))
      writeDatfile(compstat, compfile)
  } #if !noprobes
  invisible(list(combscores=combscores, compstat=compstat))
} # compareProbes

# removeRedundant ***************************************************************
#'@title Remove redundant single-probe markers
#'@description If for a SNP both probes produced very similar results, function
#'compareProbes generated an extra marker combining the results from both
#'probes. In that case the original single-probe markers are redundant and this
#'function removes those data from the compstat and combscores data frames.
#'
#'@usage removeRedundant(compstat, combscores, compfile, combscorefile)
#'
#'@param compstat a data.frame as returned by compareProbes in the
#'compstat item of the return value and in the compfile
#'@param combscores a data.frame as returned by compareProbes in the
#'@param compfile a filename to which the compstat part of the results is
#'written; if NA no file is written
#'@param combscorefile a filename to which the compscores part of the results is
#'written; if NA no file is written
#'
#'@return A list with two components, compstat and combscores. These are identical
#'to the parameters compstat and combscores, with the redundant single-probe
#'markers removed.For their contents see function compareProbes.
#'@export
removeRedundant <- function(compstat, combscores, compfile, combscorefile) {
  #make sure there is no locked file that will prevent writing the output:
  if (!is.na(compfile) && !checkFilename(compfile))
    stop("removeRedundant: compfile not valid or not writable")
  if (!is.na(combscorefile) && !checkFilename(combscorefile))
    stop("removeRedundant: combscorefile not valid or not writable")
  #get the probe.suffix letters:
  sw <- which(substr(names(compstat), 1, 7) == "segtype") #column numbers in compstat
  #make sure that empty segtype cells are NA:
  for (i in sw) {
    compstat[,i] <- as.character(compstat[,i])
    compstat[!is.na(compstat[,i]) & compstat[,i] == "", i] <- NA
    compstat[,i] <- factor(compstat[,i])
  }
  qw <- which(substr(names(compstat), 1, 4) == "qall")
  #note that qall columns may or may not be present
  sn <- names(compstat)[sw] #segtypePn ... segtypeRss, or segtypeP, segtypeQ, segtypeR
  shiftspresent <- length(sn) == 8 #2P, 2Q, 4R; else length(sn)==3: P, Q, R
  if (shiftspresent) {
    shift.suffix <- c("n", "s")
    probe.suffix <- c(substr(sn[1], 8, nchar(sn[1])-1), #P
                      substr(sn[3], 8, nchar(sn[3])-1), #Q
                      substr(sn[5], 8, nchar(sn[5])-2)) #R
    scol <- matrix(sw[1:6], 3, 2, byrow=TRUE) #probe(P/Q/R) in row, shift in col,
    #                           cell value is column nr in compstat
    #Note: for R there are 4 columns (nn/ns/sn/ss), here only 2 are listed,
    #but actually only the entry for the first R column is used
    row.names(scol) <- probe.suffix
    colnames(scol) <- shift.suffix
    if (length(qw) > 0) qcol <- matrix(qw, 2, 2, byrow=TRUE)
    max_sh <- 2 #shift 1: not shifted; shift 2: shifted
  } else {
    shift.suffix <- ""
    probe.suffix <- c(substr(sn[1], 8, nchar(sn[1])), #P
                      substr(sn[2], 8, nchar(sn[2])), #Q
                      substr(sn[3], 8, nchar(sn[3]))) #R
    scol <- matrix(sw, 3, 1) #col nrs of P/Q/R in compstat
    row.names(scol) <- probe.suffix
    if (length(qw) > 0) qcol <- matrix(qw, 2, 1, byrow=TRUE)
    max_sh <- 1
  }
  omitmrk <- character(0)
  for (sh1 in 1:max_sh) for (sh2 in 1:max_sh) {
    combsnp <- !is.na(compstat[, scol[3, 1] + 2*(sh1-1) + (sh2-1)]) # column of
    #          the R markers for this combination of P and Q shifts
    #for these markers, both the P and Q (with the same shift) are present and
    #must be omitted:
    if (sh1 != sh2) {
      #if one probe was shifted and the other not, the current combination is
      #the only possibly correct one and we omit also the other-shifted
      #versions of the single probe markers:
      omitmrk <- c(omitmrk,
                   paste(compstat$SNPname[combsnp],
                         probe.suffix[1], shift.suffix[sh1], sep=""),
                   paste(compstat$SNPname[combsnp],
                         probe.suffix[1], shift.suffix[sh2], sep=""),
                   paste(compstat$SNPname[combsnp],
                         probe.suffix[2], shift.suffix[sh1], sep=""),
                   paste(compstat$SNPname[combsnp],
                         probe.suffix[2], shift.suffix[sh2], sep=""))
      compstat[combsnp, scol[1, ]] <- NA
      compstat[combsnp, scol[2, ]] <- NA
      if (length(qw) > 0) {
        compstat[combsnp, qcol[1, ]] <- NA
        compstat[combsnp, qcol[2, ]] <- NA
      }
    } else {
      #the current combination has the same shift for both probes,
      #we delete only those versions of the probes because the other-shifted
      #version might actually be the correct one (and there the combination
      #might not have been made due to a lower or higher number of deleted
      #scores):
      omitmrk <- c(omitmrk,
                   paste(compstat$SNPname[combsnp],
                         probe.suffix[1], shift.suffix[sh1], sep=""),
                   paste(compstat$SNPname[combsnp],
                         probe.suffix[2], shift.suffix[sh2], sep=""))
      #erase the segtype and qall for the removed markers:
      compstat[combsnp, scol[1, sh1]] <- NA
      compstat[combsnp, scol[2, sh2]] <- NA
      if (length(qw) > 0) {
        compstat[combsnp, qcol[1, sh1]] <- NA
        compstat[combsnp, qcol[2, sh2]] <- NA
      }
    }
  }
  combscores <- combscores[!(combscores$MarkerName %in% omitmrk), ]
  if (!is.na(combscorefile))
    writeDatfile(combscores, combscorefile)
  #adjust the count columns in compstat:
  for (prb in 1:2) {
    compstat[[paste("count", probe.suffix[prb], sep="")]] <-
      rowSums(!is.na(compstat[, scol[prb,,drop=FALSE],drop=FALSE]))
  }
  if (shiftspresent) Rcols <- scol[3,1]+(0:3) else Rcols <- scol[3,1] #NOT +(0:1) ???
  compstat[[paste("count", probe.suffix[3], sep="")]] <-
    rowSums(!is.na(compstat[, Rcols, drop=FALSE]))
  if (!is.na(compfile))
    writeDatfile(compstat, compfile)
  invisible(list(combscores=combscores, compstat=compstat))
} #removeRedundant

#'@title Convert the F1 dosage scores data.frame or file to polymapR
#'input matrix
#'@description Functions writeDosagefile, compareProbes and removeRedundant
#'produce score files. This function reads such a file and converts the result
#'to a data.frame as needed by the polymapR package.
#'
#'@usage F1Dosages2Matrix(dosages, outfile=NA, dec=".", sep=",")
#'
#'@param dosages name of an F1 dosage scores file as produced by functions
#'writeDosagefile, compareProbes or removeRedundant, or a data frame
#'read from such a file
#'@param outfile (path and) name of an output file, default NA means that no
#'file is written. The file is written by function writeDatfile; normally using
#'the default parameters but if file extension is 'csv', quote is set to TRUE
#'and the specified dec and sep parameters are used
#'@param dec character to use as decimal separator in the output file,
#'default "."; only used if extension of outfile is 'csv' (otherwise ".")
#'@param sep character to use as field separator in the output file,
#'default ","; only used if extension of outfile is 'csv' (otherwise
#'tab character)
#'
#'@return A matrix as needed by the polymapR package: one row per marker,
#'one column per sample, containing integer scores (0 .. ploidy) or NAs.
#'Row names are marker names, column names are sample names; the first two
#'columns are the (consensus) scores of Parent1 and Parent2.\cr
#'If outfile is not NA also an output file is written: if the extension is
#''csv' a csv file is written with the specified decimal and field separators
#'(default '.' and ',') and with row and column names quoted, else a
#'tab-separated file.
#'@export
F1Dosages2Matrix <- function(dosages, outfile=NA,
                             dec=".", sep=",") {
  if (length(outfile) != 1) outfile <- outfile[1]
  if (is.null(outfile) || outfile == "") outfile <- NA
  if (!is.na(outfile) & !check.filename(outfile))
    stop(paste("F1Dosages2Matrix: cannot write file", outfile))
  if (is.character(dosages) && length(dosages) == 1)
    dosages <- readDatfile(dosages)
  if (!is.data.frame(dosages))
    stop("F1Dosages2Matrix: dosages must be a data.frame or a filename")
  parcol1 <- which(tolower(names(dosages)) == "parent1")
  pstr <- leftstr(names(dosages)[parcol1], -1) #parent in upper/lower case
  if (length(parcol1) > 1) {
    stop("F1Dosages2Matrix: dosages must have one column 'parent1'")
  } else if (length(parcol1) == 0) {
    parcol1 <- which(names(dosages) == "P1")
    if (length(parcol1) != 1)
      stop("F1Dosages2Matrix: dosages must have one column 'parent1'")
    pstr <- "P"
  }
  parcol2 <- which(names(dosages) == paste(pstr, 2, sep=""))
  if (length(parcol2) != 1 || parcol2 != parcol1 + 1)
    stop("F1Dosages2Matrix: column for parent1 and parent2 must be consecutive")
  dosages <- dosages[, -(2:(parcol1-1))]
  if (!is.na(outfile)) {
    if (tolower(tools::file_ext(outfile)) == "csv") {
      writeDatfile(dosages, outfile, quote = TRUE, sep=sep, dec=dec)
    } else writeDatfile(dosages, outfile)
  }
  mat <- as.matrix(dosages[, -1])
  rownames(mat) <- dosages[,1]
  invisible(mat)
} #F1Dosages2Matrix

# expandUnknownParents *******************************************************
#'@title Generate markers with all combinations of parental scores
#'@description For markers where the segregation type in the F1 is known but the
#'parental consensus scores are missing, this function generates multiple
#'versions of the marker, each with a different combination of parental scores
#'matching the segregation and a different suffix to the marker name.
#'
#'@usage expandUnknownParents(scores, sep="@",
#'polysomic=TRUE, disomic=FALSE, mixed=FALSE,
#'ploidy, ploidy2, scorefile)
#'
#'@param scores a data frame as returned by writeDosagefile, or the combscores
#'item in the return value of compareProbes or removeRedundant
#'@param sep a short string (one or more characters, default "@")
#'to separate the original marker name from the consecutive letters
#'(a, b, c etc) that identify the different versions of the marker. Markers
#'that already have parental scores are not modified
#'@param polysomic TRUE or FALSE; should be the same as used by checkF1
#'@param disomic TRUE or FALSE; should be the same as used by checkF1
#'@param mixed TRUE or FALSE; should be the same as used by checkF1
#'@param ploidy the ploidy of parent 1 (must be even, 2 (diploid) or larger),
#'and the same as used by checkF1
#'@param ploidy2 the ploidy of parent 2. If omitted it is assumed to be equal
#'to ploidy. Should be the same as used by checkF1
#'@param scorefile a filename to which the result is written;
#'if NA no file is written
#'
#'@return A data frame with the same format and contents as parameter scores,
#'with each marker where parental scores were missing expanded to
#'multiple rows, one per parental dosage combination
#'@export
expandUnknownParents <- function(scores,
                                 sep="@",
                                 polysomic=TRUE, disomic=FALSE, mixed=FALSE,
                                 ploidy, ploidy2,
                                 scorefile) {
  if (ploidy %% 2 != 0)
    stop("expandUnknownParents: odd parental ploidy not allowed")
  if (missing(ploidy2) || is.na(ploidy2)) ploidy2 <- ploidy else {
    if (ploidy2 %% 2 != 0)
      stop("expandUnknownParents: odd parental ploidy2 not allowed")
  }
  ploidyF1 <- (ploidy + ploidy2) / 2
  if (!polysomic && !disomic && !mixed)
    stop("expandUnknownParents: at least one of polysomic, disomic and mixed must be TRUE")
  if (!is.na(scorefile) && !checkFilename(scorefile))
    stop("expandUnknownParents: scorefile not valid or not writable")
  seginfo <- calcSegtypeInfo(ploidy, ploidy2)
  allsegtypenames <- names(seginfo)
  seginfo <- selSegtypeInfo(seginfo, polysomic, disomic, mixed)
  unpar <- which(is.na(scores$parent1) & is.na(scores$parent2))
  if (length(unpar) > 0) {
    #at least one marker needs to be expanded
    scores$MarkerName <- as.character(scores$MarkerName)
    scores$segtype <- as.character(scores$segtype)
    p12 <- which(names(scores) == "parent1")
    p12 <- p12:(p12+1)
    newscores <- scores[0,] # an empty data frame with same columns
    for (m in unpar) {
      mname <- scores$MarkerName[m]
      parscores <- seginfo[[scores$segtype[m]]]$pardosage
      #we put the first version into the old scores data frame:
      scores$MarkerName[m] <- paste(mname, "a", sep=sep)
      scores[m, p12] <- parscores[1,]
      #if there are more versions we add those to newscores:
      if (nrow(parscores) > 1) {
        mscores <- scores[rep(m, nrow(parscores)-1),] #copy the scores row several times
        mscores$MarkerName <- paste(mname, letters[2:nrow(parscores)], sep=sep)
        mscores[,p12] <- parscores[-1,]
        newscores <- rbind(newscores, mscores)
      }
    } # for m in unpar
    if (nrow(newscores) > 0) {
      scores <- rbind(scores, newscores)
      scores <- scores[order(scores$MarkerName),]
    }
  } #length(unpar)>0
  if (!is.na(scorefile)) writeDatfile(scores, scorefile)
  scores
} #expandUnknownParents


# checkFilename *************************************************************
#'@title Check if a file can be created
#'@description Checks if a file with that name can be created.
#'If successful, any pre-existing file does not exist any more.
#'
#'@usage checkFilename(filename, overwrite=TRUE)
#'
#'@param filename a file name with or without path. If the name contains a
#'path the entire path must already exist, else the file cannot be created
#'and the result is FALSE.
#'@param overwrite if TRUE an attempt is made to delete any pre-existing file
#'of that name; the function returns FALSE if the file is locked or the user
#'has no rights to delete the file. If FALSE the function returns FALSE
#'if any file of that name already exists, and no attempt is made to remove it
#'
#'@return TRUE if a new file can be created, else FALSE. If TRUE, no file of
#'that name exists (any more)
#'@export
checkFilename <- function(filename, overwrite=TRUE) {
  filename <- filename[1] # we test only the first filename
  if (!overwrite && file.exists(filename)) return(FALSE)
  tryCatch(
    { suppressWarnings({
      if (file.exists(filename)) file.remove(filename)
      if (file.create(filename)) {
        file.remove(filename)
        TRUE
      } else FALSE})
    }, error = function(e) { FALSE }
  )
} #checkFilename

# scores2wide *************************************************************
#'@title Convert a scores file from long to wide format
#'@description The fitMarkers function of package fitPoly returns a
#'scores file in "long" format: one row for each MarkerName / SampleName
#'combination. This function creates a file in "wide" format (samples in
#'columns, markers in rows) containing the data from the geno column of the
#'scores file. Faster, less memory-expensive and easier to use than reshape.
#'
#'@usage scores2wide(scores, outfile=NA)
#'
#'@param scores a data frame read from the scorefile output of saveMarkeModels
#'in package fitPoly. The order of the markers within each sample does not have
#'to be the same or v.v., there is no requirement for any ordering of scores.
#'However if scores is not ordered by MarkerName and SampleName and/or not all
#'markers are present for all samples (or v.v.) this results in a slower
#'processing.
#'@param outfile a file name to which the result is written. If NA no file
#'is written.
#'
#'@return A data frame with column names marker, MarkerName and all sample
#'names, and one row per marker. Marker is the sequential number of the marker
#'as reported in the scores data frame.
#'@export
scores2wide <- function(scores, outfile=NA) {
  outfile <- outfile[1]
  if (!is.na(outfile) && !checkFilename(outfile))
    stop("scores2wide: outfile is not a valid or writable file")
  #convert Markername and sample to factors, eliminating unused levels and
  #with levels sorted according to the current locale:
  markernrs <- scores$marker
  sampnames <- as.character(unique(scores$SampleName))
  mrknames <- as.character(unique(scores$MarkerName))
  #get the original marker(number) of each marker present:
  marknrs <- tapply(markernrs, factor(scores$MarkerName),
                    FUN=function(x) mean(x, na.rm=TRUE))
  rm(markernrs)
  #check if scores is complete and sorted:
  if (nrow(scores) > length(sampnames) * length(mrknames))
    stop("scores2wide: each combination of markername and sample may occur only once")
  df <- expand.grid(SampleName=sampnames, MarkerName=mrknames,
                    stringsAsFactors=FALSE) #needed!
  if (nrow(scores) == length(sampnames) * length(mrknames)) {
    #sort scores: it is important that within all samples all markers are in the
    #same order:
    if (sum(df$MarkerName != scores$MarkerName) > 0 ||
        sum(df$SampleName != scores$SampleName) > 0) {
      scores <- scores[order(scores$MarkerName, scores$SampleName),]
    }
  } else {
    #scores incomplete: not all combinations of markername and sample occur
    #(and some might be double, but we don't check for that)
    #we make it complete and sort it:
    scores <- merge(x=scores, y=df, all.y=TRUE, sort=TRUE)
  }
  #convert the geno column to wide format:
  wide <- matrix(integer(length(mrknames) * length(sampnames)),
                 nrow=length(mrknames), ncol=length(sampnames))
  keepcols <- match(c("MarkerName", "SampleName", "geno"), names(scores))
  batchsize <- 10 #there are usually far less samples than markers
  maxsamp <- 0
  while (maxsamp < length(sampnames)) {
    minsamp <- maxsamp + 1
    maxsamp <- min(minsamp + batchsize - 1, length(sampnames))
    batchscores <- scores[scores$SampleName %in% sampnames[minsamp:maxsamp],
                          keepcols]
    for (samp in minsamp:maxsamp) {
      sampscores <- batchscores[batchscores$SampleName == sampnames[samp],]
      # if (checksc) {
      #   missmrk <- setdiff(1:length(mrknames), as.integer(sampscores$MarkerName))
      #   if (length(missmrk) > 0) {
      #     sampscores <- sampscores[1:length(mrknames),] #extra rows with NA
      #     sampscores$MarkerName[(length(mrknames)-length(missmrk)+1):
      #                             length(mrknames)] <- mrknames[missmrk]
      #   }
      # }
      # if (samp == 1) {
      #   #check if markers in order of mrknames and if not, rearrange mrknames:
      #   marknrs <- sampscores$marker
      #   if (setequal(sampscores$MarkerName, mrknames)) {
      #     mrknames <- as.character(sampscores$MarkerName)
      #   } else stop ("scores2wide: all markers must be present for all samples and v.v.")
      # } else {
      #   #samp > 1, markers must be (put) in same order
      #   if (!setequal(sampscores$MarkerName, mrknames))
      #     stop ("scores2wide: all markers must be present for all samples and v.v.")
      #   if (sum(sampscores$MarkerName != mrknames) > 0)
      #     sampscores <- sampscores[match(mrknames, sampscores$MarkerName),]
      # }
      wide[, samp] <- sampscores$geno
    }
  }
  wide <- data.frame(marker = marknrs,
                     MarkerName = mrknames,
                     wide)
  colnames(wide) <- c("marker", "MarkerName", sampnames)
  rownames(wide) <- NULL
  if (!is.na(outfile)) {
    writeDatfile(wide, outfile)
  }
  invisible(wide)
} #scores2wide

# samplestats *************************************************************
#'@title Statistics for each F1 sample over all markers
#'@description Tabulate for each F1 sample in how many markers it has a
#'missing, invalid or valid score.
#'
#'@usage samplestats(chk, scores, F1, qall_flavor="qall_mult",
#'qall_threshold=0, ploidy, ploidy2, scores_long=TRUE)
#'
#'@param chk a data frame as returned by checkF1
#'@param scores A data frame as read from the scores file produced by function
#'fitMarkers of package fitPoly (or a subset with at least columns
#'MarkerName, SampleName and geno), or a data frame as returned by function
#'scores2wide. In the first case (default) parameter scores_long must be TRUE,
#'in the second case it must be FALSE.
#'@param F1 character vector with the sample names of the F1 individuals
#'@param qall_flavor which quality parameter column must be shown in compfile,
#'default "qall_mult". If no quality data are wanted, specify "".
#'@param qall_threshold only markers with a qall score > qall.threshold are
#'included in the tabulation
#'@param ploidy the ploidy of parent 1 (must be even, 2 (diploid) or larger),
#'and the same as used by checkF1 to calculate the chk data frame
#'@param ploidy2 the ploidy of parent 2. If omitted it is assumed to be equal
#'to ploidy. Should be the same as used by checkF1 to calculate the chk
#'data frame
#'@param scores_long TRUE if scores is in "long format", FALSE if it is in
#'"wide format" (see parameter scores)
#'
#'@return A matrix with samples in rows and 3 columns: missing, invalid, valid,
#'giving for each sample the number of markers where it has a missing, invalid
#'or valid dosage score
#'@export
samplestats <- function(chk, scores,
                        F1,
                        qall_flavor="qall_mult",
                        qall_threshold=0,
                        ploidy, ploidy2,
                        scores_long=TRUE) {
  if (ploidy %% 2 != 0) stop("samplestats: odd parental ploidy not allowed")
  if (missing(ploidy2) || is.na(ploidy2)) ploidy2 <- ploidy else
    if (ploidy2 %% 2 != 0) stop("samplestats: odd parental ploidy2 not allowed")
  ploidyF1 <- (ploidy + ploidy2) / 2
  if (scores_long) scores <- scores2wide(scores)
  #select and sort scores to have only the markers with qall>=threshold in chk,
  #with markers sorted as chk and individuals as F1
  #delete from chk all rows without a segtype (mostly those that are not fitted by fitPoly):
  chk <- chk[!is.na(chk$bestParentfit) & chk$bestParentfit != "",]
  qallcol <- which(tolower(names(chk)) == qall_flavor)
  if (length(qallcol) > 1) qallcol <- qallcol[1]
  chk <- chk[chk[,qallcol] > qall_threshold,]
  F1 <- intersect(F1, colnames(scores)) #keeps the order of F1
  scores_samp <- names(scores)[3:length(scores)] #ALL samples as ordered in scores
  scores <- scores[,c(1,2, match(F1, colnames(scores)))]
  mrk <- intersect(chk$MarkerName, scores$MarkerName) #keeps order of markers in chk
  scores <- scores[match(mrk, scores$MarkerName),]
  #calculate a list of valid dosage scores for each marker in chk:
  validscores <- list()
  validscores[[nrow(chk)]] <- integer(0)
  seginfo <- calcSegtypeInfo(ploidy, ploidy2)

  #nested function, to use in lapply
  get.expgeno <- function(segtypenr) { seginfo[[segtypenr]]$expgeno }

  mrk.segtype <- match(chk$bestParentfit, names(seginfo))
  if (sum(is.na(mrk.segtype)) > 0)
    stop("samplestats: not all segtypes in chk are valid, ploidy or ploidy2 are incorrect")
  mrk.expgeno <- lapply(as.list(mrk.segtype), FUN=get.expgeno)
  #calculate the totals:
  result <- matrix(integer(length(F1)*3), nrow=length(F1), ncol=3)
  rownames(result) <- F1
  colnames(result) <- c("missing", "invalid", "valid")
  for (i in seq_along(F1)) {
    sc <- scores[,i+2]
    result[i,1] <- sum(is.na(sc))
    result[i,3] <- sum(mapply("%in%", sc, mrk.expgeno), na.rm=TRUE)
  }
  result[,2] <- nrow(chk) - rowSums(result)
  result
} #samplestats

# selMarkers_qall *********************************************************
#'@title Sample markers at several qall levels
#'@description In order to get an idea of the quality of markers with
#'different levels of qall, this function samples marker numbers at a
#'specified set of qall values.
#'
#'@usage selMarkers_qall(chk, qall_levels, mrkperlevel=1,
#'qall_flavor="qall_mult")
#'
#'@param chk a data frame as returned by checkF1
#'@param qall_levels a numeric vector with the levels of the quality
#'parameter (qall_mult or qall_weights, see parameter qall_flavo)
#'at which markers should be selected
#'@param mrkperlevel the number of markers to select per level of qall
#'@param qall_flavor which qall to use: by default qall_mult,
#'also qall_weights is possible
#'
#'@return A subset of data frame chk with selected rows, in ascending order
#'of the qall_flavor column. The first <mrkperlevel> markers with qall >=
#'each qall.level are returned; duplicated selections are removed (so the
#'number of returned markers may be less than
#'length(qall_levels) * mrkperlevel)
#'@export
selMarkers_qall <- function(chk, qall_levels, mrkperlevel=1,
                            qall_flavor="qall_mult") {
  qallcol <- which(names(chk) == qall_flavor)
  if (length(qallcol) != 1) stop ("selMarkers_qall: invalid qall_flavor")
  chk <- chk[!is.na(chk[,qallcol]),]
  if (is.factor(chk[,qallcol])) chk[,qallcol] <- as.character(chk[,qallcol])
  if (is.character(chk[,qallcol])) {
    chk[,qallcol][chk[,qallcol] %in% c("", "NA")] <- NA
    chk[,qallcol] <- as.numeric(chk[,qallcol])
  }
  chk <- chk[order(chk[,qallcol]),]
  qall.levels <- sort(qall_levels)
  firstsel <- integer(0)
  for (i in seq_along(qall.levels)) {
    if (chk[nrow(chk), qallcol] < qall.levels[i]) {
      firstsel[i] <- nrow(chk) + 1 #no qall below threshold
    } else {
      firstsel[i] <- min(which(chk[,qallcol] >= qall.levels[i]))
    }
  }
  sel <- integer(0)
  for (i in 1:mrkperlevel) sel <- c(sel, firstsel + i - 1)
  sel <- sort(sel[!is.na(sel)])
  sel <- sel[sel <= nrow(chk)]
  sel <- sel[!duplicated(sel)]
  chk[sel,]
} #selMarkers_qall

# combineFiles *********************************************************
#'@title Combine the X and Y intensity scores and the assigned dosage in one file
#'@description This function combines the X and Y intensity values from the
#'fitPoly input with the geno (assigned dosage) from the fitPoly scores
#'output. Useful for producing XY scatterplots with samples colored
#'according to the assigned dosage.
#'
#'@usage combineFiles(XYdata, scores, controls=character(0))
#'
#'@param XYdata data.frame with (at least) columns MarkerName,
#'SampleName, X, Y (if present, R and ratio are also copied)
#'@param scores data.frame with scores produced by the
#'fitMarkers function of package fitPoly. It has columns MarkerName
#'and SampleName that are subsets of MarkerName and SampleName in the XYdata,
#'and at least a column geno
#'@param controls a character vector of sample names. The geno (dosage)
#'value of these samples is set to NA
#'@return a dataframe with columns MarkerName, SampleName, X, Y (and R and
#'ratio if these columns were present in XYdata), and geno, with all markers
#'and samples from XYdata. The value of geno is set to NA for all MarkerName /
#'sample combinations not present in scores, and for all samples in controls.
#'@export
combineFiles <- function(XYdata, scores, controls=character(0)) {
  #NOTE: $geno is NOT increased by 1 (as it was in earlier versions)
  if (!is.data.frame(XYdata) ||
      length(which(names(XYdata) %in% c("MarkerName", "SampleName", "X", "Y")))
      != 4)
    stop("XYdata must be a data.frame with at least columns MarkerName, SampleName, X and Y")
  XYcolumns <- match(c("MarkerName", "SampleName", "X", "Y", "R", "ratio"),
                     names(XYdata))
  XYcolumns <- XYcolumns[!is.na(XYcolumns)]
  XYdata <- XYdata[, XYcolumns]
  if (!is.data.frame(scores) ||
      length(which(names(scores) %in% c("MarkerName", "SampleName", "geno")))
      != 3)
    stop("scores must be a data.frame with at least columns MarkerName, SampleName and geno")
  scores <- data.frame(MarkerName=scores$MarkerName,
                       SampleName=scores$SampleName,
                       geno=scores$geno)
  alldat <- merge(XYdata, scores, by=c("MarkerName", "SampleName"), all=TRUE)
  alldat$MarkerName <- factor(alldat$MarkerName)
  alldat$geno[alldat$SampleName %in% controls] <- NA
  alldat$SampleName <- factor(alldat$SampleName)
  alldat
} #combineFiles

# correctDosages *********************************************************
calcPotShifts <- function(segtypes, ploidy) {
  #calculate potential shifts in dosages
  #segtypes: vector of segregation types
  #result: data.frame with columns
  #  segtype: the segtype code before the underscore i.e. without the
  #           first dosage class
  #  mindos: the first dosage class
  #  shift: -1, 0 or 1: potental shifts to try out
  # for each segtype whether a shift of +/- 1 should be
  #tried or not (0)
  #(only when number of dosages in segtype is <= ploidy-2 and
  #min.dosage==1 or max.dosage==ploidy-1)
  shift <- rep(0, length(segtypes))
  segtypes[is.na(segtypes) | segtypes==""] <- "x_0" #avoid problems with strsplit
  segtypes <- do.call(rbind, strsplit(as.character(segtypes), split="_"))
  numdos <- nchar(segtypes[,1]) #number of dosage classes in segtype
  compl <- substr(segtypes[,1],1,1) == "c"
  lencompl <- as.integer(substr(segtypes[compl, 1], 2, numdos[compl]-1))
  #           nr of dosages in complex segtype
  numdos[compl] <- lencompl
  mindos <- as.integer(segtypes[,2])
  #shift down: numdos <= ploidy-2 and lowest dosage = 1:
  shift[numdos <= ploidy-2 & mindos == 1] <- -1
  #shift up: numdos <= ploidy-2 and highest dosage = ploidy-1:
  maxdos <- mindos + numdos - 1
  shift[numdos <= ploidy-2 & maxdos == ploidy-1] <- 1
  data.frame(seg=segtypes[,1], mindos=mindos, shift=shift,
             stringsAsFactors=FALSE)
}

#'@title Check if dosage scores may have to be shifted
#'@description fitPoly sometimes uses a "shifted" model to assign dosage
#'scores (e.g. all samples are assigned a dosage one higher than the true
#'dosage). This happens mostly when there are only few dosages present
#'among the samples. This function checks if a shift of +/-1 is possible.
#'
#'@usage correctDosages(chk, scores, parent1, parent2, ploidy, ploidy2,
#'polysomic=TRUE, disomic=FALSE, mixed=FALSE, parentsScoredWithF1,
#'absent.threshold=0.04, outfile=NA)
#'
#'@param chk data frame returned by function checkF1 when called without
#'shiftmarkers
#'@param scores data.frame with scores as produced by the fitMarkers
#'function of package fitPoly; at least columns MarkerName, SampleName and geno
#'must be present, any other columns are ignored
#'@param parent1 character vector with names of the samples of parent 1
#'@param parent2 character vector with names of the samples of parent 2
#'@param ploidy The ploidy of parent 1 (must be even, 2 (diploid) or larger).
#'@param ploidy2 The ploidy of parent 2. If omitted it is
#'assumed to be equal to ploidy.
#'@param polysomic if TRUE at least all polysomic segtypes are considered;
#'if FALSE these are not specifically selected (but if e.g. disomic is TRUE,
#'any polysomic segtypes that are also disomic will still be considered);
#'same as used in the call to checkF1 that generated data.frame chk
#'@param disomic if TRUE at least all disomic segtypes are considered (see
#'param polysomic); same as used in the call to checkF1 that generated
#'data.frame chk
#'@param mixed if TRUE at least all mixed segtypes are considered (see
#'param polysomic). A mixed segtype occurs when inheritance in one parent is
#'polysomic (random chromosome pairing) and in the other parent disomic (fully
#'preferential chromosome pairing); same as used in the call to checkF1 that
#'generated data.frame chk
#'@param parentsScoredWithF1 single logical. TRUE means that parents are scored
#'in the same experiment and the same fitPoly run as the F1, else FALSE.
#'If missing and ploidy2==ploidy, TRUE is assumed.
#'If FALSE, parental scores will not be shifted along with the F1 scores.
#'@param absent.threshold the threshold for the fraction of ALL samples
#'that has the dosage that is assumed to be absent due to mis-fitting of
#'fitPoly; should be at least the assumed error rate of the fitPoly scoring
#'assuming the fitted model is correct
#'@param outfile file name to which the result is written. If NA no file
#'is written.
#'
#'@return a data frame with columns
#'\itemize{
#'\item{markername}
#'\item{segtype: the bestfit (not bestParentfit!) segtype from chk}
#'\item{parent1, parent2: the consensus parental dosages; possibly
#'low-confidence, so may be different from those reported in chk}
#'\item{shift: -1, 0 or 1: the amount by which this marker should be shifted}
#'}
#'The next fields are only calculated if shift is not 0:
#'\itemize{
#'\item{fracNotOk: the fraction of ALL samples that are in the dosage
#'(0 or ploidy) that should be empty if the marker is indeed shifted.}
#'\item{parNA: the number of parental dosages that is missing (0, 1 or 2)}
#'}
#'@details A shift of -1 (or +1) is proposed when (1) the fraction of all
#'samples with dosage 0 (or ploidy) is below absent.threshold, (2) the
#'bestfit (not bestParentfit!) segtype in chk has one empty dosage on the
#'low (or high) side and more than one empty dosage at the high (or low) side,
#'and (3) the shifted consensus parental dosages do not conflict with the
#'shifted segregation type.\cr
#'The returned data.frame (or a subset, e.g. based on the values in the
#'fracNotOk and parNA columns) can serve as parameter shiftmarkers in a
#'new call to checkF1.\cr
#'Based on the quality scores assigned by checkF1 to
#'the original and shifted versions of each marker the user can decide if
#'either or both should be kept. A data.frame combining selected rows
#'of the original and shifted versions of the checkF1 output (which may
#'contain both a shifted and an unshifted version of some markers) can then be
#'used as input to compareProbes or writeDosagefile.

#'@export
correctDosages <- function(chk, scores, parent1, parent2, ploidy, ploidy2,
                           polysomic=TRUE, disomic=FALSE, mixed=FALSE,
                           parentsScoredWithF1,
                           absent.threshold=0.04, outfile=NA) {
  #This function identifies markers that are probably misscored by fitPoly
  #(bestfit in 1_1, 1_3, 11_1, 11_2 (if ploidy==4) and parents in same dosage
  #classes;
  # note that 1_1 and 1_3 can only occur if disomic=TRUE in check.F1)
  #and produce data to suggests a revised scoring for F1 and parent, based on
  #the segregation in an F1 population and the scores in all other samples
  if (missing(ploidy2) || is.na(ploidy2)) {
    ploidy2 <- ploidy
    if (missing(parentsScoredWithF1)) parentsScoredWithF1 <- TRUE
  }  else
    if (ploidy2 %% 2 != 0) stop("checkF1: odd ploidy2 not allowed")
  ploidyF1 <- (ploidy + ploidy2) / 2

  if (is.na(outfile)) outfile <- ""
  if (outfile != "" && !checkFilename(outfile)) {
    stop("correctDosages: invalid outfile")
  }
  #if (ploidy %% 2 != 0)
  #  stop("correctDosages: only even ploidy allowed (parents must have been genotyped together with F1")
  if (sum(is.na(match(c(parent1, parent2), names(chk)))) > 0)
    stop("correctDosages: not all parental samples in names of chk")
  if ("shift" %in% names(chk))
    stop("correctDosages: chk must be calculated without shift")
  seginfo <- segtypeInfoSummary(calcSegtypeInfo(ploidy, ploidy2))
  seginfo <- seginfo[(polysomic & seginfo$par.poly==1) |
                       (disomic & seginfo$par.di==1) |
                       (mixed & seginfo$par.mixed==1),]
  chk <- chk2integer(chk)
  if (is.factor(chk$bestfit)) chk$bestfit <- as.character(chk$bestfit)
  NAcol <- which(names(chk) == "F1_NA") # last before parental samples

  #calculate a (possibly very low conf) new parental consensus:
  #(the one in chk may be based partly on bestParentfit)
  for (p in 1:2) {
    if (p==1) parent <- parent1 else parent <- parent2
    pc <- which(names(chk) == "parent1") - 1 + p
    chk[,pc] <- rep(NA, nrow(chk))
    if (length(parent) > 0) {
      #for calculating the consensus score (pcons) we need to calculate the
      #parallel min and max over the columns of the parental samples;
      pcons <- apply(chk[,(NAcol+1):(NAcol+length(parent)), drop=FALSE],
                     1, min) #no na.rm=TRUE !!
      rw <- !is.na(pcons) &  pcons ==
        apply(chk[,(NAcol+1):(NAcol+length(parent)), drop=FALSE], 1, max)
      chk[rw, pc] <- pcons[rw] #so the "old" parent1 and parent2 columns now
      #                         have a new type of consensus score
      NAcol <- NAcol + length(parent) #startpoint for parent2
    }
  }
  checksamp <- as.character(unique(scores$SampleName))
  if (!parentsScoredWithF1)
    checksamp <- setdiff(checksamp, c(parent1, parent2))
  scores <- scores[scores$MarkerName %in% as.character(chk$MarkerName) &
                     scores$SampleName %in% checksamp,]
  scores$MarkerName <- factor(scores$MarkerName) #drop unused levels
  nonNA <- function(x) sum(!is.na(x))
  is0 <- function(x) sum(x==0, na.rm=TRUE)
  isploidy <- function(x) sum(x==ploidyF1, na.rm=TRUE)
  scores.nonNA <- tapply(scores$geno, scores$MarkerName, FUN=nonNA)
  scores.frc0 <- tapply(scores$geno, scores$MarkerName, FUN=is0)
  scores.frc0 <- scores.frc0 / scores.nonNA
  scores.frcploidy <- tapply(scores$geno, scores$MarkerName, FUN=isploidy)
  scores.frcploidy <- scores.frcploidy / scores.nonNA
  scores.frc0 <- scores.frc0[order(names(scores.frc0))] #needed!
  scores.frcploidy <- scores.frcploidy[order(names(scores.frcploidy))] #needed!
  #these are now in alphabetical order; do the same with the chk data frame:
  ch.ord <- order(chk$MarkerName) #we need this later
  chk <- chk[ch.ord,]
  potShifts <- calcPotShifts(chk$bestfit, ploidyF1) #potential shift value of
  #                                                  segtype for each marker
  #for which rows do we need to check a potential shift?
  mf <- !is.na(potShifts$shift) &
    ( (potShifts$shift == -1 & scores.frc0 <= absent.threshold) |
        (potShifts$shift == 1 & scores.frcploidy <= absent.threshold) )
  if (parentsScoredWithF1) {
    shfpar <- cbind(chk$parent1 + potShifts$shift,
                    chk$parent2 + potShifts$shift)[mf,]
  } else shfpar <- cbind(chk$parent1, chk$parent2)[mf,]
  # (non)shifted parental consensus dosages, only for the mf markers
  shf_segtype <- paste(potShifts[mf, 1],
                       (potShifts[mf, 2] + potShifts[mf, 3]), sep= "_")
  #    shifted segtype: original first dosage + shift, only for the mf markers

  match_p_seg <- function(segtype, p1, p2) {
    # for one segtype: do the parental consensus dosages (p12) match it?
    if (is.na(p1) && is.na(p2)) return(TRUE)
    si <- seginfo[seginfo$segtype==segtype, 3:4] #parental dosage columns
    if (is.na(p1)) return(p2 %in% si[,2])
    #p1 now not NA
    if (is.na(p2)) return(p1 %in% si[,1])
    return(any(si[,1]==p1 & si[,2]==p2))
  }

  goodshifts <- as.logical(mapply(FUN=match_p_seg,
                                  shf_segtype, shfpar[,1], shfpar[,2]))
  #       as.logical needed because mapply returns an empty list if mf all FALSE
  shiftmf <- rep(0, sum(mf))
  shiftmf[goodshifts] <- potShifts$shift[mf][goodshifts]
  p1namf <- p2namf <- onepokmf <- bothpokmf <- bothpnamf <- rep(NA, sum(mf))
  p1namf[goodshifts] <- is.na(shfpar[goodshifts, 1])
  p2namf[goodshifts] <- is.na(shfpar[goodshifts, 2])
  nmrk <- nrow(chk)
  P1na = rep(NA, nmrk)
  P2na = rep(NA, nmrk)
  result <- data.frame(
    MarkerName = chk$MarkerName,
    segtype = chk$bestfit,
    parent1 = chk$parent1,
    parent2 = chk$parent2,
    shift = rep(0, nmrk),
    fracNotOk = rep(NA, nmrk) #fraction of all samples in the forbidden class
    #                           (0 or ploidyF1)
  )
  result$shift[mf] <- shiftmf
  #the next columns will get non-NA values only for accepted shifts
  result$fracNotOk[result$shift == -1] <- scores.frc0[result$shift == -1]
  result$fracNotOk[result$shift == 1] <- scores.frcploidy[result$shift == 1]
  P1na[mf] <- p1namf
  P2na[mf] <- p2namf
  result$ParNA <- P1na + P2na #0, 1 or 2

  #sort result into same order as original chk:
  result <- result[order(ch.ord),]
  if (outfile != "")
    writeDatfile(result, outfile)
  invisible(result)
} #correctDosages

#Functions to recover from a fitMarkers crash when the temporary batchfiles
#(RData files) are still there:

#'@title get the names of all batch files present
#'@description This function gets the names of all batch files already produced
#'by fitMarkers; used to recover data after a crash.
#'@usage getBatchFiles(filePrefix, set)
#'@param filePrefix the same filePrefix as used in the fitMarkers call
#'@param set one of "models" or "scores"
#'@return a vector of batch file names, either the scores set or the models set
#'@export
getBatchFiles <- function(filePrefix, set) {
  #filePrefix is the same as that for fitMarkers, including path if needed
  #set should be one of "models" or "scores"
  #result: a list of all temporary batch-filenames for either scores or models
  allfiles <- dir()
  fname_start <- paste(filePrefix, "_sMM", set, sep="")
  len <- nchar(fname_start)
  sort(allfiles[substr(allfiles,1,len) == fname_start])
} #getBatchFiles

#'@title Construct the log, models and scores files from a set of batch files
#'@description Get the data saved by fitMarkers in batch files before it
#'crashed and construct the log, models and score files just as fitMarkers
#'would have done
#'@usage concatbatch(batchfiles)
#'@param batchfiles a vector of batch file names as returned by getBatchFiles
#'@details all batch files are assumed to contain a list with the same number
#'of element, each either a character vector or a data.frame. This function
#'concatenates the elements across the batch files.\cr
#'This may be useful if fitMarkers has already been running a long time
#'and then crashed. The partial (or perhaps complete) data can then be
#'recovered from the saved batch files, so that only the remaining markers
#'(if any) need to be processed afterwards.
#'@return a list with the concatenated elements: character vectors and/or
#'data.frames
#'@export
concatbatch <- function(batchfiles) {
  #batchfiles is a vector of filenames as returned by getBatchFiles
  #each of the files in batchfiles are assumed to contain a list; in each case
  #with the same elements (data.frames and/or vectors)
  #The result is a similar list where each element is a concatenation
  #(with rbind() or c()) of the list elements from the batchfiles
  #These can then be saved to files, after sorting if desired. They contain
  #the same information as the output files of fitMarkers, but only for
  #the completed batches.
  vars <- load(batchfiles[1])
  if (length(vars) != 1) stop("batchfiles not supported")
  allbat <- get(vars)
  if (!is.list(allbat)) stop("batchfiles not supported")
  for (f in 2:length(batchfiles)) {
    print(paste("f =", f))
    vf <- load(batchfiles[f])
    if (length(vf) != 1) stop("batchfiles different types")
    nwbat <- get(vf)
    if (!is.list(nwbat) || !identical(names(allbat), names(nwbat)))
      stop("batchfiles different types")
    for (i in 1:length(nwbat)) {
      if (is.data.frame(nwbat[[i]])) {
        allbat[[i]] <- rbind(allbat[[i]], nwbat[[i]])
      } else if (is.vector(nwbat[[i]]))
        allbat[[i]] <- c(allbat[[i]], nwbat[[i]])
    }
  }
  allbat
}

Try the fitPoly package in your browser

Any scripts or data that you put into this service are public.

fitPoly documentation built on April 3, 2025, 8:58 p.m.