Nothing
#'fitPolyTools: a package with functions related to package fitPoly
#'
#'Package fitPolyTools offers some functions for preparing and selecting
#'input data for package fitPoly, and for inspection and analysis
#'of polyploid dosage calls as produced by fitPoly. These analyses focus
#'on the segregation of dosages in an F1 progeny of two parents. The functions
#'can be grouped as follows:
#'
#'@section data preparation functions:
#'Functions readFullDataTable and readAxiomSummary convert files generated by
#'Illumina's GenomeStudio and Affymetrix Power Tools software to the input
#'format for fitPoly::fitPoly and fitPoly::fitMarkers.
#'
#'@section data visualization functions:
#'Functions XYplot and drawXYplots draw scatterplots of the samples for one
#'marker at a time, possibly with colors indicating the assigned dosages and/or
#'highlighting specific samples. Function combineFiles combines X and Y columns
#'from one file or data.frame with column geno from another.
#'
#'@section segregation type functions:
#'The functions calcSegtypeInfo, selSegtypeInfo and segtypeInfoSummary
#'produce information on the possible F1 segregation types. These functions
#'are primarily developed for use in other functions but are also available
#'for users.
#'
#'@section identification of the probable segregation types:
#'Function checkF1 gives for all markers the most likely (best fitting)
#'segregation type, based on the F1 and the parental samples. Function
#'correctDosages checks for possibly "shifted" markers (e.g. all dosages
#'scored one lower or higher than the true dosage).
#' Function checkF1 can apply shifts when specified.
#'
#'@section combining data from two probes for the same SNP:
#'On Affymetrix Axiom SNP arrays it is possible to have the same SNP
#'interrogated from both sides by two different probes. Function compareProbes
#'compares the results from both probes and produces a new, merged marker if
#'they are sufficiently similar. It also writes the results to a data file
#'suitable for linkage mapping functions (package polymapR).\cr
#'Function removeRedundant takes the result of compareProbes; if a merged
#'marker is present it removes the corresponding separate probe markers.
#'
#'@section producing a data file for mapping:
#'Function writeDosagefile produces a data file suitable for linkage mapping
#'functions (package polymapR). This data file has the same format as that
#'produced by compareProbes and removeRedundant (see previous section).\cr
#'Function F1Dosages2Matrix reads such a dosages file and returns a matrix
#'in the format required by the polymapR package.
#'
#'@section reformatting of fitPoly scores file:
#'Function scores2wide reformats the scores file produced by the
#'fitMarkers function of the fitPoly package to "wide" format.
#'
#'@section recovery functions:
#'Functions getBatchfiles and concatbatch allow to recover results from already
#'processed batches of markers, in case fitMarkers crashes.
#'
#'@section general tools:
#'These tools may be useful by themselves and are therefore made available too.
#'They include leftstr and rightstr: two functions that return substrings from
#'the left or right side, and function checkFilename which tests if a file
#'with the name can be created. Functions readDatfile and writeDatfile are
#'wrappers for read.table and write.table with different default options,
#'suitable for the file formats expected or produced by functions in this
#'package oand fitPoly.
#'
#'@docType package
#'@name fitPolyTools
NULL
# Roeland Voorrips, Wageningen UR - Plant Breeding
# fitPolyTools contains functions for
# - conversion of data files from array software formats to fitPoly input format
# - drawing XY-scatterplots with or without genotype dosage colors
# - checking dosage scores against expected F1 segregation patterns
# - detecting and correcting "shifted" markers
# - combining scores from two probes (assays) of the same SNP
# - writing genotype files for F1 populations
# - recovering data from temporary files of fitPoly::fitMarkers
# in case it crashes
#'@importFrom grDevices col2rgb dev.off dev.list png rgb
#'@importFrom graphics layout lines plot points
#'@importFrom stats chisq.test pbinom quantile
#'@importFrom utils read.table write.table
#'@title A data set containing SNP array data
#'@description A data frame containing SNP array signal intensity data
#'@details XYdat contains the following columns:
#'\itemize{
#' \item MarkerName: factor, names of the markers (SNPs)
#' \item SampleName: factor names of the individuals / DNA samples
#' \item X: numeric, the signal intensities of the X allele
#' \item Y: numeric, the signal intensities of the Y allele
#' \item R: numeric, the total signal intensities; R = X + Y
#' \item ratio: numeric, the ratios of the Y signal to the total signal
#' intensity; ratio = Y / R
#'}
#'@name XYdat
#'@usage data(XYdat)
#'@docType data
NULL
#'@title A data set with dosage scores generated by fitPoly
#'@description A data frame as saved by function fitMarkers in package
#'fitPoly
#'@details scores contains the following columns:
#'\itemize{
#' \item marker: numeric, the sequential number of the markers in the
#' fitMarkers input file
#' \item MarkerName: factor, names of the markers (SNPs)
#' \item SampleName: factor names of the individuals / DNA samples
#' \item ratio: numeric, the ratios of the Y signal to the total signal
#' intensity
#' \item P0 ... P4: numeric, the probabilities that the observed signal ratio
#' reflects a true Y allele dosage of 0 ... 4, according to the selected
#' mixture model
#' \item maxgeno: numeric, the dosage with the maximum probability
#' \item maxP: numeric, the maximum probability, i.e. the probability of maxgeno
#' \item geno: numeric, equal to maxgeno if maxP was above the specified threshold
#' (p.threshold=0.99), else NA
#'}
#'@name scores
#'@usage data(scores)
#'@docType data
NULL
# Read data from SNP arrays **************************************************
#'@title convert a GenomeStudio FullDataTable file to the import format
#'for fitPoly
#'@description A GenomeStudio file in wide format (samples side-by-side) is
#'converted to a fitPoly input file in long format
#'@usage readFullDataTable(filename, rawXY=FALSE,
#'markergroups=list(), out, filetype=c("dat","RData")[2])
#'@param filename name of a FullDataTable tab-separated text file exported from
#'Illumina's GenomeStudio. The file must contain a column "Name" with the
#'marker names, and for each sample a pair of columns "sample.X" and "sample.Y"
#'if rawXY is FALSE, or "sample.X raw" and "sample.Y raw" (note the space)
#'if rawXY is TRUE. Further columns may be present but are not read.
#'@param rawXY if FALSE (default) the normalized .X and .Y columns are read;
#'if TRUE the "raw" columns (.X raw and .Y raw) are read instead.
#'@param markergroups a list with character vectors of marker names, or integer
#'vectors of marker numbers in file order. If the data set is large, the
#'conversion to long format may exceed memory limits. In these cases the data
#'can be split into marker groups that are converted separately and each saved
#'to a separate file. If the list is empty (default) all markers are converted
#'as one block.
#'@param out the name of an output file (without extension). If a list of
#'markergroups is given, out must be a valid file name (without extension);
#'in that case multiple output files are created with filenames in which
#'the list element numbers are appended to out. If no markergroups are specified
#'out may also be set to "" or NA; in that case no file is created and the
#'converted data are only returned as function result.\cr
#'If out is not "" or NA, then also a file <out>_meanR.dat is saved with for
#'all samples their mean R value and number of missing data (over all markers)
#'@param filetype either "dat" or "RData" (default): the former produces
#'tab-separated text files, the latter saves RData files with the converted
#'data in a data frame with name "dat".
#'@details The wide-format input is converted to a long-format form with
#'columns MarkerName, SampleName, X, Y, R (= X + Y) and ratio (= Y / R).
#'The X and Y signal intensities are obtained from the <sample>.X and <sample>.Y
#'columns in the input data (or from the <sample>.X raw and <sample>.Y raw
#'columns if rawXY is TRUE). R and ratio are calculated from these values
#'and not read from the input data.
#'@return If no markergroups are specified, a data.frame is returned with
#'columns MarkerName, SampleName, X and Y (also if raw data are read, the
#'column names are X and Y), R (= X + Y), ratio (= Y / (X+Y) ).\cr
#'If a list of markergroups is specified the function result is NULL
#'and the converted data.frames are only saved as files.\cr
#'If the saved files are RData files, they all contain one data.frame named
#'"dat".
#'@export
readFullDataTable <- function(filename, rawXY=FALSE,
markergroups=list(),
out, filetype=c("dat","RData")[2]) {
#Here, in contrast to readAxiomSummary, we don't allow to pass a data.frame
#instead of the filename, because we need to decide which columns to read
#test input:
if (!is.list(markergroups)) {
stop("markergroups is not list")
}
if (missing(out)) stop("error in readFullDataTable: out must be specified")
if (length(out) != 1) stop("out must be one filename or NA")
if (!is.na(out)) {
if (!is.character(out)) {
stop("out must be one filename")
}
if (is.na(out) || out == "") {
out <- NA
if (length(markergroups) > 0) {
stop("when markergroups is not empty, out must be a valid filename")
}
} else {
if (!check.filename(paste(out,"714x996y8.xxx",sep="_"))) {
stop("out is not a valid filename")
}
}
}
filetype <- tolower(filetype)
if (!is.na(out) && filetype != "dat" && filetype != "rdata") {
stop("filetype must be 'dat' or 'RData'")
}
#first we read the caption line, to select the columns to read and to
#get the unmodified sample names:
captions <- scan(file=filename, what="character", allowEscapes=F, nlines=1,
quiet=TRUE, sep="\t")
namecol <- which(captions == "Name")
nch <- nchar(captions)
if (rawXY) {
right <- substr(captions, nch-5, nch)
Xcols <- which(right == ".X raw")
Ycols <- which(right == ".Y raw")
samplenames <- substr(Xcols, 1, nchar(Xcols) - 6)
if (length(Xcols) == 0 && length(Ycols) == 0) {
#try again without space
right <- substr(captions, nch - 4, nch)
Xcols <- which(right == ".Xraw")
Ycols <- which(right == ".Yraw")
samplenames <- substr(Xcols, 1, nchar(Xcols) - 5)
}
if (length(Xcols) == 0 || length(Ycols) == 0 ||
length(Xcols) != length(Ycols)) {
stop("for each sample columns 'sample.X raw' and 'sample.Y raw' must be present")
}
} else {
right <- substr(captions, nch-1, nch)
Xcols <- which(right == ".X")
Ycols <- which(right == ".Y")
samplenames <- substr(captions[Xcols], 1, nch[Xcols] - 2)
if (length(Xcols) == 0 || length(Ycols) == 0 ||
length(Xcols) != length(Ycols)) {
stop("for each sample columns 'sample.X' and 'sample.Y' must be present")
}
}
minXY <- min(Ycols - Xcols)
maxXY <- max(Ycols - Xcols)
if (maxXY != minXY) {
stop("columns must be in the same order for each sample")
}
if (minXY > 0) XY <- c("X", "Y") else XY <- c("Y", "X")
#next we read the selected columns:
colClasses <- rep("NULL", length(captions))
colClasses[namecol] <- "character"
colClasses[c(Xcols, Ycols)] <- "numeric"
dat_wide <- readDatfile(filename, colClasses=colClasses)
dat_wide <- dat_wide[dat_wide[1] != "",]
#convert to long format:
Rfile <- paste(out,"meanR.dat", sep="_")
na.count <- function(x) { sum(is.na(x)) }
if (length(markergroups) == 0) {
#no groups, all markers in one go:
result <- wide2longGS(GSdata=dat_wide, samplenames, XY,
out, rdata=(filetype=="rdata"))
if (!is.na(out)) {
sumsR <- tapply(result$R, INDEX=result$SampleName, FUN=sum, na.rm=TRUE)
nasR <- tapply(result$R, INDEX=result$SampleName, FUN=na.count)
meanR <- data.frame(SampleName=names(sumsR),
meanR=sumsR / (nrow(dat_wide) - nasR),
missing=nasR)
writeDatfile(meanR, Rfile)
}
return(invisible(result))
} else {
#convert per marker group:
sumsR <- NULL
for (mg in seq_along(markergroups)) {
if (is.numeric(markergroups[[mg]])) {
#markergroups[[mg]] has the row numbers of this group
subdat <- dat_wide[markergroups[[mg]],]
} else {
#markergroups[[mg]] has the marker names of this group
subdat <- dat_wide[dat_wide[[1]] %in% markergroups[[mg]],]
}
fname <- paste(out, padded(mg, length(markergroups)), sep="_")
dat <- wide2longGS(GSdata=subdat, samplenames, XY,
fname, rdata=(filetype=="rdata"))
if (is.null(sumsR)) {
sumsR <- tapply(dat$R, INDEX=dat$SampleName, FUN=sum, na.rm=TRUE)
nasR <- tapply(dat$R, INDEX=dat$SampleName, FUN=na.count)
} else {
sumsR <- rowSums(cbind(sumsR,
tapply(dat$R, INDEX=dat$SampleName, FUN=sum, na.rm=TRUE)),
na.rm=TRUE)
nasR <- nasR + tapply(dat$R, INDEX=dat$SampleName, FUN=na.count)
}
}
meanR <- data.frame(SampleName=names(sumsR),
meanR=sumsR/(length(unlist(markergroups))-nasR),
missing=nasR)
writeDatfile(meanR, Rfile)
if (length(result) > 0) {
if (!is.null(names(markergroups))) {
names(result) <- names(markergroups)
}
return(invisible(result))
} else return(invisible(NULL))
}
} #readFullDataTable
wide2longGS <- function(GSdata, samplenames, XY,
outputfilename, rdata) {
#helper function for readFullDataTable, converts one wide data frame to
#one long data frame.
#GSdata: a wide data frame from a GenomeStudio FullDataTable, containing
#only the markers in this group with X and Y signal intensities (marker names
#in column 1, 2 columns with X and Y signals per sample)
#samplenames: all sample names in order of columns of GSdata
#XY: c("X", "Y") or c("Y", "X"), depending on order per sample
#outputfilename: the complete output file name, or NA
#rdata: TRUE or FALSE
#Function result: the same data as a long data frame, with columns R and ratio
#we want:
#MarkerName SampleName X Y
#M1 S1 111 112
#M2 S1 211 212
#M3 S1 311 312
#M1 S2 121 122
#M2 S2 221 222
#M3 S2 321 322
#from GenomeStudio we have:
#marker S1.X S1.Y S2.X S2.Y
#M1 111 112 121 122
#M2 211 212 221 222
#M3 311 312 321 322
#so we take 2 columns at a time and put them under each other
#(Note that the result of wide2longGS is sorted first by SampleName
# and within sample by MarkerName, in the result of wide2longAxiom
# it is the other way round)
markernames <- GSdata[,1]
GSdata <- as.matrix(GSdata[,-1])
nmrk <- nrow(GSdata)
nsamp <- ncol(GSdata) / 2
if (length(samplenames) != nsamp) stop("wide2longGS: samplenames don't match GSdata")
dat <- matrix(0.0, nrow=nmrk * nsamp, ncol=2)
if (nmrk > 0 && nsamp > 0) {
for (s in 1:nsamp) {
dat[(1 + (s-1) * nmrk) : (s * nmrk),] <-
GSdata[, (1 + (s-1) * 2) : (s * 2)]
}
}
dat <- data.frame(
MarkerName=rep(markernames, times=nsamp),
SampleName=rep(samplenames, each=nmrk),
dat)
#if Y appears before X, exchange:
names(dat)[3:4] <- XY
if (XY[2] == "X") dat <- dat[,c(1, 2, 4, 3)]
dat$R <- dat$X + dat$Y
dat$ratio <- dat$Y / dat$R
#write file:
if (!is.na(outputfilename)) {
if (rdata) {
save(dat, file=paste(outputfilename, ".RData", sep=""))
} else {
writeDatfile(dat, file=paste(outputfilename, ".dat", sep=""))
}
}
dat
} #wide2longGS
#'@title convert an Affymetrix AxiomCT1.summary file to the import format
#'for fitPoly
#'@description An Affymetrix AxiomCT1.summary.txt file in wide format (samples
#'side-by-side) is converted to a fitPoly input file in long format
#'@usage readAxiomSummary(AXdata, markergroups=list(),
#'out, filetype=c("dat", "RData")[2])
#'@param AXdata name of an AxiomCT1.summary.txt tab-separated text file
#'exported from Affymetrix Power Tools (APT); or a data.frame in the same
#'format:\cr
#'the file may start with a number of comment lines (starting with "#")
#'followed by a header line of which the first caption is "probeset_id" and all
#'other captions are sample names, all ending with ".CEL". For each marker there
#'are two lines, with probeset_id ending in "-A" and "-B"; the two lines for
#'each marker are consecutive. In addition there may be any number of control
#'markers, for each of which there is only one line, with probeset_id ending
#'in "-A".
#'@param markergroups a list with character vectors of marker names, or integer
#'vectors of marker numbers in file order. If the data set is large, the
#'conversion to long format may exceed memory limits. In these cases the data
#'can be split into marker groups that are converted separately and each saved
#'to a file (or returned as list items, but if that is possible splitting into
#'groups would not be needed)
#'@param out the name of an output file (without extension). If a list of
#'markergroups is given, out must be a valid file name (without extension);
#'in that case multiple output files are created with filenames in which
#'the list element numbers are appended to out. If no markergroups are specified
#'out may also be set to "" or NA; in that case no file is created and the
#'converted data are only returned as function result.\cr
#'If out is not "" or NA, then also a file <out>_meanR.dat is saved with for
#'all samples their mean R value and number of missing data (over all markers)
#'@param filetype either "dat" or "RData" (default): the former produces
#'tab-separated text files, the latter saves RData files with the converted
#'data in a data frame with name "dat"
#'@details The wide-format input is converted to a long-format form with
#'columns MarkerName, SampleName, X, Y, R (= X + Y) and ratio (= Y / R).
#'The X and Y signal intensities are obtained from the <marker>-A and <marker>-B
#'rows in the input data, R and ratio are calculated from these values.
#'@return If no markergroups are specified, a data.frame is returned with
#'columns MarkerName, SampleName, X, Y, R (= X + Y), ratio (= Y / (X+Y) ).\cr
#'If a list of markergroups is specified the function result is NULL
#'and the converted data.frames are only saved as files.\cr
#'If the saved files are RData files, they all contain one data.frame named
#'"dat".
#'@export
readAxiomSummary <- function(AXdata,
markergroups=list(),
out,
filetype=c("dat", "RData")[2]) {
#test input:
if (!is.list(markergroups)) {
stop("markergroups is not list")
}
if (missing(out)) stop("out must be specified")
if (length(out) != 1) stop("out must be one filename or NA")
if (!is.na(out)) {
if (!is.character(out)) {
stop("out must be one filename")
}
if (is.na(out) || out == "") {
out <- NA
} else {
if (!check.filename(paste(out,"714x996y8.xxx",sep="_"))) {
stop("out is not valid")
}
}
}
filetype <- tolower(filetype)
if (!is.na(out) && filetype != "dat" && filetype != "rdata") {
stop("error in readAxiomSummary: filetype must be 'dat' or 'RData'")
}
if (is.character(AXdata)) {
#filename: read file, trim ".CEL" from sample names
#find the caption line and read sample names
#(dashes and other unallowed characters in names are converted to dots in
#read.table)
con <- file(AXdata, "r")
repeat {
s <- readLines(con, n=1)
if (is.null(s) || length(s)==0) {
stop("could not read AXdata file, or end reached without finding caption line")
}
s <- strsplit(s, split="\t")
if (length(s) > 0) {
s <- gsub("(^ +)|( +$)", "", s[[1]], fixed=TRUE)
# a.o. remove leading & trailing whitespace
if (nchar(s[1]) > 0 && substr(s[1], 1, 1) != "#") break;
}
}
close(con)
if (s[1] != "probeset_id") {
stop("caption line doesn't start with probeset_id")
}
#remove the .CEL suffixes from the sample names if they are there:
samplenames <- s[2:length(s)]
nc <- nchar(samplenames)
cel <- substr(samplenames,nc-3, nc) == ".CEL"
samplenames[cel] <- substr(samplenames[cel], 1, nc-4)
#read the total file, skipping the initial comment lines:
#knowing markercount does not help because of the presence of very many
#control markers (DQC) after the last real marker
dat_wide <- read.table(AXdata, comment.char="#", sep="\t", header=TRUE,
nrows=-1)
dat_wide[,1] <- as.character(dat_wide[,1])
dat_wide <- dat_wide[dat_wide[,1]!="",]
nc <- nchar(dat_wide$probeset_id)
if (anyNA(nc) || any(nc < 3))
stop("some data lines have a probeset_id shorter than 3 characters")
lastchar <- substring(dat_wide$probeset_id, nc)
# old version:
# # exclude the DQC lines (the block of lines with probeset_id's all ending
# # with "-A"):
# dif <- lastchar[1:(length(lastchar)-1)] != lastchar[2:length(lastchar)]
# # at the bottom: if ending in -A it is a DQC, else the B-line of a real mrk:
# dif[length(lastchar)] <- lastchar[length(lastchar)] == "B"
# # when a block of DQC follows a block of real markers: ok
# # when a block of real markers follows a block of DQC: also ok
# dat_wide <- subset(dat_wide, substr(probeset_id, 1, 4) != "DQC-")
# dat_wide <- dat_wide[dif,]; lastchar <- lastchar[dif]#
# new version 20190221: don't assume that the DQC lines are all at the
# bottom;
# instead, look for all probeset_id with -B that have the same probeset_id
# with -A on the preceding line:
Blines <- setdiff(which(lastchar == "B"), 1)
Alines <- Blines - 1
Alines <- Alines[lastchar[Alines] == "A"]
Blines <- Alines + 1
probnames <- substr(dat_wide$probeset_id, 1, nc-2) # without -A or -B suffix
if (!all(probnames[Alines] == probnames[Blines]))
stop("some A/B line pairs have a different probeset name")
dat_wide <- dat_wide[sort(c(Alines, Blines)),]
dat_wide <- dat_wide[substr(dat_wide$probeset_id, 1, 4) != "DQC-",]
rm(lastchar, probnames, Alines, Blines)
} else if (is.data.frame(AXdata)) {
if (names(AXdata)[1] != "probeset_id")
stop("invalid AXdata data frame")
#have (only) the correct rows been read?
ok <- FALSE
if (nrow(AXdata) %% 2 == 0) {
nm <- as.character(AXdata[,1]); nc <- nchar(nm)
arows <- seq(1, nrow(AXdata), 2)
ok <- all(substr(nm[arows],nc[arows], nc[arows]) == "A") &&
all(substr(nm[arows+1], nc[arows+1], nc[arows+1]) == "B")
rm(nm, arows)
}
if (!ok) stop("AXdata should only contain alternating A and B rows")
dat_wide <- AXdata
samplenames <- names(dat_wide)[-1]
} else stop("readAxiomSummary: AXdata must be file name or data frame")
#test the markernames:
if (nrow(dat_wide) %% 2 != 0) {
stop("readAxiomSummary: odd number of marker lines")
}
nmrk <- nrow(dat_wide) / 2
anames <- as.character(dat_wide[,1][rep(c(TRUE, FALSE), times=nmrk)])
nch <- nchar(anames)
err <- which(substr(anames, nch-1, nch) != "-A")
if (length(err) != 0) stop("readAxiomSummary: not all odd probeset-id end with '-A'")
anames <- substr(anames, 1, nch-2) #strip off the "-A"
bnames <- as.character(dat_wide[,1][rep(c(FALSE, TRUE), times=nmrk)])
nch <- nchar(bnames)
err <- which(substr(bnames, nch-1, nch) != "-B")
if (length(err) != 0) stop("readAxiomSummary: not all even probeset-id end with '-B'")
bnames <- substr(anames, 1, nch-2) #strip off the "-B"
err <- which (anames != bnames)
if (length(err) != 0) stop("readAxiomSummary: not all even probeset-id's are paired")
markernames <- anames; rm(nch, err, anames, bnames)
Rfile <- paste(out,"meanR.dat", sep="_")
na.count <- function(x) { sum(is.na(x)) }
#convert to long format:
#samplenames <- names(dat_wide[2:length(dat_wide)])
if (length(markergroups) == 0) {
#no groups, all markers in one go:
result <- wide2longAxiom(AXdata=dat_wide, markernames, samplenames,
out, rdata=(filetype=="rdata"))
if (!is.na(out)) {
sumsR <- tapply(result$R, INDEX=result$SampleName, FUN=sum, na.rm=TRUE)
nasR <- tapply(result$R, INDEX=result$SampleName, FUN=na.count)
meanR <- data.frame(SampleName=names(sumsR),
meanR=sumsR/(length(markernames)-nasR),
missing=nasR)
writeDatfile(meanR, Rfile)
}
return(invisible(result))
} else {
#convert per marker group:
sumsR <- NULL
result <- list()
for (mg in seq_along(markergroups)) {
if (is.numeric(markergroups[[mg]])) {
#markergroups[[mg]] has the row numbers of this group
mrknrs <- sort(c(2 * markergroups[[mg]] - 1,
2 * markergroups[[mg]]))
subdat <- dat_wide[mrknrs,]
submarkernames <- markernames[sort(markergroups[[mg]])]
} else {
#markergroups[[mg]] has the marker names of this group
mrknrs <- which(markernames %in% markergroups[[mg]])
submarkernames <- markernames[mrknrs] #now in order of markernames
mrknrs <- sort(c(2 * mrknrs - 1, 2 * mrknrs))
subdat <- dat_wide[mrknrs,]
}
fname <- paste(out, padded(mg, length(markergroups)), sep="_")
dat <- wide2longAxiom(AXdata=subdat, submarkernames, samplenames,
fname, rdata=(filetype=="rdata"))
if (is.null(sumsR)) {
sumsR <- tapply(dat$R, INDEX=dat$SampleName, FUN=sum, na.rm=TRUE)
nasR <- tapply(dat$R, INDEX=dat$SampleName, FUN=na.count)
} else {
sumsR <- rowSums(cbind(sumsR,
tapply(dat$R, INDEX=dat$SampleName, FUN=sum, na.rm=TRUE)),
na.rm=TRUE)
nasR <- nasR + tapply(dat$R, INDEX=dat$SampleName, FUN=na.count)
}
}
meanR <- data.frame(SampleName=names(sumsR),
meanR=sumsR/(length(unlist(markergroups))-nasR),
missing=nasR)
writeDatfile(meanR, Rfile)
return(invisible(NULL))
}
} #readAxiomSummary
wide2longAxiom <- function(AXdata, markernames, samplenames,
outputfilename, rdata) {
#helper function for readAxiomSummary, converts one wide data frame to
#one long data frame.
#AXdata: data frame or matrix in wide format. If a data frame, then in format
# of an AxiomGT1.summary file, containing probeset_id in first column
# and samples in the other columns; for each probe only the markers
# in this group with X and Y signal intensities (2 rows per
# SNP, probeset_id-A or -B in column 1 (not used, use markernames instead)
# If matrix: same format but without the probeset_id column
#samplenames: all sample names in order of columns of AXdata
#outputfilename: the complete output file name, or NA
#rdata: TRUE or FALSE
#Function result: the same data as a long data frame, with columns R and ratio
#we want:
#MarkerName SampleName X Y
#M1 S1 111 112
#M1 S2 121 122
#M2 S1 211 212
#M2 S2 221 222
#M3 S1 311 312
#M3 S2 321 322
#from Axiom we have:
#probeset_id S1 S2
#M1-A 111 121
#M1-B 112 122
#M2-A 211 221
#M2-B 212 222
#M3-A 311 321
#M3-B 312 322
#so we take 2 rows at a time, t(ranspose) them and put them under each other
#(Note that the result of wide2longAxiom is sorted first by MarkerName
# and within marker by SampleName, in the result of wide2longGS
# it is the other way round)
if (is.data.frame(AXdata))
AXdata <- as.matrix(AXdata[,-1])
nmrk <- nrow(AXdata) / 2
nsamp <- ncol(AXdata)
if (length(markernames) != nmrk) stop("wide2longAxiom: markernames don't match AXdata")
if (length(samplenames) != nsamp) stop("wide2longAxiom: samplenames don't match AXdata")
dat <- matrix(0.0, nrow=nmrk * nsamp, ncol=2)
colnames(dat) <- c("X", "Y")
if (nmrk > 0 && nsamp > 0) {
for (r in 1:nmrk) {
dat[(1 + (r-1) * nsamp) : (r * nsamp),] <-
t(AXdata[(1 + (r-1) * 2) : (r * 2), ])
}
}
dat <- data.frame(
MarkerName=rep(markernames, each=nsamp),
SampleName=rep(samplenames, times=nmrk),
dat)
dat$R <- dat$X + dat$Y
dat$ratio <- dat$Y / dat$R
#write file:
if (!is.na(outputfilename)) {
if (rdata) {
save(dat, file=paste(outputfilename, ".RData", sep=""))
} else {
writeDatfile(dat, file=paste(outputfilename, ".dat", sep=""))
}
}
dat
} #wide2longAxiom
removeNArows <- function(x, column, blanks=TRUE) {
#x is a data.frame, column is one column number
#return value: the same data.frame with the rows removed where the
#column has NA or (if blanks is TRUE) only whitespace.
#If this column is a factor or character, it will be returned as
#a character with no starting or trailing whitespace
if (blanks) {
if (is.factor(x[,column])) x[,column] <- as.character(x[,column])
if (is.character(x[,column])) {
x[,column] <- gsub("(^ +)|( +$)", "", x[,column], fixed=TRUE)
x[x[,column] == "", column] <- NA
}
}
x[!is.na(x[,column]),]
}
check.sampletable <- function(sampletable,
SampleID, CustomerID, Ploidy,
SampleID_levels) {
#used by splitNrenameSamples
#sampletable: a data frame with at least the columns for SampleID and
# CustomerID, and for Ploidy if this is not NULL
#SampleID, CustomerID, Ploidy: the actual names of these columns in
# sampletable, or NULL (only allowed for Ploidy)
#SampleID_levels: the levels of SampleID that occur in the marker data;
# all of these must also occur in the SampleID column of
# sampletable
#return value: a string (an error message) if there are problems,
# else a vector with the column numbers of SampleID, CustomerID,
# Ploidy in sampletable
sampidcol <- which(names(sampletable) == SampleID)
if (length(sampidcol) != 1)
return("sampletable should have one SampleID column")
custidcol <- which(names(sampletable) == CustomerID)
if (length(custidcol) != 1)
return("sampletable should have one CustomerID column")
#do all sample names in marker data also occur in SampleID column of sampletable?
SampleID_levels <- as.character(SampleID_levels)
sampletable[,sampidcol] <- as.character(sampletable[,sampidcol])
if (anyNA(match(SampleID_levels, sampletable[, sampidcol])))
return("not all SampleNames in dat occur in sampletable column SampleID")
if (is.null(Ploidy)) {
ploidycol <- NA
} else {
ploidycol <- which(names(sampletable) == Ploidy)
if (length(ploidycol) != 1)
return("sampletable should have one Ploidy column")
sampletable <- removeNArows(sampletable, ploidycol)
}
if (sum(is.na(sampletable[, c(sampidcol, custidcol)])) > 0)
return("missing data in SampleID or CustomerID columns")
#are all SampleID and CustomerID for sampletable unique?
sampletable[, custidcol] <- as.character(sampletable[, custidcol])
if (length(unique(sampletable[, sampidcol])) != nrow(sampletable))
return("duplicate names in SampleID column")
if (length(unique(sampletable[, custidcol])) != nrow(sampletable))
return("duplicate names in CustomerID column")
result <- c(sampidcol, custidcol, ploidycol)
names(result) <- c("sampidcol", "custidcol", "ploidycol")
result
} #check.sampletable
#'@title Rename samples from array codes to user codes, and split diploid
#'from polyploid samples
#'@description After conversion of array data to fitPoly format, this function
#'replaces the array codes for the samples to user codes, and it splits the
#'data into a part for diploid and one for polyploid samples if both are present.
#'@usage splitNrenameSamples(dat, sampletable, SampleID, CustomerID,
#'Ploidy=NULL, out, filetype=c("dat","RData")[2])
#'@param dat a data frame in "long" format as returned by readFullDataTable or
#'readAxiomSummary
#'@param sampletable a data frame with at least the columns for SampleID,
#'CustomerID and Ploidy (these columns may have any name, specified by the
#'next parameters)
#'@param SampleID the title of the column in sampletable containing the codes
#'for the samples in the array file - there must be a one-to-one relation
#'between the array codes and the user codes!
#'@param CustomerID the title of the column in sampletable containing the
#'user codes for the samples - there must be a one-to-one relation between the
#'array codes and the user codes!
#'@param Ploidy the title of the column in sampletable containing the ploidy
#'of the samples; default NULL means no split based on ploidy must be done. If not
#'NULL, for each different ploidy a separate data.frame will be created and
#'samples with ploidy missing or "" will be omitted from all files.
#'@param out the base filename of the output files; to this will be appended the
#'the ploidy level(s) if Ploidy is not NULL, and an extension .dat or .RData,
#'depending on filetype). Setting out to NA or "" results in no output file(s)
#'being created
#'@param filetype either "dat" or "RData" (default): the former produces tab-
#'separated text files (one for each element of the return value),
#'the latter saves one RData file containing a list named dat, identical
#'to the return value
#'@return if Ploidy not specified, the original dat data.frame with substituted
#'SampleNames; if Ploidy specified a list with one
#'element for each different value in the Ploidy column; the names of the
#'elements are then these ploidy values. Each element is a subset of the
#'original dat data.frame containing only the rows with samples of that ploidy,
#'and with substituted SampleNames
#'@export
splitNrenameSamples <- function(dat, sampletable,
SampleID, CustomerID, Ploidy=NULL,
out, filetype=c("dat","RData")[2]) {
if (!(leftstr(tolower(filetype), 3) %in% c("dat", "rda")))
stop("invalid filetype")
samplecols <-
check.sampletable(sampletable, SampleID, CustomerID, Ploidy,
SampleID_levels=as.character(unique(dat$SampleName)))
if (is.character(samplecols))
stop(samplecols)
sampcol <- samplecols[1]; customercol <- samplecols[2]; ploidycol <- samplecols[3]
if (missing(out)) stop("out must be specified")
if (length(out) != 1) stop("out must be one filename or NA")
if (!is.na(out)) {
if (!is.character(out)) {
stop("out must be one filename")
}
if (is.na(out) || out == "") {
out <- NA
} else {
if (!check.filename(paste(out,"714x996y8.xxx",sep="_"))) {
stop("out is not valid")
}
}
}
if (!is.na(ploidycol)) {
sampletable <- removeNArows(sampletable, ploidycol)
ploi <- sort(unique(sampletable[, ploidycol]))
nas <- anyNA(ploi)
ploi <- ploi[!is.na(ploi)]
cat(paste("sampletable contains ploidy levels", paste(ploi, collapse=" ")))
if (nas) cat(" and also contains missing ploidy values")
cat("\n")
cat("the samples will be split per ploidy value")
if (nas) cat(" and samples without ploidy will be omitted")
cat("\n")
# reply <- readline(prompt="Proceed (Y/N) ?")
# reply <- tolower((reply))
# if (!(reply %in% c("y", "yes"))) stop("splitNrenameSamples: user abort")
}
#checks against NA or duplicates in SampleID and CustomerID are already done in
#check.sampletable, and also that all dat$SampleName values are in SampleID
#rename the dat$SampleName from SampleID to CustomerID:
dat$SampleName <- factor(dat$SampleName)
sampnames <- levels(dat$SampleName)
levels(dat$SampleName) <-
sampletable[match(sampnames, sampletable[, sampcol]), customercol]
#split different ploidy levels:
if (is.null(Ploidy)) {
#no split
result <- dat
} else {
result <- list()
for (p in seq_along(ploi)) {
samp <- sampletable[sampletable[,samplecols[3]] == ploi[p], samplecols[2]]
result[[p]] <- dat[dat$SampleName %in% samp,]
result[[p]]$SampleName <- factor(result[[p]]$SampleName) #drop unused levels
names(result)[p] <- ploi[p]
}
}
dat <- result
#save file(s):
if (!is.na(out)) {
if (leftstr(tolower(filetype), 3) == "rda") {
save(dat, file=paste(out, "RData", sep="."))
} else {
for (p in seq_along(dat)) {
fname <- paste(out, "_", names(result)[p], ".dat", sep="")
writeDatfile(dat[[p]], fname)
}
}
}
invisible(dat)
} #splitNrenameSamples
# check distribution of R per SNP *********************************************
#'@title Calculate statistics of R per marker
#'@description Calculate the min, max, mean and 50% and 95% quantiles of R
#'for each marker
#'@usage calcRstats(datpoly, out)
#'@param datpoly a data frame in long format (input format for fitPoly)
#'with at least columns MarkerName, SampleName, R
#'@param out name of output file; if NA no file is written. File contains
#'the same data as the return value
#'@details The data.frame returned by this function is used as input for
#'function selMarkers_byR
#'@return a data frame with columns MarkerName, mean, min, q50, q95, max
#'which are all statistics of the R values per marker
#'@export
calcRstats <- function(datpoly, out) {
stats <- function(x) { #function stats nested within calcRstats
y <- x[!is.na(x)]
result <- c(mean(y), quantile(y, probs=c(0, 0.50, 0.95, 1), names=FALSE))
}
Rdist <- tapply(datpoly$R, INDEX=datpoly$MarkerName, FUN=stats)
Rdist <- do.call(rbind, Rdist)
colnames(Rdist) <- c("mean", "min", "median", "q95", "max")
Rdist <- data.frame(MarkerName=rownames(Rdist), Rdist)
rownames(Rdist) <- NULL
if (!is.na(out) && out!="")
writeDatfile(Rdist, out)
invisible(Rdist)
} #calcRstats
#'@title Select markers at specified R levels
#'@description Select markers based on their R statistics, for studying
#'the relation between R level and marker quality
#'@usage selMarkers_byR(Rstats, Rlevels, mrkperlevel=1, stat="q95")
#'@param Rstats a data frame as returned by calcRstats
#'@param Rlevels a vector of R levels in increasing order (preferably from
#'the minimum to the maximum of stat over all markers in Rstats)
#'@param mrkperlevel number, default 1: the number of markers to select at
#'each of the levels in Rlevels
#'@param stat the name of one of the statistics columns in Rstats
#'@details The return value of this function is intended to be used for
#'studying the relation between the marker quality and the value of the
#'chosen R statistic, e.g. by drawing XY-plots of each of the selected
#'markers. By finding a suitable threshold for the R statistic bas markers
#'could be excluded from evaluation by fitPoly, saving time because bad
#'markers take the longest to be scored and are then often rejected anyway.
#'@return a selection from Rstats, in ascending order of column stat.
#'The first <mrkperlevel> markers with stat above each Rlevel are returned;
#'duplicated selections are removed (so the number of returned markers may be
#'less than length(Rlevels) * mrkperlevel)
#'@export
selMarkers_byR <- function(Rstats, Rlevels, mrkperlevel=1, stat="q95") {
#Rstats: a data frame as returned by calcRstats
#Rlevels: a numeric vector with the levels of R at which markers should be
#selected
#mrkperlevel: the number of markers to select per level
#stat: the column name of the R statistic to use
#return value: a selection from Rstats, in ascending order of column stat.
# The first <mrkperlevel> markers with stat above each Rlevel
# are returned; duplicated selections are removed (so the
# number of returned markers may be less than
# length(Rlevels) * mrkperlevel)
statcol <- which(names(Rstats) == stat)
if (length(statcol) != 1)
stop(paste("column ", stat, "missing"))
Rstats <- Rstats[order(Rstats[,statcol]),]
Rlevels <- sort(Rlevels)
firstsel <- integer(0)
for (i in 1:length(Rlevels)) {
seli <- which(Rstats[,statcol] >= Rlevels[i])
if (length(seli) == 0) firstsel[i] <- NA else
firstsel[i] <- min(seli)
}
sel <- integer(0)
for (i in 1:mrkperlevel) sel <- c(sel, firstsel + i - 1)
sel <- sort(sel[!is.na(sel)])
sel <- sel[sel <= nrow(Rstats)]
sel <- sel[!duplicated(sel)]
Rstats[sel,]
} #selMarkers_byR
# XY-plotting *****************************************************************
get.genocol.old <- function() {
#return value: a set of 6 pastel colors to represent dosages,
#from light red (nulliplex) to light blue (quadruplex) and grey for missing
c(rgb(red=255,green=170,blue=170,maxColorValue=255), #1: light red
rgb(red=255,green=211,blue=155,maxColorValue=255), #2: "burlywood1", lightred-orange
rgb(red=250,green=187,blue=255,maxColorValue=255), #3: light bluish magenta
rgb(red=152,green=245,blue=255,maxColorValue=255), #4: "cadetblue1", light bluish cyan
rgb(red=180,green=180,blue=255,maxColorValue=255), #5: light blue
"grey") #6: NA
} #get.genocol.old
#'@title Generate a set of genotype (dosage) colors for XY-plots
#'@description For each of the dosages 0 to ploidy a color is generated,
#'and an additional one for samples with no dosage assigned.
#'@usage get.genocol(ploidy, pastel=TRUE)
#'@param ploidy any ploidy level >= 2
#'@param pastel if TRUE (default) light (pastel) colors are generated, else more
#'intense colors
#'@details the colors range from red (first item, dosage 0) via blue
#'(dosage ploidy/2) to green (dosage ploidy); the color for missing dosage is
#'grey
#'@return A vector of <ploidy + 2> color values; items 1 .. ploidy+1 for
#'dosage 0 .. ploidy and item ploidy+2 for missing dosages
#'@export
get.genocol <- function(ploidy, pastel=TRUE) {
#return value: a set of (ploidy+1) colors to represent dosages,
#from red (item 1, nulliplex) via blue (dosage ploidy/2) to green (item ploidy+1, dosage=ploidy), and grey (item ploidy+1) for
#missing values
#This order of colors was chosen because the transition colors between
#red and green, which are not very clear, are avoided
minint <- ifelse(pastel, 160, 0) #minimum intensity per color for pastels
oddploidy <- ploidy %% 2 == 1
if (oddploidy) ploidy <- 2 * ploidy
RGB <- matrix(0, nrow=ploidy+2, ncol=3)
colnames(RGB) <- c("red", "green", "blue")
#RGB[1,] <- c( 255 , minint, minint) #red, dosage=0
#RGB[ploidy/2+1,] <- c(minint/2, 255, minint/2) #green, dosage=ploidy/2
#RGB[ploidy+1,] <- c(minint , minint, 255) #blue, dosage=ploidy
RGB[1,] <- c(255, minint, minint) #red, dosage=0
RGB[ploidy/2+1,] <- c(minint , minint, 255) #blue, dosage=ploidy/2
RGB[ploidy+1,] <- c(minint/2, ifelse(pastel, 255, 200), minint/2) #green, dosage=ploidy
RGB[ploidy+2,] <- col2rgb("grey") #grey, dosage NA
if (ploidy > 2) for (p in 2:(ploidy/2)) {
#between nulliplex (red) and ploidy/2 (green): orange-brownish
fact <- (p-1)/(ploidy/2)
RGB[p,] <- RGB[1,] + round(fact * (RGB[ploidy/2+1,] - RGB[1,]))
}
if (ploidy > 2) for (p in (ploidy/2+2):(ploidy)) {
#between ploidy/2 (green) and ploidy (blue): cyan-ish
fact <- (p-(ploidy/2+1))/(ploidy/2)
RGB[p,] <- RGB[ploidy/2+1,] + round(fact * (RGB[ploidy+1,] - RGB[ploidy/2+1,]))
}
if (oddploidy) RGB <- RGB[c(seq(1, ploidy+1, by=2), nrow(RGB)),]
result <- apply(RGB, 1, function(x) rgb(red=x[1], green=x[2], blue=x[3],
maxColorValue=255))
} #get.genocol
calc.R.threshold <- function(Rquantile, Rfactor=0.5, lowR=0) {
#Rquantile: a vector of R quantile values per marker (e.g. 95% quantiles)
#R threshold is Rfactor * Rquantile, but at least lowR
#This function is used by makeFitPolyFiles and drawXYplots
result <- Rfactor * Rquantile
result[result<lowR] <- lowR
result
} #calc.R.threshold
#'@title Draws an XY-plot showing allele signals and assigned dosages
#'@description Draws an XY-plot for one markers showing the X and Y signals
#'of each sample and their assigned dosages
#'@usage XY_plot(title="", XYdat, shift=0, ploidy=NULL,
#'genocol="grey", pch=1, cex=1,
#'sel.samples=as.character(unique(XYdat$SampleName)), omit.pch=".",
#'omit.col=NULL, sample.groups=list(), groups.col="black", groups.pch=1,
#'groups.cex=1, groups.rnd=FALSE, R.thresholds=NA, R.col="black", R.lty=1)
#'@param title the main title above the plot
#'@param XYdat a data frame with at least columns SampleName, X and Y;
#'column geno (if present) is also used. Contains data for all samples, one SNP.
#'geno is the dosage (0 .. <ploidy>), with NA or <ploidy+1> for missing dosage
#'info
#'@param shift a single integer, default 0: by how much should the geno
#'be shifted?
#'@param ploidy a single integer specifying the ploidy; default NA, only
#'needed if there is a column geno in XYdat
#'@param genocol a vector of color values to be used for plotting
#'the sel.samples according on their geno (dosage) value; if only one value is
#'given (default) all samples are plotted in that color
#'@param pch the plot character to plot the sel.samples; default 1 is an
#'open circle
#'@param cex the relative size of the sample symbols
#'@param sel.samples character vector (not a factor) with the names of the
#'important samples: those that must be plotted in colors genocol and symbol pch
#'@param omit.pch the plot character to use for the other samples, default a dot
#'@param omit.col vector of two colors to use for the other samples; the first
#'color is used for sample with a geno (dosage) value, the second color for
#'unscored samples; recycled if needed, with NA (default) the non-selected
#'samples are not plotted
#'@param sample.groups a list specifying samples to be highlighted in a
#'different color and/or symbol and/or size. For each group of samples the list
#'has one vector of sample names; the list may also be empty
#'@param groups.col a vector or color values, one for each item (vector of
#'sample names) in sample.groups; recycled if shorter than sample.groups
#'@param groups.pch a vector of plot symbols, one for each item in
#'sample.groups; recycled if shorter than sample.groups
#'@param groups.cex a vector of relative symbol sizes, one for each item in
#'sample.groups; recycled if shorter than sample.groups
#'@param groups.rnd FALSE (default) or TRUE. If FALSE, all samples in
#'sample.groups are drawn in the order in which they appear in XYdat; if TRUE
#'they are drawn in a random order. Note that the samples are never drawns in
#'group order (except is the samples are already in group order in XYdat and
#'groups.rnd=FALSE).
#'@param R.thresholds a vector of thresholds for R to plot; is NA (default)
#'no R thresholds are plotted
#'@param R.col a vector of color values for drawing R thresholds, one for
#'each value of R.thresholds; recycled if needed
#'@param R.lty a vector of line types for drawing the R thresholds, one for
#'each value of R.thresholds; recycled if needed
#'@return The function produces an XY-plot and returns NULL
#'@export
XY_plot <- function(title="", #the main title above the plot
XYdat, #the data, with or without geno column, for 1 SNP,
shift=0,
ploidy=NULL,
genocol="grey", #if a vector with colors then it should have
# colors for geno 0 .. ploidy and one color for
# missing geno values; i.e. its length should
# be either 1 or ploidy+2
pch=1, #point character, 1 is open circle
cex=1, #relative size, 1 is default
sel.samples=as.character(unique(XYdat$SampleName)), #the important samples
omit.pch=".", #the pch for the other samples
omit.col=NULL, #the colors for the other samples
sample.groups=list(), #groups of samples to highlight
groups.col="black", #a color for each group
groups.pch=1, #a symbol for each group
groups.cex=1, #relative size of symbols for each group
groups.rnd=FALSE, #randomize order in which group samples are drawn
R.thresholds=NA, #thresholds for R (default must ne NA, not NULL!)
R.col="black", #colors of the R thresholds
R.lty=1) { # line type of R thresholds(1=solid, 2=dashed, 3=dotted)
#XYdat: a data frame with at least columns SampleName, X and Y; column geno
# (if present) is also used. Contains data for all samples, one SNP.
# geno is the dosage (0 is nulliplex .. <ploidy> is <ploidy>plex),
# with NA or <ploidy+1> for missing dosage info
#Produces a single XY plot of signal intensities, for one marker.
#First all samples are plotted either all in one color (grey by default) or
#with different colors for each dosage (if a column geno is present in XYdat)
#On top of this, several groups of samples can be plotted (again), with one
#color per group; e.g. to highlight parents, subpopulations etc)
#Also the symbol and the size of the symbol to print for the backgound
#samples and the samples in each group can be specified.
#Finally different R threshold values can be indicated by lines, of which the
#color and line type can be specified.
max.xy <- max(c(XYdat$X, XYdat$Y), na.rm=TRUE)
if (is.na(max.xy) || max.xy <= 0) {
plot(0~0, col="white", main=title, xlim=c(0,1), ylim=c(0,1))
} else {
if ("geno" %in% names(XYdat)) {
if (is.null(ploidy)) stop("XY_plot: ploidy must be specified")
mvgeno <- ploidy+1
if (!all(XYdat$geno %in% c(0:mvgeno, NA)))
stop("XY_plot: geno contains invalid values")
XYdat$geno[XYdat$geno == mvgeno] <- NA
XYdat$geno <- shiftdosages(XYdat$geno, shift, ploidy)
XYdat$geno[is.na(XYdat$geno)] <- mvgeno
genocol <- rep(genocol, length.out=ploidy+2)
} else {
#no geno column
mvgeno <- length(genocol) - 1 #the last value in genocol represents missing genotype
XYdat$geno <- rep(mvgeno, nrow(XYdat)) #all unknown
}
#at this point there are no samples with geno NA left,
#all those that were NA have been assigned mvgeno
if (is.null(omit.col)) {
#then we make all genotyped omitted samples black and the
#non-genotyped omitted samples grey:
omit.col <- c(rep("black", length(genocol)-1), "grey")
} else omit.col <- rep(omit.col, length.out=length(genocol))
if (is.factor(XYdat$SampleName)) {
omit.samples <- setdiff(levels(XYdat$SampleName), sel.samples)
} else omit.samples <- setdiff(unique(XYdat$SampleName), sel.samples)
# first make the plot with all omitted samples in omit.col:
# (those with missing geno first, then those with nonmissing geno)
XY.nogeno <- XYdat[XYdat$geno == mvgeno,, drop=FALSE]
XY.geno <- XYdat[XYdat$geno < mvgeno,, drop=FALSE]
XYsub <- XY.nogeno[XY.nogeno$SampleName %in% omit.samples,, drop=FALSE]
plot(XYsub$Y ~ XYsub$X,
col=omit.col[XYsub$geno + 1], pch=omit.pch, cex=cex,
main=title,
xlim=c(0, max.xy), ylim=c(0, max.xy), xlab="X", ylab="Y")
XYsub <- XY.geno[XY.geno$SampleName %in% omit.samples,, drop=FALSE]
points(XYsub$Y ~ XYsub$X,
col=omit.col[XYsub$geno + 1], pch=omit.pch, cex=cex)
# next make the plot with all non-omitted samples in genocol:
# (those with missing geno first, then those with nonmissing geno)
XYsub <- XY.nogeno[XY.nogeno$SampleName %in% sel.samples,, drop=FALSE]
points(XYsub$Y ~ XYsub$X,
col=genocol[XYsub$geno + 1], pch=pch, cex=cex)
XYsub <- XY.geno[XY.geno$SampleName %in% sel.samples,, drop=FALSE]
points(XYsub$Y ~ XYsub$X,
col=genocol[XYsub$geno + 1], pch=pch, cex=cex)
#next, draw the groups of samples
if (is.list(sample.groups) &&
length(sample.groups) > 0) {
groups.col <- rep(groups.col, length.out=length(sample.groups))
groups.pch <- rep(groups.pch, length.out=length(sample.groups))
groups.cex <- rep(groups.cex, length.out=length(sample.groups))
# old version: draw groups successively, last group on top
#for (grp in seq_along(sample.groups)) {
#XYsub <- XYdat[XYdat$SampleName %in% sample.groups[[grp]],]
#points(XYsub$Y ~ XYsub$X,
# col=groups.col[grp], pch=groups.pch[grp], cex=groups.cex[grp])
#}
# new version 20180219: draw all at the same time
# (order of points is order in the XYdat data frame,
# there is no group order with one obscuring the other)
XYsub <- XYdat[XYdat$SampleName %in% unlist(sample.groups),, drop=FALSE]
samp.col <- samp.pch <- samp.cex <- numeric(nrow(XYsub))
for (grp in seq_along(sample.groups)) {
grpsamp <- XYsub$SampleName %in% sample.groups[[grp]]
samp.col[grpsamp] <- groups.col[grp]
samp.pch[grpsamp] <- groups.pch[grp]
samp.cex[grpsamp] <- groups.cex[grp]
}
samporder <- seq_len(nrow(XYsub))
if (groups.rnd) {
samporder <- sample(samporder)
}
points(XYsub$Y[samporder] ~ XYsub$X[samporder], col=samp.col[samporder],
pch=samp.pch[samporder], cex=samp.cex[samporder])
}
#next, draw the R.threshold lines at X + Y = R.threshold:
R.col <- rep(R.col, length.out=length(R.thresholds))
R.lty <- rep(R.lty, length.out=length(R.thresholds))
for (i in 1:length(R.thresholds)) {
lines(c(0, R.thresholds[i]) ~ c(R.thresholds[i], 0),
col=R.col[i], lty=R.lty[i]) #no line if R.threshold[i] is NA
}
}
invisible(NULL)
} #XY_plot
#'@title Draws a series of pages, each with 6 XY-plots showing allele signals
#'and assigned dosages
#'@description Draws 6 XY-plots per page for a series of markers; each XY-plot is
#'drawn by function XY_plot
#'@usage drawXYplots(dat, markers=NA, out, genocol="grey", pch=1, cex=1,
#'sel.samples=as.character(unique(XYdat$SampleName)), omit.pch=".",
#'omit.col=c(rep("black", length(genocol)-1), "grey"), sample.groups=list(),
#'groups.col="black", groups.pch=1, groups.cex=1, groups.rnd=FALSE,
#'R.col="black", R.lty=1, drawRthresholds=FALSE,
#'Rthreshold.param=c(0.95, 0.5, 0), ploidy)
#'@param dat a data.frame with at least columns MarkerName, SampleName, X and Y;
#'column geno (if present) is also used
#'@param markers either a vector with names of markers to plot or a
#'data.frame with at least column MarkerName containing the names of the markers
#'to plot; default NA means all markers in dat. If markers is a data.frame that
#'also has a column shift, the geno values in dat will be shifted accordingly.
#'This allows to use (a selection from) the output of checkF1 (with or without
#'parameter shiftmarkers) as input for drawXYplots.
#'@param out base for filenames of output, will be extended with
#'_pagenumber.png; may include a path, but the directory where the plot files
#'are to be saved must already exist
#'@param genocol a vector of color values to be used for plotting
#'the sel.samples according on their geno (dosage) value; if only one value is
#'given (default) all samples are plotted in that color
#'@param pch the plot character to plot the sel.samples; default 1 is an
#'open circle
#'@param cex the relative size of the sample symbols
#'@param sel.samples character vector (not a factor) with the names of the
#'important samples: those that must be plotted in colors genocol and symbol pch
#'@param omit.pch the plot character to use for the other samples, default a dot
#'@param omit.col vector of two colors to use for the other samples; the first
#'color is used for sample with a geno (dosage) value, the second color for
#'unscored samples
#'@param sample.groups a list specifying samples to be highlighted in a
#'different color and/or symbol and/or size. For each group of sample the list
#'has one vector of sample names; the list may also be empty
#'@param groups.col a vector or color values, one for each item (vector of
#'sample names) in sample.groups; recycled if shorter than sample.groups
#'@param groups.pch a vector of plot symbols, one for each item in
#'sample.groups; recycled if shorter than sample.groups
#'@param groups.cex a vector of relative symbol sizes, one for each item in
#'sample.groups; recycled if shorter than sample.groups
#'@param groups.rnd FALSE (default) or TRUE. If FALSE, all samples in
#'sample.groups are drawn in the order in which they appear in XYdat; if TRUE
#'they are drawn in a random order. Note that the samples are never drawns in
#'group order (except if the samples are already in group order in XYdat and
#'groups.rnd=FALSE).
#'@param R.col a vector of color values for drawing R thresholds, one for
#'each value of Rthreshold.param; recycled if needed
#'@param R.lty a vector of line types for drawing the R thresholds, one for
#'each value of Rthreshold.param; recycled if needed
#'@param drawRthresholds FALSE (default) or TRUE: whether R thresholds should be
#'drawn
#'@param Rthreshold.param either a list of vectors each of length 3 (for
#'multiple R thresholds) or one vector of length 3 (for one R threshold). Each
#'vector defines one R threshold and is based on a specified quantile of the
#'distribution of R values for the current marker. The first number in each
#'vector is the R quantile, the second is a number to multiply that R quantile
#'with, and the third is the minimum value of the result. The default of
#'c(0.95, 0.5, 0) means that 0.5 * the 95% quantile of R is used (which is
#'always higher than 0, the minimum result). This is often a good cut-off
#'value to discard samples, or to signal markers with many samples below that
#'value.
#'@param ploidy a single integer specifying the ploidy, only needed if
#'dat contains a column geno
#'@return The function produces a series of pages with plots and returns NULL
#'@export
drawXYplots <- function(dat, markers=NA, out,
genocol="grey", #colors for dosage 0-ploidy, NA=ploidy+1
pch=1, #point character, 1 is open circle
cex=1, #relative size, 1 is default
sel.samples=as.character(unique(XYdat$SampleName)),
omit.pch=".", #point character for omitted samples
omit.col=c(rep("black",length(genocol)-1),"grey"),
sample.groups=list(), #groups of samples to highlight
groups.col="black", #a color for each group
groups.pch=1, #a symbol for each group
groups.cex=1, #relative size of symbols for each group
groups.rnd=FALSE, #randomize order in which group samples are drawn
R.col="black", #colors of the R thresholds
R.lty=1,
drawRthresholds=FALSE,
Rthreshold.param=c(0.95, 0.5, 0),
ploidy) {
#Draws a series of pages with 6 XYplots each, one XYplot per marker in
#markers. The individual plots are drawn by function XY_plot.
#dat: a data frame with at least columns MarkerName, SampleName, X and Y;
# column geno (if present) is also used.
#markers: a vector with names of markers to plot, or marker numbers;
# default NA means all
#out: base for filenames of output, will be extended with _pagenumber.png
#drawRthresholds: if TRUE, R threshold lines are drawn according to:
#Rthreshold.param: either a list of vectors each of length 3 (for multiple
# R thresholds) or one vector of length 3 (for one R threshold).
# the first element of the vector is the R quantile P-value,
# the 2nd and 3rd are Rfactor and lowR as in
# calc.R.threshold.
#The other parameters are as in XY_plot
#close devices - extra strong:
allDevOff()
allmarkers <- as.character(unique(dat$MarkerName))
if (length(allmarkers) == 0) stop("no markers in dat")
if ("geno" %in% names(dat) && missing(ploidy))
stop("ploidy must be specified")
if (is.data.frame(markers)) {
#markernames must be in a column MarkerName, and if there is a column
#shift, this shift is retained
#This new (20170113) alternative allows to supply a selection of rows from
#a checkF1 output to specify markers and possibly shifts
if (!("MarkerName" %in% names(markers)))
stop("no column 'MarkerName' in markers data.frame")
if (!is.character(markers$MarkerName) && !is.factor(markers$MarkerName))
stop("column Markername of markers must contain marker names")
notNA <- !is.na(markers$MarkerName)
if (sum(notNA) == 0) {
#no markers specified, use all:
mrkdf <- data.frame(MarkerName=allmarkers, shift=0, stringsAsFactors=FALSE)
} else {
if ("shift" %in% names(markers)) {
if (!("geno" %in% names(dat))) stop("markers has a column 'shift' but dat has no column 'geno'")
if (anyNA(markers$shift[notNA])) stop("no missing values allowed in column 'shift' of markers data.frame")
if (!is.numeric(markers$shift) || any(as.integer(markers$shift) != markers$shift))
stop("column 'shift' of markers data.frame must contain integer shift values")
if (!is.null(ploidy) && !all(markers$shift %in% (-ploidy):ploidy))
stop("invalid values in column 'shift' of markers data.frame")
mrkdf <- data.frame(MarkerName=trim_shf(markers$MarkerName[notNA]),
shift=markers$shift[notNA], stringsAsFactors=FALSE)
} else mrkdf <- data.frame(MarkerName=trim_shf(markers$MarkerName[notNA]),
shift=0, stringsAsFactors=FALSE)
}
} else {
#markers not a data.frame
if (is.list(markers) || !is.vector(markers))
stop("markers must be a character vector of marker names, or a data.frame with a column MarkerName")
#now markers is a vector
notNA <- !is.na(markers)
if (sum(notNA) == 0) {
#no markers specified, use all:
mrkdf <- data.frame(MarkerName=allmarkers, shift=0, stringsAsFactors=FALSE)
} else {
if (!is.character(markers) && !is.factor(markers))
stop("markers must contain marker names")
mrkdf <- data.frame(MarkerName=trim_shf(markers[notNA]), shift=0, stringsAsFactors=FALSE)
}
}
#now the mrkdf data.frame has been created; exclude markers not in allmarkers (but retain duplicated):
pres <- mrkdf$MarkerName %in% allmarkers
if (sum(pres) < nrow(mrkdf)) {
if (sum(pres) == 0) stop("none of the specified markers occurs in dat")
mrkdf <- mrkdf[pres,]
}
#create small subset of dat to select markers from:
if (nrow(mrkdf) < 0.5 * length(allmarkers)) {
dat <- dat[dat$MarkerName %in% mrkdf$MarkerName,]
}
maxpage <- ceiling(nrow(mrkdf) / 6)
m.start <- 1
m.end <- nrow(mrkdf)
nr_on_page <- 0
for (m in m.start:m.end) {
mrknr <- padded(m, m.end)
nr_on_page <- nr_on_page+1
if (nr_on_page == 7) nr_on_page <- 1
if (nr_on_page == 1) {
pagenr <- padded((m-1) %/% 6 + 1, maxpage)
png(paste(out, "_", pagenr, ".png", sep=""),
width=7.2, height=11.7, units="in", res=96)
layout(matrix(1:6, byrow=TRUE, ncol=2)) # 3*2 plots on page
}
XYdat <- dat[dat$MarkerName == mrkdf$MarkerName[m],]
if ("R" %in% names(XYdat)) XYR <- XYdat$R else XYR <- XYdat$X + XYdat$Y
if (drawRthresholds) {
if (!is.list(Rthreshold.param)) Rthreshold.param <- list(Rthreshold.param)
R.thresholds <- rep(NA, length(Rthreshold.param))
for (i in seq_along(Rthreshold.param)) {
Rquantile <- quantile(XYR, Rthreshold.param[[i]][1], na.rm=TRUE)
R.thresholds[i] <- calc.R.threshold(Rquantile,
Rthreshold.param[[i]][2],
Rthreshold.param[[i]][3])
}
}
else R.thresholds <- NA
XY_plot(title=paste(mrknr, mrkdf$MarkerName[m]), XYdat=XYdat,
shift=mrkdf$shift[m], ploidy=ploidy,
genocol=genocol, pch=pch, cex=cex,
sel.samples=sel.samples, omit.pch=omit.pch, omit.col=omit.col,
sample.groups=sample.groups, groups.col=groups.col,
groups.pch=groups.pch, groups.cex=groups.cex, groups.rnd=groups.rnd,
R.thresholds=R.thresholds, R.col=R.col, R.lty=R.lty)
#if last plot on page, close device:
if (nr_on_page == 6 || m >= m.end) {
#close devices - extra strong:
allDevOff()
}
} # for m
invisible(NULL)
} #drawXYplots
allDevOff <- function() {
#close all graphics devices - extra strong
#dvof <- 2
#while (!is.null(dvof) && dvof > 1) dvof <- dev.off()
while (!is.null(dev.list())) dev.off()
}
# fitPoly input and output files **********************************************
#'@title Make input files for fitPoly containing only selected rows and columns
#'@description Remove entire markers and ratios for samples within markers
#'from fitPoly inputfiles if their R levels are below some thresholds;
#'also keep only the MarkerName, SampleName, X, Y, R and ratio columns
#'@usage makeFitPolyFiles(datpoly, datdiplo=NA, out,
#'filetype=c("dat", "RData")[2], Rquantile=0.95, marker.threshold,
#'Rthreshold.param=c(0.95, 0.5, 0))
#'@param datpoly data.frame as produced by readFullDataTable, readAxiomSummary
#'or splitNrenameSamples, with at least columns MarkerName, SampleName, R and
#'ratio
#'@param datdiplo data.frame as produced by splitNrenameSamples, with at least
#'columns MarkerName, SampleName, R and ratio, containing the data for the
#'diploid samples; or NA (default)
#'@param out output files will be named out + _poly.RData and _diplo.RData
#'(or .dat if filetype is set to "dat"). If NA no files are written
#'@param filetype "RData" (default) or "dat": the format to save the data in
#'@param Rquantile a value between 0 and 1: the R (= X + Y) quantile on which
#'selection of markers is based
#'@param marker.threshold the minimum value of the Rquantile for a marker to be
#'selected
#'@param Rthreshold.param a vector of length 3, defining an R threshold for
#'selecting samples within each marker.
#'The first number is the R quantile, the second is a number to multiply that
#'R quantile with, and the third is the minimum value of the result.
#'The default of c(0.95, 0.5, 0) means that 0.5 * the 95% quantile of R is used
#'(which is always higher than 0, the minimum result). This is often a good
#'cut-off value to discard samples within a marker.
#'@details All rows for markers not meeting the marker.threshold for the
#'R quantile are removed from the original data.frames. For samples within
#'the remaining markers where the R value is below the marker-specific threshold
#'the ratio is set to NA, the row for that sample is not removed.
#'The same thresholds are applied to datpoly and datdiplo.
#'@return a list with two elements: datpoly is a filtered version of
#'parameter datpoly, and if param datdiplo was specified the second element
#'of the returned list is datdiplo: a filtered version of parameter datdiplo.
#'Either or both may have 0 rows left after filtering. If no datdiplo was
#'specified the second element of the list is NA.\cr
#'If parameter out is specified the element(s) of this list are also saved
#'as files.
#'@export
makeFitPolyFiles <- function(datpoly, datdiplo=NA,
out, filetype=c("dat", "RData")[2],
Rquantile=0.95, marker.threshold,
Rthreshold.param=c(0.95, 0.5, 0)) {
cols <- match(c("MarkerName", "SampleName", "R", "ratio"),
names(datpoly))
if (anyNA(cols))
stop("required columns missing from datpoly")
savediplo <- is.data.frame(datdiplo)
if (savediplo) {
cols <- match(c("MarkerName", "SampleName", "R", "ratio"),
names(datdiplo))
if (anyNA(cols))
stop("required columns missing from datdiplo")
}
out <- out[1]
if (!is.na(out)) {
testfilename <- ifelse(filetype=="dat",
paste(out,"poly.dat", sep="_"),
paste(out,"poly.RData", sep="_"))
if (!check.filename(testfilename))
stop("out is not valid")
}
quant <- function(x, Rquantile) { #function nested within make.fitPoly.files
quantile(x, probs=Rquantile, na.rm=TRUE, names=FALSE)
}
Rq <- tapply(datpoly$R, INDEX=datpoly$MarkerName, FUN=quant, Rquantile)
#remove all markers where Rq is below threshold,
#and retain only relevant columns:
#(works well even if marker names are not valid identifiers)
selmarkers <- names(Rq)[!is.na(Rq) & Rq >= marker.threshold]
#remove these markers from datpoly:
datpoly <- datpoly[datpoly$MarkerName %in% selmarkers,]
datpoly$MarkerName <- factor(datpoly$MarkerName) #drop unused levels
#remove same markers from Rq:
Rq <- Rq[!is.na(Rq) & Rq >= marker.threshold]
#calculate R threshold for samples within each remaining marker:
if (Rthreshold.param[1] == Rquantile) Sq <- Rq else {
Sq <- tapply(datpoly$R, INDEX=datpoly$MarkerName, FUN=quant,
Rquantile=Rthreshold.param[1])
}
Sq <- calc.R.threshold(Sq, Rthreshold.param[2], Rthreshold.param[3])
Tq <- Sq[match(datpoly$MarkerName, names(Sq))]
datpoly$ratio[datpoly$R < Tq] <- NA
result <- list(datpoly=datpoly)
if (savediplo) {
datdiplo <- datdiplo[datdiplo$MarkerName %in% selmarkers,]
datdiplo$MarkerName <- factor(datdiplo$MarkerName) #drop unused levels
Tq <- Sq[match(datdiplo$MarkerName, names(Sq))]
datdiplo$ratio[datdiplo$R < Tq] <- NA
result$datdiplo <- datdiplo
} else result$datdiplo <- NA
if (!is.na(out)) {
if (filetype=="dat") {
writeDatfile(datpoly, paste(out,"poly.dat", sep="_"))
if (savediplo) {
writeDatfile(datdiplo, paste(out,"diplo.dat", sep="_"))
}
} else {
#write RData files:
save(datpoly, file=paste(out,"poly.RData", sep="_"))
if (savediplo) {
save(datdiplo, file=paste(out,"diplo.RData", sep="_"))
}
}
}
invisible(result)
} #makeFitPolyFiles
# some general functions ******************************************************
check.filename <- function(filename, overwrite=TRUE) {
#Check if a filename is valid for writing by trying to create the file
#If the filename contains a path, result will only be TRUE if the whole
#path already exists
#overwrite: if TRUE (default) an existing file of that name will be deleted
# (if it is not protected by being locked, read-only, owned by
# another user etc)
filename <- filename[1] # we test only the first filename
if (!overwrite && file.exists(filename)) return(FALSE)
tryCatch(
{ suppressWarnings(
if (file.create(filename)) {
file.remove(filename)
TRUE
} else FALSE)
}, error = function(e) { FALSE }
)
} #check.filename
#'@title User-friendly wrapper for read.table
#'@description A wrapper for read.table that has default parameter values for
#'reading tab-separated files as used in packages fitPoly and fitPolyTools
#'@usage readDatfile(file, header=TRUE, sep="\t", check.names=FALSE, ...)
#'@param file the name of the file which the data are to be read from
#'@param header a logical value indicating whether the file contains the names
#'of the data.frame columns as its first line
#'@param sep the field separator character
#'@param check.names logical. If FALSE (default), column names are not checked.
#'This is important if column names are the names of samples, markers etc that
#'may not be syntactically valid variable names.\cr
#'If TRUE then the names of the variables in the data frame are checked to
#'ensure that they are syntactically valid variable names. If necessary they
#'are adjusted (by make.names) so that they are, and also to ensure that there
#'are no duplicates
#'@param ... Further arguments to be passed to read.table
#'@return A data.frame containing a representation of the data in the file
#'@export
readDatfile <- function(file, header=TRUE, sep="\t", check.names=FALSE, ...) {
#calls read.table with changed defaults, for reading a tab-separated
#text file, and without modifying column names
#The only compulsory argument is the file name
return(read.table(file=file, header=header, sep=sep,
check.names=check.names, ...))
} #readDatfile
#'@title User-friendly wrapper for write.table
#'@description A wrapper for write.table that has default parameter values for
#'writing tab-separated files as used in packages fitPoly and fitPolyTools
#'@usage writeDatfile(x, file, quote=FALSE, sep="\t", na="",
#'row.names=FALSE, col.names=TRUE, logical_01=FALSE, ...)
#'@param x the object to be written, preferably a matrix or data frame.
#'If not, it is attempted to coerce x to a data frame.
#'@param file either a character string naming a file or a connection open for
#'writing. "" indicates output to the console.
#'@param quote a logical value (TRUE or FALSE) or a numeric vector. If TRUE,
#'any character or factor columns will be surrounded by double quotes.
#'If a numeric vector, its elements are taken as the indices of columns to
#'quote. In both cases, row and column names are quoted if they are written.
#'If FALSE (default), nothing is quoted.
#'@param sep the field separator string. Values within each row of x are
#'separated by this string.
#'@param na the string to use for missing values in the data
#'@param row.names either a logical value indicating whether the row names of x
#'are to be written along with x, or a character vector of row names to be
#'written.
#'@param col.names either a logical value indicating whether the column names
#'of x are to be written along with x, or a character vector of column names
#'to be written.
#'@param logical_01 FALSE (default) or TRUE; if TRUE, logical columns of x
#'are converted to 0/1/NA numeric values before writing the file
#'@param ... Further arguments to be passed to write.table
#'@return NULL
#'@export
writeDatfile <- function(x, file, quote=FALSE, sep="\t", na="",
row.names=FALSE, col.names=TRUE,
logical_01=FALSE, ...) {
#calls write.table with changed defaults, for writing a tab-separated
#text file
#The only compulsory arguments are x and the file name
if (logical_01) {
#convert logicals to 0/1/NA for writing text file
if (is.logical(x)) x[x] <- 1 else if (is.data.frame(x)) {
for (col in seq_len(ncol(x))) {
if ("logical" %in% class(x[, col])) {
xc <- x[,col]
xc[xc] <- 1 #with NAs present we cannot do x[x[,col],col] <- 1
x[,col] <- xc
}
}
}
}
write.table(x=x, file=file, quote=quote, sep=sep, na=na,
row.names=row.names, col.names=col.names, ...)
} #writeDatfile
#padded returns a string representing positive integer x left-padded with 0's
#to the length of integer maxx or of max(x)
padded <- function(x, maxx=0) {
formatC(x, width=nchar(max(x, maxx, na.rm=TRUE)), flag="0")
}
#*************************************************************************
gcd <- function(x,y) {
#return the greatest common divisor of two integers
#(vectorized: x and y may be vectors)
r <- x %% y;
return(ifelse(r, gcd(y, r), y))
} #gcd
gcd_all <- function(x) {
#return the greatest common divisor of all elements of vector x
if (length(x) == 1) x else {
if (length(x) == 2) gcd(x[1], x[2]) else
gcd_all(c(gcd(x[1], x[2]), x[3:length(x)]))
}
} #gcd_all
polygamfrq <- function(ploidy, dosage) {
#get the integer ratios of gametes with dosage 0:(ploidy/2)
#from an autopolyploid parent (polysomic inheritance)
#with given ploidy and dosage
#the result can be converted to fractions by y <- x/(sum(x))
gp <- ploidy/2 #gamete ploidy
#using hypergeometric distribution:
#dhyper(0:gp, dosage, ploidy-dosage, gp)
#using n-over-k: function choose, in order to return integer ratios:
A <- choose(dosage, 0:gp)
B <- choose(ploidy-dosage, gp-(0:gp))
#C <- choose(ploidy, gp)
#A * B / C this would give the answer as fractions,
# like dhyper(0:gp, dosage, ploidy-dosage, gp)
D <- round(A*B)
d <- gcd_all(D[D>0])
round(D/d)
}
digamfrq <- function(ploidy, dosage) {
#get the integer ratios of gametes with dosage 0:(ploidy/2)
#from an allopolyploid parent (disomic inheritance)
#with given ploidy and dosage
gp <- ploidy/2 #gamete ploidy
#first step: calculate all possible compositions of diploid parental genomes
#i.e. the numbers of subgenomes with nulliplex, simplex and duplex dosages
maxdup <- dosage %/% 2 #max nr of subgenomes with duplex dosage
mindup <- max(dosage-gp, 0) #min nr of subgenomes with duplex dosage
genomes <- matrix(integer((maxdup-mindup+1)*3), ncol=3)
colnames(genomes) <- c("nulli", "sim", "du")
for (i in 1:nrow(genomes)) {
genomes[i,3] <- mindup + i - 1
genomes[i,2] <- dosage - 2*genomes[i,3]
genomes[i,1] <- gp - sum(genomes[i, 2:3])
}
#next step: per genomic composition calculate the gamete distributions:
gam <- matrix(numeric(nrow(genomes)*(gp+1)), nrow=nrow(genomes)) #all 0
colnames(gam) <- 0:gp
for (i in 1:nrow(genomes)) {
mingamdos <- genomes[i,3] #each duplex genome contributes 1
maxgamdos <- gp-genomes[i,1] #each nulliplex genome contributes 0
bincoeff <- choose(genomes[i,2], 0:genomes[i,2])
# to return fractions:
#gam[i, (mingamdos+1):(maxgamdos+1)] <- bincoeff / sum(bincoeff)
#to return integer ratios:
gam[i, (mingamdos+1):(maxgamdos+1)] <- bincoeff
#the smallest bincoeff is always 1 so division by the gcd is not needed
}
#list(genomes, gam)
gam
} #digamfrq
makeProgeny <- function(gametes1, gametes2) {
#gametes1, gametes2: vectors of the integer ratios of the gametes
# produced by the two parents
#returns a vector with the integer ratios of the F1 progeny
gp1 <- length(gametes1)-1 #gametes1 ploidy
gp2 <- length(gametes2)-1 #gametes2 ploidy
m <- matrix(0, nrow=gp1+gp2+1, ncol=gp2+1)
for (i in 1:(gp2+1))
m[i:(i+gp1), i] <- gametes1 * gametes2[i]
D <- round(rowSums(m)) #the integer ratios of the F1 generation
names(D) <- 0:(length(D)-1)
d <- gcd_all(D[D>0])
round(D/d) #the integer ratios of the F1 generation simplified
} #makeProgeny
# calcSegtypeInfo ***********************************************************
#'@title Build a list of segregation types
#'@description For each possible segregation type in an F1 progeny with given
#'parental ploidy (and ploidy2, if parent2 has a different ploidy than parent1)
#'information is given on the segregation ratios, parental dosages and whether
#'the segregation is expected under polysomic, disomic and/or mixed inheritance.
#'
#'@usage calcSegtypeInfo(ploidy, ploidy2=NULL)
#'
#'@param ploidy The ploidy of parent 1 (must be even, 2 (diploid) or larger).
#'@param ploidy2 The ploidy of parent 2. If omitted (default=NULL) it is
#'assumed to be equal to ploidy.
#'@details The names of the segregation types consist of a short sequence of
#'digits (and sometimes letters), an underscore and a final number. This is
#'interpreted as follows, for example segtype 121_0: 121 means that there
#'are three consecutive dosages in the F1 population with frequency ratios 1:2:1,
#'and the 0 after the underscore means that the lowest of these dosages is
#'nulliplex. So 121_0 means a segregation of 1 nulliplex : 2 simplex : 1 duplex.
#'A monomorphic F1 (one single dosage) is indicated as e.g. 1_4 (only one
#'dosage, the 4 after the underscore means that this is monomorphic quadruplex).
#'If UPPERCASE letters occur in the first part of the name these are interpreted
#'as additional digits with values of A=10 to Z=35, e.g. 18I81_0 means a
#'segregation of 1:8:18:8:1 (using the I as 18), with the lowest dosage being
#'nulliplex.\cr
#'With higher ploidy levels higher numbers (above 35) may be required.
#'In that case each unique ratio number above 35 is assigned a lowercase letter.
#'E.g. one segregation type in octaploids is 9bcb9_2: a 9:48:82:48:9
#'segregation where the lowest dosage is duplex.\cr
#'Segregation types with more than 5 dosage classes are considered "complex"
#'and get codes like c7e_1 (again in octoploids): this means a complex type
#'(the first c) with 7 dosage classes; the e means that this is the fifth
#'type with 7 classes. Again the _1 means that the lowest dosage is simplex.
#'It is always possible (and for all segtype names with lowercase letters it is
#'necessary) to look up the actual segregation ratios in the intratio item
#'of the segtype. For octoploid segtype c7e_1 this shows 0:1:18:69:104:69:18:1:0
#'(the two 0's mean that nulli- and octoplexes do not occur).
#'
#'@return A list with for each different segregation type (segtype) one item.
#'The names of the items are the names of the segtypes.
#'Each item is itself a list with components:
# (alternatives for itemize: enumerate and describe,
# see https://cran.r-project.org/web/packages/roxygen2/vignettes/formatting.html)
#'\itemize{
#'\item{freq: a vector of the ploidy+1 fractions of the dosages in the F1}
#'\item{intratios: an integer vector with the ratios as the simplest integers}
#'\item{expgeno: a vector with the dosages present in this segtype}
#'\item{allfrq: the allele frequency of the dosage allele in the F1}
#'\item{polysomic: boolean: does this segtype occur with polysomic inheritance?}
#'\item{disomic: boolean: does this segtype occur with disomic inheritance?}
#'\item{mixed: boolean: does this segtype occur with mixed inheritance (i.e. with
#'polysomic inheritance in one parent and disomic inheritance in the other)?}
#'\item{pardosage: integer matrix with 2 columns and as many rows as there
#'are parental dosage combinations for this segtype;
#'each row has one possible combination of dosages for
#'parent 1 (1st column) and parent 2 (2nd column)}
#'\item{parmode: logical matrix with 3 columns and the same number of rows as
#'pardosage. The 3 columns are named polysomic, disomic and mixed and
#'tell if this parental dosage combination will generate this
#'segtype under polysomic, disomic and mixed inheritance}
#'}
#'@examples
#'si4 <- calcSegtypeInfo(ploidy=4) # two 4x parents: a 4x F1 progeny
#'print(si4[["11_0"]])
#'
#'si3 <- calcSegtypeInfo(ploidy=4, ploidy2=2) # a 4x and a diplo parent: a 3x progeny
#'print(si3[["11_0"]])
#'@export
calcSegtypeInfo <- function(ploidy=4, ploidy2=NULL) {
if (ploidy %% 2 != 0) stop("calcSegtypeInfo: odd ploidy not allowed")
if (is.null(ploidy2)) ploidy2 <- ploidy else
if (ploidy2 %% 2 != 0) stop("calcSegtypeInfo: odd ploidy2 not allowed")
ploidyF1 <- (ploidy + ploidy2) / 2
result <- list()
#we check all parental combinations and enumerate the segtypes:
for (p1 in 0:ploidy) {
if (p1 %in% c(0, 1, ploidy-1, ploidy)) {
p1ir <- matrix(polygamfrq(ploidy, p1), nrow=1)
p1mode <- 0 #0=don't care
} else {
p1ir <- rbind(polygamfrq(ploidy, p1), digamfrq(ploidy, p1))
p1mode <- c(1, rep(2, nrow(p1ir)-1)) #1=polysomic, 2=disomic
}
if (ploidy2 == ploidy) maxp2 <- p1 else maxp2 <- ploidy2
for (p2 in 0:maxp2) {
if (p2 %in% c(0, 1, ploidy2-1, ploidy2)) {
p2ir <- matrix(polygamfrq(ploidy2, p2), nrow=1)
p2mode <- 0 #0=don't care
} else {
p2ir <- rbind(polygamfrq(ploidy2, p2), digamfrq(ploidy2, p2))
p2mode <- c(1, rep(2, nrow(p2ir)-1)) #1=polysomic, 2=disomic
}
for (i1 in 1:nrow(p1ir)) for (i2 in 1:nrow(p2ir)) {
ir <- makeProgeny(p1ir[i1,], p2ir[i2,])
#check if this segtype already exists in result:
r <- length(result)
while (r > 0 && sum(ir != result[[r]]$intratios) > 0) r <- r-1
if (r == 0) {
#new segtype:
r <- length(result) + 1
result[[r]] <- list()
result[[r]]$freq <- ir/sum(ir)
result[[r]]$intratios <- ir
result[[r]]$expgeno <- which(ir > 0) - 1; names(result[[r]]$expgeno) <- NULL
result[[r]]$allfrq <- sum(result[[r]]$freq * 0:ploidyF1) / ploidyF1
result[[r]]$polysomic <- FALSE
result[[r]]$disomic <- FALSE
result[[r]]$mixed <- FALSE
result[[r]]$pardosage <- matrix(c(p1, p2), nrow=1)
colnames(result[[r]]$pardosage) <- c("P1", "P2")
result[[r]]$parmode <- matrix(FALSE, nrow=1, ncol=3)
colnames(result[[r]]$parmode) <- c("polysomic", "disomic", "mixed")
p <- 1
} else {
#segtype already existed, see if parental combination
#already existed:
p <- nrow(result[[r]]$pardosage)
while (p > 0 &&
sum(c(p1,p2) != result[[r]]$pardosage[p,]) > 0) p <- p-1
if (p == 0) {
#this parental dosage combination is new
p <- nrow(result[[r]]$pardosage) + 1
result[[r]]$pardosage <- rbind(result[[r]]$pardosage, c(p1, p2))
result[[r]]$parmode <- rbind(result[[r]]$parmode,
rep(FALSE, 3))
}
}
result[[r]]$parmode[p, 1] <- result[[r]]$parmode[p, 1] ||
(p1mode[i1] != 2 && p2mode[i2] != 2) #is polysomic?
result[[r]]$parmode[p, 2] <- result[[r]]$parmode[p, 2] ||
(p1mode[i1] != 1 && p2mode[i2] != 1) #is disomic?
result[[r]]$parmode[p, 3] <- result[[r]]$parmode[p, 3] ||
(p1mode[i1] == 0 || p2mode[i2] == 0 || p1mode[i1] != p2mode[i2]) #mixed?
} #for i1 for i2
} #for p2
} #for p1
#next we check if each segtype occurs with poly- or disomic or
#mixed inheritance,
#we add all reciprocal parental dosages,
#we order all pardosage and parmode in order of increasing p1,
#we check which intratios occur in segtypes with less than complex classes,
#and we compile the numdosages and firstdosage statistics:
numdosages <- firstdosage <- maxratio <- integer(length(result))
complex <- 6 #the number of dosage classes from which we consider the segtype
# complex, i.e. we don't show the full segregation in the name
simpleratios <- integer(0)
for (r in seq_along(result)) {
#calculate poly- and disomic and mixed:
result[[r]]$polysomic <- sum(result[[r]]$parmode[, 1]) > 0
result[[r]]$disomic <- sum(result[[r]]$parmode[, 2]) > 0
result[[r]]$mixed <- sum(result[[r]]$parmode[, 3]) > 0
#add the reverse of parents with unequal dosage:
if (ploidy2 == ploidy) {
#if ploidy2 not null, all parental combinations already done
uneq <- result[[r]]$pardosage[,1] != result[[r]]$pardosage[,2]
if (sum(uneq) > 0) {
#there are rows with unequal parental dosage
result[[r]]$pardosage <- rbind(result[[r]]$pardosage,
result[[r]]$pardosage[uneq, 2:1])
result[[r]]$parmode <- rbind(result[[r]]$parmode,
result[[r]]$parmode[uneq,])
}
}
#sort pardosage and parmode in order of increasing P1 dosage:
or <- order(result[[r]]$pardosage[,1])
result[[r]]$pardosage <- result[[r]]$pardosage[or,, drop=FALSE]
result[[r]]$parmode <- result[[r]]$parmode[or,, drop=FALSE]
#add the intratios in not complex:
if (length(result[[r]]$expgeno) < complex)
simpleratios <- union(simpleratios, result[[r]]$intratios)
#calculate statistics for later ordering of result list:
numdosages[r] <- length(result[[r]]$expgeno)
firstdosage[r] <- result[[r]]$expgeno[1]
maxratio[r] <- max(result[[r]]$intratios)
}
#next we reorder result from simple to complex segtypes
or <- order(numdosages, maxratio, firstdosage)
result <- result[or]
#finally we assign names according to numdosages and firstdosage:
numdosages <- numdosages[or]
firstdosage <- firstdosage[or]
maxratio <- maxratio[or]
simple <- max(which(numdosages < complex))
alldigits <- c("0","1","2","3","4","5","6","7","8","9", LETTERS)
#we translate the intratios of non-complex segtypes to a sequence of digits
#and uppercase letters (for ratios 10:35); if any ratios > 35 occur (as in
#octo- and dodecaploids) we assign one of the lowercase letters for each of
#these dosages (and assume there are less than 27 unique ratios above 35).
#Just using the lowercase letter for ratios 36:61 does not work as already
#with octoploids some intratios of non-complex segtypes are above 61.
simpleratios <- sort(simpleratios[simpleratios > 35])
for (r in 1:simple) {
chr <- character(complex-1)
name <- ""
ir <- result[[r]]$intratios
for (i in 1:(ploidyF1+1)) {
if (ir[i] > 0) {
if (ir[i] <= 35) {
name <- paste(name, alldigits[ir[i] + 1], sep="")
} else name <- paste(name, letters[which(simpleratios == ir[i])], sep="")
}
}
names(result)[r] <- paste(name, firstdosage[r], sep="_")
}
if (complex <= max(numdosages)) for (n in complex:max(numdosages)) {
#we want to assign codes like c6a_1, where:
#c just means complex
#6 is the number of different dosages in the segtype
#c6a, c6b etc are segtypes with different ratio combinations, all with 6 dosages
#_1 means that the first dosage is simplex
selnum <- numdosages == n
maxrat <- sort(unique(maxratio[selnum]))
for (m in 1:length(maxrat)) {
seltype <- selnum & maxratio == maxrat[m]
if (length(maxrat) == 1) name <- paste("c", n, sep="") else
name <- paste("c", n, letters[m], sep="")
firstrange <- sort(unique(firstdosage[selnum]))
for (f in firstrange) {
sel <- seltype & firstdosage == f
names(result)[sel] <- paste(name, f, sep="_")
}
}
}
result
} #calcSegtypeInfo
#'@title conversion of segtype code to F1 segregation ratios
#'@description Produce a matrix with the F1 segregation ratios (as integers)
#'for all segregation types for the given ploidy
#'@usage listSegtypes(ploidy, ploidy2=NULL)
#'@param ploidy the ploidy of parent 1 of the F1, or of the F1 itself if
#'ploidy2 is NULL
#'@param ploidy2 the ploidy of parent 2 of the F1
#'@details If ploidy2 is not NULL, ploidy and ploidy2 are the ploidy levels of
#'the two parents, and both must be even. If ploidy2 is NULL, ploidy is
#'the ploidy of the F1; if even, both parents are assumed to have the same
#'ploidy; if odd, parent 1 and parent 2 are asumed to have ploidy-1 and ploidy+1.
#'This function calls calcSegtypeInfo and is a convenience function; it is not
#'used by any other functions. For more information, including parental
#'dosages for each segregation, use calcSegtypeInfo and segtypeInfoSummary.
#'@return a matrix with one row for each segregation type and one column for
#'each possible F1 dosage, with the integer ratios of the dosages for each
#'segregation type.
#'@export
listSegtypes <- function(ploidy, ploidy2=NULL) {
if (is.null(ploidy2)) {
if (ploidy %% 2 != 0) {
#odd ploidy
ploidy2 <- ploidy +1
ploidy <- ploidy - 1
} else ploidy2 <- ploidy
} else {
if (ploidy %% 2 != 0 || ploidy2 %%2 != 0)
stop("listSegtypes: odd parental ploidy not allowed")
}
ploidyF1 <- (ploidy+ploidy2)/2
seginfo <- calcSegtypeInfo(ploidy=ploidy, ploidy2=ploidy2)
result <- matrix(nrow=length(seginfo), ncol=ploidyF1 + 1,
dimnames=list(names(seginfo), 0:ploidyF1))
for (i in 1:length(seginfo))
result[i,] <- seginfo[[i]]$intratios
result
} #listSegtypes
# selSegtypeInfo ***********************************************************
#'@title Restrict a list of segregation types to specified inheritance modes
#'@description From a list of segregation types as produced by calcSegtypeInfo,
#'this function selects only those segtypes that occur with polysomic,
#'disomic and/or mixed inheritance if the corresponding parameters are set to
#'TRUE, and from those segtypes only the parental dosages with the same
#'restrictions are retained.
#'@usage selSegtypeInfo(segtypeInfo, polysomic, disomic, mixed,
#'selfing=FALSE)
#'@param segtypeInfo a list as returned by calcSegtypeInfo
#'@param polysomic If TRUE all segtypes with poly TRUE are retained, and from
#'those segtypes all parental dosage combinations with parmode[,1] TRUE
#'@param disomic If TRUE all segtypes with di TRUE are retained, and from
#'those segtypes all parental dosage combinations with parmode[,2] TRUE
#'@param mixed If TRUE all segtypes with mixed TRUE are retained, and from
#'those segtypes all parental dosage combinations with parmode[,3] TRUE
#'@param selfing if TRUE only segtypes are retained that can be the result of
#'a selfing, i.e. both parental dosages equal (although it allows disomic
#'segtypes where the parental distribution of alleles over the subgenomes is
#'different, e.g. aa/bb x ab/ab -> 121_1); default FALSE
#'@return A list like segtypeInfo, modified as specified by parameters polysomic,
#'disomic and mixed
#'@export
selSegtypeInfo <- function(segtypeInfo, polysomic, disomic, mixed,
selfing=FALSE) {
result <- list()
selected <- integer(0)
for (s in seq_along(segtypeInfo)) {
if ((polysomic && segtypeInfo[[s]]$polysomic) ||
(disomic && segtypeInfo[[s]]$disomic) ||
(mixed && segtypeInfo[[s]]$mixed)) {
stype <- segtypeInfo[[s]]
selpar <- (polysomic & stype$parmode[,1]) |
(disomic & stype$parmode[,2]) |
(mixed & stype$parmode[,3])
stype$pardosage <- stype$pardosage[selpar,, drop=FALSE]
stype$parmode <- stype$parmode[selpar,, drop=FALSE]
if (selfing) {
selpar <- stype$pardosage[,1] == stype$pardosage[,2]
stype$pardosage <- stype$pardosage[selpar,, drop=FALSE]
stype$parmode <- stype$parmode[selpar,, drop=FALSE]
}
if (nrow(stype$pardosage) > 0) {
selected <- c(selected, s)
r <- length(result) + 1
result[[r]] <- stype
}
}
}
names(result) <- names(segtypeInfo)[selected]
result
} #selSegtypeInfo
# segtypeInfoSummary *********************************************************
#'@title Summarize the segtypeInfo list
#'@description From a list of segregation types as produced by calcSegtypeInfo
#'or selSegtypeInfo, produce a data frame that only lists the parental
#'dosage combinations for each segtype and whether these produce the
#'segtype under polysomic, disomic and/or mixed inheritance.
#'Useful to quickly look up which segtypes match a given parental dosage
#'combination.
#'@usage segtypeInfoSummary(segtypeInfo)
#'
#'@param segtypeInfo a list as returned by calcSegtypeInfo or selSegtypeInfo
#'@return A data frame summarizing the segtypeInfo list, with columns:
#'\itemize{
#'\item{segtype: the name of the segregation type (see details of
#'calcSegtypeInfo)}
#'\item{segtypenr: the sequential number of the segtype in parameter segtypeInfo}
#'\item{parent1, parent2: dosages of the two parents}
#'\item{par.poly, par.di, par.mixed: whether these parental dosages produce this
#'segtype under polysomic, disomic and/or mixed inheritance}
#\item{segtype.poly, segtype.di, segtype.mixed: whether this segtype does occur
#under polysomic, disomic and/or mixed inheritance}
#'}
#'@export
segtypeInfoSummary <- function(segtypeInfo) {
totrows <- 0
for (i in 1:length(segtypeInfo))
totrows <- totrows + nrow(segtypeInfo[[i]]$pardosage)
result <- data.frame (segtype=character(totrows),
segtypenr=integer(totrows),
parent1=integer(totrows),
parent2=integer(totrows),
par.poly=integer(totrows),
par.di=integer(totrows),
par.mixed=integer(totrows),
#segtype.poly=integer(totrows),
#segtype.di=integer(totrows),
#segtype.mixed=integer(totrows),
stringsAsFactors=FALSE)
r <- 1
for (i in 1:length(segtypeInfo)) {
for (p in 1:nrow(segtypeInfo[[i]]$pardosage)) {
result$segtype[r] = names(segtypeInfo)[i]
result$segtypenr[r] = i
result$parent1[r] = segtypeInfo[[i]]$pardosage[p,1]
result$parent2[r] = segtypeInfo[[i]]$pardosage[p,2]
result$par.poly[r] = segtypeInfo[[i]]$parmode[p,1]
result$par.di[r] = segtypeInfo[[i]]$parmode[p,2]
result$par.mixed[r] = segtypeInfo[[i]]$parmode[p,3]
#result$segtype.poly[r] = 0 + segtypeInfo[[i]]$polysomic
#result$segtype.di[r] = 0 + segtypeInfo[[i]]$disomic
#result$segtype.mixed[r] = 0 + segtypeInfo[[i]]$mixed
r <- r + 1
}
}
result
} #segtypeInfoSummary
getMatchParents <- function(parGeno, seginfoItem) {
#parGeno: integer vector of length 2 with the two parental (consensus)
# dosages (can each be NA)
#seginfoItem: one of the items (segtypes) of a list as returned by
# calcSegtypeInfo
#does the specified segtype match the parental genotypes?
p <- which(!is.na(parGeno))
if (length(p) == 0) {
#both parental genotypes unknown
return("Unknown")
} else if (length(p) == 1) {
#only one of the parental genotypes known
if (sum(seginfoItem$pardosage[, p] == parGeno[p]) > 0)
return("OneOK") else return("No")
} else {
#both parental genotypes known
i <- which(seginfoItem$pardosage[, 1] == parGeno[1])
if (length(i) == 1 && parGeno[2] == seginfoItem$pardosage[i, 2])
return("Yes") else return("No")
}
} #getMatchParents
getConsensusGeno <- function(geno, maxNAfrac=0.499,
lowconf.NAfrac=0.751) {
#geno: a vector with scores (0..ploidy or NA) for all samples of one parent.
#maxNAfrac: the maximum fraction of missing scores allowed to assign a
# high-confidence consensus geno score; if more are
# missing a missing geno score is returned (the default requires
# MORE than half of the samples to be scored, so one missing out of
# 2 samples already causes a missing geno)
#lowconf.NAfrac: if the fraction missing scores is more than maxNAfrac but
# less than lowconf.NAfrac, or if there is only one sample,
# a lowconf.geno score is assigned
#returns a list with 3 components:
# $geno: the consensus of the dosages in vector geno. NA if geno is empty,
# if all elements of geno are NA, if there are different non-NA
# values in geno, or if the fraction of NA values in geno larger than
# maxNAfrac
# $lowconf.geno: if there is no conflict and the fraction of missing scores
# is between maxNAfrac and lowconf.NAfrac, the consensus is assigned
# as a "low-confidence" geno - this needs confirmation from the
# F1 segregation to be accepted
# $conflict: TRUE if there are different non-NA values in geno, else FALSE
# $NAfrac: the fraction NA in geno (0.5 if length(geno)==0)
nwgeno <- NA; nwlogeno <- NA; conflict <- FALSE; NAfrac <- 0
if (length(geno) == 0) NAfrac <- 0.5 else { #we need a NAfrac value for q2
NAfrac <- sum(is.na(geno) / length(geno))
if (NAfrac < 1.0) {
if (length(geno) == 1) nwlogeno <- geno else {
if (max(geno, na.rm=TRUE) != min(geno, na.rm=TRUE)) {
conflict <- TRUE
} else if (NAfrac <= maxNAfrac) {
nwgeno <- min(geno, na.rm=TRUE)
} else if (NAfrac <= lowconf.NAfrac) {
nwlogeno <- min(geno, na.rm=TRUE)
}
}
}
}
list(geno=nwgeno, lowconf.geno=nwlogeno, conflict=conflict, NAfrac=NAfrac)
} #getConsensusGeno
shiftdosages <- function(dosages, shift, ploidy) {
#dosages: vector of integer dosages, shift: one shift value
dosages <- dosages + shift
below <- !is.na(dosages) & dosages < 0
dosages[below] <- 0
above <- !is.na(dosages) & dosages > ploidy
dosages[above] <- ploidy
dosages
} #shiftdosages
checkSelfing <- function(parent1, parent2) {
parent1 <- sort(as.character(parent1[!is.na(parent1)]))
parent2 <- sort(as.character(parent2[!is.na(parent2)]))
length(parent1) == length(parent2) && length(parent1) > 0 &&
sum(parent1 != parent2) == 0
} #checkSelfing
linterpol <- function(x, pnt1, pnt2) {
#linear interpolation: returns the y values matching the x values (vector)
#on the line through points pnt1 and pnt2 (both are vectors of length 2)
#possible error if both X-coordinates equal not checked or caught
a <- (pnt2[2] - pnt1[2]) / (pnt2[1] - pnt1[1])
b <- pnt1[2] - a * pnt1[1]
a * x + b
} #linterpol
calc_q1 <- function(Pvalue, fracInvalid, bestfit, bestParentfit,
Pvalue_threshold, fracInvalid_threshold) {
#q1 is a measure of the fit of selfit. It is based on Pvalue and the fraction
#of valid F1 scores. If either is below their threshold the q1 value is 0.
#Also, if bestParentfit is different from bestfit this indicates a segregation
#distortion.
#x compares bestfit and bestParentfit:
if (bestfit == bestParentfit) {
x <- 1
} else {
x <- 0.5
}
#Version 20150220: nonzero but low y values also below P=0.001 depending
#on threshold:
if (Pvalue < Pvalue_threshold) y <- 0 else {
if (Pvalue < 0.01) {
if (Pvalue < 0.001) y <- linterpol(Pvalue, c(0,0), c(0.001, 0.05)) else
y <- linterpol(Pvalue, c(0.001, 0.05), c(0.01, 0.6))
} else if (Pvalue < 0.15) {
if (Pvalue < 0.05) y <- linterpol(Pvalue, c(0.01, 0.6), c(0.05, 0.8)) else
y <- linterpol(Pvalue, c(0.05, 0.8), c(0.15, 1))
} else y <- 1
}
#z is determined by fraction F1 scores in valid categories for the selfit
#segtype. z is equal to 1-fracInvalid, but below 1-frqinvalid.threshold z=0.
z <- ifelse(fracInvalid > fracInvalid_threshold, 0, 1 - fracInvalid)
x * y * z
} #calc_q1
calc_q2 <- function(par.conflicts, par.NAfrac, matchParents,
parentsScoredWithF1) {
#q2 is about how well the parents are scored and fit with the selfit segtype
if (parentsScoredWithF1) {
#q2 is determined not by matchParents but by the amount
#of conflicts and missing values in the parents:
q2 <- max(0, 1 - 0.5 * sum(par.conflicts) - 0.5 * sum(par.NAfrac))
#In the version of 20150220 matchParents cannot be "No": selfit is always
#bestParentfit (sometimes after promoting lowconf parental scores).
#Therefore the remaining matchParents codes Unknown, OneOK and Yes are
#correlated with the q2 score, no point to use the matchParents
} else {
# !parentsScoredWithF1
# In this case par.conflicts and par.NAfrac are not used and the information
# on the match between parents and F1 segtype has to come from matchParents.
# Note that matchParents=="No" cannot occur in the versions since 20150220.
q2 <- match(matchParents,
c("No", "Unknown", "OneOK", "Yes")) #1:4 or NA
#q2 <- c(0, 0.5, 0.9, 1)[q2] #translate into 0..1 values
q2 <- c(0, 0.65, 0.9, 1)[q2] #translate into 0..1 values
#version of 20160324: the penalty for Unknown is less if the parents
#are not scored with the F1: there is still the lack of information, but
#the lack of scores does not mean that the (F1) genotyping is less reliable.
}
q2
} #calc_q2
calc_q3 <- function(F1.NAfrac, fracNA_threshold) {
#q3 is about the fraction missing F1 scores.
#If that is larger than NA.threshold, q3=0;
#else q3 depends on the fraction scored F1's
if (F1.NAfrac > fracNA_threshold) {
q3 <- 0
} else if (fracNA_threshold > 0.2) {
if (F1.NAfrac >= 0.2) {
# 0.2<=F1.NAfrac<threshold
q3 <- linterpol(F1.NAfrac, c(0.2, 0.5), c(fracNA_threshold, 0))
} else {
# 0<=F1.NAfrac<0.2
q3 <- linterpol(F1.NAfrac, c(0, 1), c(0.2, 0.5))
}
} else {
# fracNA_threshold <= 0.2, 0<=F1.NAfrac<threshold
#: let q3 go from 1 at frac=0 to 0.5 at frac=threshold:
q3 <- linterpol(F1.NAfrac, c(0, 1), c(fracNA_threshold, 0.5))
}
q3
} #calc_q3
calc_qall <- function(Pvalue_threshold, fracInvalid_threshold,fracNA_threshold,
Pvalue, fracInvalid, F1.NAfrac,
matchParents,
bestfit, bestParentfit,
par.conflicts, par.NAfrac,
critweight,
parentsScoredWithF1) {
# 3 thresholds: can be disabled by setting to 0, 1 and 1
# Pvalue, fracInvalid and matchParents are properties of the selfit segtype
# F1.NAfrac is the fraction missing scores among F1 samples,
# bestfit and bestParentfit are two segtype, selfit is one of these
# par.conflicts is a logical vector: is there a conflict for P1, P2?
# par.NA is the number of NA scores for the samples of P1, P2
# critweight is a numeric vector of length 3 (3 quality criteria)
# with their weights, or NA (in that case qall = q1*q2*q3, not a
# weighted average)
# return value: a vector of length 4 or 5. The last (two) components of the
# vector are qall_mult (obtained by multiplication of q1..q3),
# and if critweights is supplied, qall_weights (a weighted
# average of q1..q3)
q <- numeric(ifelse (length(critweight) == 3, 5, 4)) #last (two) are qall
q[1] <- calc_q1(Pvalue, fracInvalid, bestfit, bestParentfit,
Pvalue_threshold, fracInvalid_threshold)
q[2] <- calc_q2(par.conflicts, par.NAfrac, matchParents,
parentsScoredWithF1)
q[3] <- calc_q3(F1.NAfrac, fracNA_threshold)
q[4] <- prod(q[1:3]) #q[4] is qall_mult
if (length(critweight) == 3) {
q[5] <- sum(q[1:3] * critweight) / sum(critweight) #q[5] is qall_weights
}
q
} #calc_qall
# checkF1 *******************************************************************
#'@title Identify the best-fitting F1 segregation types
#'@description For a given set of F1 and parental samples, this function
#'finds the best-fitting segregation type. It can perform a dosage shift prior
#'to selecting the segregation type.
#'
#'@usage checkF1(scores, parent1, parent2, F1, ancestors=character(0),
#'polysomic, disomic, mixed, ploidy, ploidy2, outfile,
#'critweight=c(1.0, 0.4, 0.4),
#'scores_long=TRUE, Pvalue_threshold=0.0001, fracInvalid_threshold=0.05,
#'fracNA_threshold=0.25, shiftmarkers, parentsScoredWithF1,
#'shiftParents=parentsScoredWithF1, showAll=FALSE, append_shf=FALSE)
#'
#'@param scores A data frame as read from the scores file produced by function
#'fitMarkers of package fitPoly (or a subset with at least columns
#'MarkerName, SampleName and geno), or a data frame as returned by function
#'scores2wide. In the first case (default) parameter scores_long must be TRUE,
#'in the second case it must be FALSE.
#'@param parent1 character vector with the sample names of parent 1
#'@param parent2 character vector with the sample names of parent 2
#'@param F1 character vector with the sample names of the F1 individuals
#'@param ancestors character vector with the sample names of any other
#'ancestors or other samples of interest. The dosages of these samples will
#'be shown in the output (shifted if shiftParents TRUE) but they are not used
#'in the selection of the segregation type.
#'@param polysomic if TRUE at least all polysomic segtypes are considered;
#'if FALSE these are not specifically selected (but if e.g. disomic is TRUE,
#'any polysomic segtypes that are also disomic will still be considered)
#'@param disomic if TRUE at least all disomic segtypes are considered (see
#'param polysomic)
#'@param mixed if TRUE at least all mixed segtypes are considered (see
#'param polysomic). A mixed segtype occurs when inheritance in one parent is
#'polysomic (random chromosome pairing) and in the other parent disomic (fully
#'preferential chromosome pairing)
#'@param ploidy The ploidy of parent 1 (must be even, 2 (diploid) or larger).
#'@param ploidy2 The ploidy of parent 2. If omitted it is
#'assumed to be equal to ploidy.
#'@param outfile the tab-separated text file to write the output to; if NA a
#'temporary file checkF1.tmp is created in the current working directory
#'and deleted at end
#'@param critweight NA or a numeric vector containing the weights of three quality
#'criteria; do not need to sum to 1. If NA, the output will not contain a
#'column qall_weights. Else the weights specify how qall_weights will be
#'calculated from quality parameters q1, q2 and q3.
#'@param scores_long TRUE if scores is in "long format", FALSE if it is in
#'"wide format" (see parameter scores)
#'@param Pvalue_threshold a minimum threshold value for the Pvalue of the
#'bestParentfit segtype (with a smaller Pvalue the q1 quality parameter will
#'be set to 0)
#'@param fracInvalid_threshold a maximum threshold for the fracInvalid of the
#'bestParentfit segtype (with a larger fraction of invalid dosages in the F1
#'the q1 quality parameter will be set to 0)
#'@param fracNA_threshold a maximum threshold for the fraction of unscored F1
#'samples (with a larger fraction of unscored samples in the F1
#'the q3 quality parameter will be set to 0)
#'@param shiftmarkers if specified, shiftmarkers must be a data frame with
#'columns MarkerName and shift; for the markernames that match exactly
#'(upper/lowercase etc) those in scores, the dosages are increased by the
#'amount specified in column shift,
#'e.g. if shift is -1, dosages 2..ploidy are converted to 1..(ploidy-1)
#'and dosage 0 is a combination of old dosages 0 and 1, for all samples.
#'The segregation check
#'is then performed with the shifted dosages.
#'A shift=NA is allowed, these markers will not be shifted.
#'The sets of markers in scores and shiftmarkers
#'may be different, but markers may occur only once in shiftmarkers.
#'A column shift is added at the end of the returned data frame.\cr
#'If parameter shiftParents is TRUE, the parental and ancestor scores are
#'shifted as the F1 scores, if FALSE they are not shifted.
#'@param parentsScoredWithF1 TRUE means parents are scored in the same experiment
#'and the same fitPoly run as the F1, else FALSE.
#'If TRUE, the fraction missing scores and conflicts in the parents
#'tell something about the quality of the scoring. If FALSE
#'(e.g. when the F1 is triploid and the parents are diploid and tetraploid) the
#'quality of the F1 scores can be independent of that of the parents.\cr
#'If not specified, TRUE is assumed if ploidy2 == ploidy and FALSE if
#'ploidy2 != ploidy
#'@param shiftParents only used if parameter shiftmarkers is specified. If TRUE,
#'apply the shifts also to the parental and ancestor scores.
#'By default TRUE if parentsScoredWithF1 is TRUE, else FALSE
#'@param showAll (default FALSE) if TRUE, for each segtype 3 columns
#'are added to the returned data frame with the frqInvalid, Pvalue and
#'matchParents values for these segtype (see the description of the return value)
#'@param append_shf if TRUE and parameter shiftmarkers is specified, _shf is
#'appended to all marker names where shift is not 0. This is not required for
#'any of the functions in this package but may prevent duplicated marker names
#'when using other software.
#'@details For each marker is tested how well the different segregation types
#'fit with the observed parental and F1 dosages. The results are summarized
#'by columns bestParentfit (which is the best fitting segregation type,
#'taking into account the F1 and parental dosages) and columns qall_mult
#'and/or qall_weights (how good is the fit of the bestParentfit segtype: 0=bad,
#'1=good).\cr
#'Column bestfit in the results gives the segtype best fitting the F1
#'segregation without taking account of the parents. This bestfit segtype is
#'used by function correctDosages, which tests for possible "shifts" in
#'the marker models. Both bestfit and bestParentfit are restricted by the
#'parameters polysomic, disomic and mixed. Further they are restricted to
#'segtypes that can only occur when the parental dosages are equal, if
#'parent1 and parent2 list the same samples (but not if both are empty).\cr
#'In case the parents are not scored together with the F1 (e.g. if the F1 is
#'triploid and the parents are diploid and tetraploid) the scores data frame
#'should be edited to contain the parental as well as the F1 scores.
#'In case the diploid and tetraploid parent are scored in the same run of
#'function fitMarkers (package fitPoly)
#'the diploid is initially scored as nulliplex-duplex-quadruplex (dosage 0, 2
#'or 4); that must be converted to the true diploid dosage scores (0, 1 or 2).
#'Similar corrections are needed with other combinations, such as a diploid
#'parent scored together with a hexaploid population etc.
#'
#'@return A data frame with one row per markers, with the following columns:
#'\itemize{
#'\item{m: the sequential number of the marker (as assigned by fitPoly)}
#'\item{MarkerName: the name of the marker, with _shf appended if the marker
#'is shifted and append_shf is TRUE}
#'\item{parent1: consensus dosage score of the samples of parent 1}
#'\item{parent2: consensus dosage score of the samples of parent 2}
#'\item{F1_0 ... F1_<ploidy>: the number of F1 samples with dosage scores
#'0 ... <ploidy>}
#'\item{F1_NA: the number of F1 samples with a missing dosage score}
#'\item{sample names of parents and ancestors: the dosage scores for those
#'samples}
#'\item{bestfit: the best fitting segtype, considering only the F1 samples}
#'\item{frqInvalid_bestfit: for the bestfit segtype, the frequency of F1 samples
#'with a dosage score that is invalid (that should not occur). The frequency is
#'calculated as the number of invalid samples divided by the number of non-NA
#'samples}
#'\item{Pvalue_bestfit: the chisquare test P-value for the observed
#'distribution of dosage scores vs the expected fractions. For segtypes
#'where only one dosage is expected (1_0, 1_1 etc) the binomial probability of
#'the number of invalid scores is given, assuming an error
#'rate of seg_invalidrate (hard-coded as 0.03)}
#'\item{matchParent_bestfit: indication how the bestfit segtype matches the
#'consensus dosages of parent 1 and 2: "Unknown"=both parental
#'dosages unknown; "No"=one or both parental dosages known
#'and conflicting with the segtype; "OneOK"= only one parental
#'dosage known, not conflicting with the segtype; "Yes"=both
#'parental dosages known and combination matching with
#'the segtype. This score is initially assigned based on
#'only high-confidence parental consensus scores; if
#'low-confidence dosages are confirmed by the F1, the
#'matchParent for (only) the selected segtype is
#'updated, as are the parental consensus scores.}
#'\item{bestParentfit: the best fitting segtype that does not conflict with
#'the parental consensus scores}
#'\item{frqInvalid_bestParentfit, Pvalue_bestParentfit,
#'matchParent_bestParentfit: same as the corresponding columns for bestfit.
#'Note that matchParent_bestParentfit cannot be "No".}
#'\item{q1_segtypefit: a value from 0 (bad) to 1 (good), a measure of the fit of
#'the bestParentfit segtype based on Pvalue, invalidP and whether bestfit is
#'equal to bestParentfit}
#'\item{q2_parents: a value from 0 (bad) to 1 (good), based either on the
#'quality of the parental scores (the number of missing scores and of
#'conflicting scores, if parentsScoredWithF1 is TRUE) or on matchParents
#'(No=0, Unknown=0.65, OneOK=0.9, Yes=1, if parentsScoredWithF1 is FALSE)}
#'\item{q3_fracscored: a value from 0 (bad) to 1 (good), based on the fraction
#'of F1 samples that have a non-missing dosage score}
#'\item{qall_mult: a value from 0 (bad) to 1 (good), a summary quality score
#'equal to the product q1*q2*q3. Equal to 0 if any of these is 0, hence
#'sensitive to thresholds; a natural selection criterion would be to accept
#'all markers with qall_mult > 0}
#'\item{qall_weights: a value from 0 (bad) to 1 (good), a weighted average of
#'q1, q2 and q3, with weights as specified in parameter critweight. This column is
#'present only if critweight is specified. In this case there is no "natural"
#'threshold; a threshold for selection of markers must be obtained by inspecting
#'XY-plots of markers over a range of qall_weights values}
#'\item{shift: if shiftmarkers is specified a column shift is added with
#'for all markers the applied shift (for the unshifted markers the shift value
#'is 0)}
#'}
#'qall_mult and/or qall_weights can be used to compare the quality
#'of the SNPs within one analysis and one F1 population but not between analyses
#'or between different F1 populations.\cr
#'If parameter showAll is TRUE there are 3 additional columns for each
#'segtype with names frqInvalid_<segtype>, Pvalue_<segtype> and
#'matchParent_<segtype>; see the corresponding columns for bestfit for an
#'explanation. These extra columns are inserted directly before the bestfit
#'column.
#'@export
checkF1 <- function(scores,
parent1,
parent2,
F1,
ancestors=character(0),
polysomic, disomic, mixed,
ploidy, ploidy2,
outfile,
critweight=c(1.0, 0.4, 0.4),
scores_long=TRUE,
Pvalue_threshold=0.0001,
fracInvalid_threshold=0.05,
fracNA_threshold=0.25,
shiftmarkers,
parentsScoredWithF1,
shiftParents=parentsScoredWithF1,
showAll=FALSE,
append_shf=FALSE) {
if (ploidy %% 2 != 0) stop("checkF1: odd ploidy not allowed")
if (missing(ploidy2) || is.na(ploidy2)) ploidy2 <- ploidy else
if (ploidy2 %% 2 != 0) stop("checkF1: odd ploidy2 not allowed")
ploidyF1 <- (ploidy + ploidy2) / 2
if (missing(parentsScoredWithF1))
parentsScoredWithF1 <- ploidy2 == ploidy
#check critweight
if (is.null(critweight) || is.na(critweight[1])) critweight <- NA else {
if (!is.numeric(critweight) || length(critweight) != 3 ||
sum(is.na(critweight)) > 0 || sum(critweight) == 0) {
stop("invalid critweight")
}
}
#check shiftmarkers:
if (missing(shiftmarkers)) shiftmarkers <- NA
if (is.data.frame(shiftmarkers)) {
if (sum(is.na(match(c("MarkerName", "shift"), names(shiftmarkers)))) > 0)
stop("checkF1: shiftmarkers must have columns MarkerName and shift")
shiftmarkers$shift[is.na(shiftmarkers$shift)] <- 0
if (nrow(shiftmarkers) == 0) shiftmarkers <- NA else {
if (sum(is.na(shiftmarkers$shift)) > 0 ||
sum(!(shiftmarkers$shift %in% ((-ploidyF1):ploidyF1))) > 0)
stop("checkF1: shiftmarkers contains invalid shift values")
shiftmarkers$MarkerName <- as.character(shiftmarkers$MarkerName)
if (length(unique(shiftmarkers$MarkerName)) < nrow(shiftmarkers))
stop("checkF1: some markernames occur more than once in shiftmarkers")
}
}
seg_invalidrate <- 0.03 #assumed percentage of scoring errors leading to invalid dosages
file.del <- is.na(outfile) || outfile == ""
if (file.del) outfile <- "checkF1.tmp"
if (!checkFilename(outfile))
stop(paste("checkF1: cannot write file", outfile))
#fill in missing vectors and/or clean up:
if (missing(parent1) || is.logical(parent1)) parent1 <- character(0)
if (missing(parent2) || is.logical(parent2)) parent2 <- character(0)
if (missing(ancestors) || is.logical(ancestors)) ancestors <- character(0)
parent1 <- sort(as.character(parent1[!is.na(parent1)]))
parent2 <- sort(as.character(parent2[!is.na(parent2)]))
ancestors <- as.character(ancestors[!is.na(ancestors)])
#select the relevant segtypes:
if (!polysomic && !disomic && !mixed)
stop("checkF1: at least one of polysomic, disomic and mixed must be TRUE")
seginfo <- calcSegtypeInfo(ploidy, ploidy2)
allsegtypenames <- names(seginfo)
seginfo <- selSegtypeInfo(seginfo, polysomic, disomic, mixed,
selfing=checkSelfing(parent1, parent2))
seginfoSummary <- segtypeInfoSummary(seginfo)
# names of quality parameters:
qnames <- c("q1_segtypefit"," q2_parents","q3_fracscored","qall_mult")
if (length(critweight) == 3) qnames[5] <- "qall_weights"
# check scores:
if (scores_long) {
if (anyNA(match(c("marker", "MarkerName", "SampleName", "geno"),
names(scores))))
stop("checkF1: some required columns missing from scores")
if (anyNA(scores$marker) || anyNA(scores$MarkerName) ||
anyNA(scores$SampleName))
stop("checkF1: no missing values allowed in columns marker, MarkerName or SampleName of scores")
mrknames <- sort(as.character(unique(scores$MarkerName)))
} else {
if (length(names(scores)) < 2 ||
!identical(c("marker", "MarkerName"), names(scores)[1:2]))
stop("checkF1: first two columns of scores should be marker and MarkerName")
if (anyNA(scores$marker) || anyNA(scores$MarkerName))
stop("checkF1: no missing values allowed in columns marker and MarkerName of scores")
mrknames <- as.character(scores$MarkerName)
}
createResultsdf <- function(mrkcount) {
#function within checkF1
saf <- getOption("stringsAsFactors")
options(stringsAsFactors = FALSE)
#create the results data frame for this batch:
mat <- matrix(
integer((2+ploidyF1+2+length(parent1)+length(parent2)+length(ancestors)) *
mrkcount), nrow=mrkcount)
colnames(mat) <- c("parent1", "parent2", paste("F1", 0:ploidyF1, sep="_"),
"F1_NA", parent1, parent2, ancestors)
#segdf is the template for all individual segtype data frames
#(sets of 3 columns, and one row for each marker):
segdf <- data.frame(frqInvalid=character(mrkcount), #printed as formatted strings
Pvalue=character(mrkcount),
matchParent=factor(rep(NA,mrkcount),
levels=c("No","OneOK","Unknown","Yes")))
#bres is the data frame that will hold the results (one row per marker)
bres <- data.frame(
m=integer(mrkcount),
MarkerName=character(mrkcount),
mat
)
if (showAll) for (sg in 1:length(seginfo)) {
df <- segdf
names(df) <- paste(names(df), names(seginfo)[sg], sep="_")
bres <- data.frame(bres, df)
}
segdf <- data.frame(
fit = factor(rep(NA, mrkcount), levels=names(seginfo)),
segdf
)
for (fi in c("bestfit", "bestParentfit")) {
df <- segdf
names(df) <- c(fi,
paste(names(df)[2:4], fi, sep="_"))
bres <- data.frame(bres, df)
}
bres <- data.frame (
bres,
q1_segtypefit=character(mrkcount),
q2_parents=character(mrkcount),
q3_fracscored=character(mrkcount),
qall_mult=character(mrkcount)
)
if (length(critweight) == 3)
bres <- data.frame(
bres,
qall_weights=character(mrkcount)
)
if (is.data.frame(shiftmarkers))
bres <- data.frame(
bres,
shift=integer(mrkcount))
options(stringsAsFactors = saf)
bres
} #createResultsdf within checkF1
segtypeBestSelcrit <- function(candidates) {
#function within checkF1
#candidates is a vector indexing the segtypes in results from which
# the one with the best selcrit must be selected
#return value: index to that segtype, or 0 if none
if (length(candidates) == 0) return(0)
candSelcrit <- selcrit[candidates]
candidates[which.max(candSelcrit)]
} # segtypeBestSelcrit within checkF1
compareFit <- function(newsegtype, oldsegtype) {
#function within checkF1
#newsegtype, oldsegtype: two indices into results
#return value: TRUE if newsegtype is selected,
# FALSE if oldsegtype is selected
(results$fracInvalid[newsegtype] <=
max(0.05, 1.5 * results$fracInvalid[oldsegtype])) &&
(results$Pvalue[newsegtype] >=
min(0.01, 0.1 * results$Pvalue[oldsegtype]))
} #compareFit within checkF1
#subsetting from a big scores data.frame takes a lot of time
#(approx 2 sec for each marker, in a file of ~26000 markers and 480 samples)
#Therefore we subdivide it in batches of 100 markers
#if not scores_long this probably is less relevant but doesn't do any harm
batchsize <- 100
batchnr <- 1
while (batchsize * (batchnr-1) < length(mrknames)) {
minmrk <- batchsize*(batchnr-1) + 1
maxmrk <- min(length(mrknames), batchsize*batchnr)
if (scores_long) {
#batchscores <- subset(scores, subset=markername %in% mrknames[minmrk:maxmrk]) #only 260 * 2 sec
batchscores <- scores[scores$MarkerName %in% mrknames[minmrk:maxmrk],] #only 260 * 2 sec
} else {
batchscores <- scores[scores$MarkerName %in% mrknames[minmrk:maxmrk],,
drop=FALSE]
}
bres <- createResultsdf(maxmrk - minmrk + 1) #bres: batchresults
for (mrk in minmrk:maxmrk) {
if (scores_long) {
sc <- batchscores[batchscores$MarkerName == mrknames[mrk],] #very fast now
if ("marker" %in% names(sc)) {
mrknr <- sc$marker[1] #marker number in fitPoly input
} else mrknr <- mrk
} else {
mrknr <- which(batchscores$MarkerName == mrknames[mrk])
sc <- data.frame(SampleName=names(batchscores)[3:length(batchscores)],
geno=as.integer(batchscores[mrknr, 3:length(batchscores)]))
mrknr <- batchscores$marker[mrknr] #marker number in fitPoly input
}
#get all the parent and ancestor genotypes in the order of the given vectors:
parent.geno <- list()
parent.geno[[1]] <- sc$geno[match(parent1, sc$SampleName)]
parent.geno[[2]] <- sc$geno[match(parent2, sc$SampleName)]
ancestors.geno <- sc$geno[match(ancestors, sc$SampleName)]
F1.geno <- sc$geno[sc$SampleName %in% F1] #F1 does not have to be ordered
shift <- 0
if (is.data.frame(shiftmarkers)) {
whichshift <- which(shiftmarkers$MarkerName == mrknames[mrk])
if (length(whichshift) == 1) {
shift <- shiftmarkers$shift[whichshift]
if (shift != 0) {
#changed: now all shifted dosages outside 0..ploidyF1 are set
#to 0 or ploidyF1 (i.e. dosages that were separated are now merged),
#before 20161130 they were set to NA
#F1.geno <- ancestors.geno + shift
#F1.geno[!(F1.geno %in% 0:ploidyF1)] <- NA
F1.geno <- shiftdosages(F1.geno, shift, ploidyF1)
if (shiftParents) {
#also here new shiftdosages is used:
parent.geno[[1]] <- shiftdosages(parent.geno[[1]], shift, ploidyF1)
parent.geno[[2]] <- shiftdosages(parent.geno[[2]], shift, ploidyF1)
ancestors.geno <- shiftdosages(ancestors.geno, shift, ploidyF1)
}
}
}
}
#get the consensus genotype and conflict status of each of the two parents:
par.geno <- c(0, 0) #integer
par.lowconf.geno <- c(0, 0) #integer
par.conflicts <- c(FALSE, FALSE) #logical
par.NAfrac <- c(0.5, 0.5) #floatingpoint
for (parent in 1:2) {
#if (strict.parents) maxNAfrac <- 0.499 else maxNAfrac <- 1
parresult <- getConsensusGeno(geno=parent.geno[[parent]],
maxNAfrac=0.499,
lowconf.NAfrac=0.751)
par.geno[parent] <- parresult$geno
par.lowconf.geno[parent] <- parresult$lowconf.geno
par.conflicts[parent] <- parresult$conflict
par.NAfrac[parent] <- parresult$NAfrac
}
#get the freq. distribution of the F1 samples:
F1.naCount <- sum(is.na(F1.geno))
F1.nobs <- length(F1.geno) - F1.naCount #the number of non-NA F1 samples
#F1.counts <- hist(F1.geno, breaks=seq(-0.5, ploidyF1+0.5, by=1), plot=FALSE)$counts
F1.counts <- tabulate(bin=F1.geno + 1, nbins=ploidyF1 + 1)
#note that F1.counts[1] has the number of F1 samples with geno==0 etc !
bestfit <- NA; bestParentfit <- NA;
q <- rep(NA, length(qnames))
results <- data.frame(segtype=names(seginfo),
fracInvalid=rep(1.0, length(seginfo)),
invalidP=rep(0.0, length(seginfo)),
Pvalue=rep(0.0, length(seginfo)),
matchParents=I(as.character(rep(NA,length(seginfo)))))
if (F1.nobs > 10) {
for (s in 1:length(seginfo)) {
#does the current segtype match the parental genotypes?
results$matchParents[s] <-
getMatchParents(parGeno=par.geno, seginfoItem=seginfo[[s]])
#calculate fraction invalid F1 scores given the current segtype:
exp.geno <- seginfo[[s]]$expgeno #the genotypes that can occur given the segtype
F1.invalid <-
length(F1.geno[!(F1.geno %in% exp.geno)]) - F1.naCount #number of invalid scores
results$fracInvalid[s] <- F1.invalid / F1.nobs
#calculate chi-square P-value given the current segtype and invalidP value:
if (F1.nobs - F1.invalid > 0) {
#at least some F1 samples in expected peaks
#(else Pvalue and invalidP remain 0)
results$invalidP[s] <-
pbinom(q=F1.nobs - F1.invalid, #nr of valid scores
size=F1.nobs, #total nr of scores
prob=1 - seg_invalidrate) #expected fraction of valid scores
if (length(exp.geno) == 1) {
results$Pvalue[s] <- 1.0
} else {
#more than one genotype expected, chisquare test
if (sum(F1.counts[exp.geno + 1]) == 0) {
#all observations are in invalid classes
results$Pvalue[s] <- 0.0
results$invalidP[s] <- 0.0
} else {
suppressWarnings(
results$Pvalue[s] <-
chisq.test(F1.counts[exp.geno+1],
p=seginfo[[s]]$freq[exp.geno+1])$p.value)
#warnings probably generated by class counts < 5;
#these do not worry us
#as those P-values are likely to be very significant anyway
#also the reported warnings are not informative
}
}
}
} # for (s in 1:length(seginfo))
#select best fitting segtype based on fracInvalid and Pvalue
#selcrit: selection criterion
selcrit <- results$invalidP * results$Pvalue
bestfit <- which.max(selcrit) #discards NA's
if (bestfit == 0)
stop(paste("Error in checkF1: bestfit is 0 at marker", mrknames[mrk]))
#select which segtype best fits the parental genotype(s)
ParentFit <- which(results$matchParents %in% c("Yes","OneOK","Unknown"))
bestParentfit <- segtypeBestSelcrit(ParentFit)
if (bestParentfit == 0)
stop(paste("Error in checkF1: bestParentfit is 0 at marker", mrknames[mrk]))
#bestParentfit is 0 if all segtypes have matchParents=="No" (should never occur)
#Can we improve on bestParentfit by using low-confidence parental scores?
lowParentFit <- NA
lowc <- which(!is.na(par.lowconf.geno))
if (length(lowc) > 0) {
if (length(lowc) == 1) {
if (is.na(par.geno[3 - lowc])) {
#one parent missing, other lowconf
low.segtypes <- seginfoSummary$segtypenr[
seginfoSummary[, 2+lowc] == par.lowconf.geno[lowc]]
} else {
#one parent high conf, other lowconf score
low.segtypes <- seginfoSummary$segtypenr[
(seginfoSummary[, 2+lowc] == par.lowconf.geno[lowc]) &
(seginfoSummary[, 5-lowc] == par.geno[3-lowc])]
}
if (length(low.segtypes) > 0) {
lowParentfit <- segtypeBestSelcrit(low.segtypes)
#now we must check if lowParentfit is equal to or not
#too much worse than bestParentfit: if so, we promote the
#par.lowconf.geno to a true geno and lowParentfit to
#bestParentfit
if (compareFit(lowParentfit, bestParentfit)) {
par.geno[lowc] <- par.lowconf.geno[lowc]
bestParentfit <- lowParentfit
results$matchParents[bestParentfit] <-
getMatchParents(parGeno=par.geno,
seginfoItem=seginfo[[bestParentfit]])
}
}
} else {
#both parents have a low-conf score:
#we should check for both scores if they are acceptable
#(i.e. equal to, or not too much worse than bestfit)
#If one of them is ok and not the other, we promote it and its lowParentfit;
#if both pass and their combination is ok we promote both
#and their combined lowParentfit;
#if both pass but the combination does not pass we don't
#promote either and keep bestfit as bestParentfit
#First the combination:
low.segtypes <- seginfoSummary$segtypenr[
(seginfoSummary[, 3] == par.lowconf.geno[1]) &
(seginfoSummary[, 4] == par.lowconf.geno[2])]
lowParentfit <- segtypeBestSelcrit(low.segtypes)
if (compareFit(lowParentfit, bestParentfit)) {
#the combination of both par.lowconf.geno is acceptable
par.geno <- par.lowconf.geno
bestParentfit <- lowParentfit
results$matchParents[bestParentfit] <-
getMatchParents(parGeno=par.geno,
seginfoItem=seginfo[[bestParentfit]])
#always "Yes"?
} else {
#the combination of both par.lowconf.geno is not acceptable,
#check the two parents separately:
lowParentfit <- c(0,0)
for (p in 1:2) {
low.segtypes <- seginfoSummary$segtypenr[
seginfoSummary[, 2+p] == par.lowconf.geno[p]]
lowParentfit[p] <- segtypeBestSelcrit(low.segtypes)
if (!compareFit(lowParentfit[p], bestParentfit))
lowParentfit[p] <- 0
}
p <- which(lowParentfit != 0)
if (length(p) == 1) {
#just one parent gives an acceptable lowParentfit, use that one:
par.geno[p] <- par.lowconf.geno[p]
bestParentfit <- lowParentfit[p]
results$matchParents[bestParentfit] <-
getMatchParents(parGeno=par.geno,
seginfoItem=seginfo[[bestParentfit]])
#always "OneOK"?
}
} #two par.lowconf.geno, the combination was not acceptable
} #two par.lowconf.geno
} #at least one par.lowconf.geno
#Note that we don't attempt here to resolve mismatches between the final
#par.geno and bestParentfit, nor to fill in missing par.geno if the
#bestParentfit segtype would allow only one solution.
#That is done in compareProbes (and therefore also in writeDosagefile).
#A further expansion to generate markers with all possible par.geno
#allowed by the segtype and the known par.geno is done in
#expandUnknownParents
# generate a set of quality parameters of the marker in general and
# the fit of bestParentfit:
q <- calc_qall(Pvalue_threshold, fracInvalid_threshold, fracNA_threshold,
Pvalue=results$Pvalue[bestParentfit],
fracInvalid=results$fracInvalid[bestParentfit],
F1.NAfrac=F1.naCount / length(F1.geno),
matchParents=results$matchParents[bestParentfit],
bestfit=bestfit, bestParentfit=bestParentfit,
par.conflicts=par.conflicts,
par.NAfrac=par.NAfrac,
critweight=critweight,
parentsScoredWithF1=parentsScoredWithF1)
} # F1.nobs>10
#get the output line:
bix <- mrk - minmrk + 1 #index to row in bres
bres$m[bix] <- mrknr
if (shift==0 || !append_shf) bres$MarkerName[bix] <- mrknames[mrk] else
bres$MarkerName[bix] <- paste(mrknames[mrk], "shf", sep="_")
bres[bix, 3:4] <- par.geno
bres[bix, 5:(5+ploidyF1)] <- F1.counts
bres$F1_NA[bix] <- F1.naCount
startcol <- ploidyF1 + 7
if (length(parent1) > 0) {
bres[bix, startcol:(startcol-1+length(parent1))] <- parent.geno[[1]]
startcol <- startcol + length(parent1)
}
if (length(parent2) > 0) {
bres[bix, startcol:(startcol-1+length(parent2))] <- parent.geno[[2]]
startcol <- startcol + length(parent2)
}
if (length(ancestors) > 0) {
bres[bix, startcol:(startcol-1+length(ancestors))] <- ancestors.geno[[1]]
startcol <- startcol + length(ancestors)
}
if (showAll) {
bres[bix, startcol+seq(0, by=3, length.out=length(seginfo))] <-
sprintf("%.4f", results$fracInvalid)
bres[bix, startcol+seq(1, by=3, length.out=length(seginfo))] <-
sprintf("%.4f", results$Pvalue)
bres[bix, startcol+seq(2, by=3, length.out=length(seginfo))] <-
results$matchParents
startcol <- startcol + 3 * length(seginfo)
}
if (is.na(bestfit)) {
bres[bix, startcol:(startcol+3)] <- rep(NA, 4)
} else {
bres[bix, startcol] <- results$segtype[bestfit]
bres[bix, startcol+1] <- sprintf("%.4f", results$fracInvalid[bestfit])
bres[bix, startcol+2] <- sprintf("%.4f", results$Pvalue[bestfit])
bres[bix, startcol+3] <- results$matchParents[bestfit]
}
startcol <- startcol + 4
if (is.na(bestParentfit)) {
bres[bix, startcol:(startcol+3)] <- rep(NA, 4)
} else {
bres[bix, startcol] <- results$segtype[bestParentfit]
bres[bix, startcol+1] <- sprintf("%.4f", results$fracInvalid[bestParentfit])
bres[bix, startcol+2] <- sprintf("%.4f", results$Pvalue[bestParentfit])
bres[bix, startcol+3] <- results$matchParents[bestParentfit]
}
startcol <- startcol + 4
if (is.na(q[1])) {
bres[bix, startcol:(startcol-1+length(q))] <- rep(NA, length(q))
} else {
bres[bix, startcol:(startcol-1+length(q))] <- sprintf("%.4f", q)
}
startcol <- startcol + length(q)
if (is.data.frame(shiftmarkers)) bres[bix, startcol] <- shift
} # for mrk
if (batchnr == 1) {
writeDatfile(bres, file=outfile)
} else {
writeDatfile(bres, file=outfile, append=TRUE, col.names=FALSE)
}
batchnr <- batchnr + 1
} #while batch
output <- readDatfile(outfile)
output <- chk2integer(output)
if (file.del) file.remove(outfile)
invisible(output)
} #checkF1
#'@title Get substrings from the lefthand side
#'@description Get substrings from the lefthand side
#'@usage leftstr(x, n)
#'
#'@param x a character vector (or something having an as.character method)
#'@param n a single number: if n>=0, the leftmost n characters of each element
#'of x are selected, if n<0 the (-n) rightmost characters are omitted
#'@return character vector with leftside substrings of x
#'@export
leftstr <- function(x, n) {
#vectorized for x, not n
#n>=0: take the leftmost n characters
#n<0: take all but the rightmost (-n) characters
if (n >= 0) {
substr(x, 1, n) #automatically converts x to character if needed
} else {
#n < 0
if (!is.character(x)) x <- as.character(x)
len <- nchar(x)
substr(x, 1, len + n)
}
}
#'@title Get substrings from the righthand side
#'@description Get substrings from the righthand side
#'@usage rightstr(x, n)
#'
#'@param x a character vector (or something having an as.character method)
#'@param n a single number: if n>=0, the rightmost n characters of each element
#'of x are selected, if n<0 the (-n) leftmost characters are omitted
#'@return character vector with rightside substrings of x
#'@export
rightstr <- function(x, n) {
#vectorized for x, not n
#n>=0: take the rightmost n characters
#n<0: take all but the leftmost (-n) characters
if (!is.character(x)) x <- as.character(x)
len <- nchar(x)
if (n >= 0) {
start <- len - n + 1
substr(x, start, len)
} else {
#n < 0
substr(x, -n + 1, len)
}
}
chk2integer <- function(chk) {
# chk: a data frame as returned by checkF1
# result: the same data frame, but with all columns from Parent1 to
# last ancestor converted to integer (as these are sometimes factors
# with levels not equal to values after reading checkF1 from file)
#find out the ploidy:
F1cap <- names(chk)[which(names(chk) == "F1_NA") - 1]
ploidyF1 <- as.integer(sub("F1_", "", F1cap))
#find the column range to convert to integers:
firstcol <- which(names(chk) == "parent1")
#determine which column has the last parent or ancestor sample:
#if 3 columns for each segtype are present (showAll TRUE), then
#the first column after the last sample has name frqInvalid_<segtype> (where
#<segtype> depends on the ploidy).
#else (showAll FALSE) the next column has name "bestfit", followed
#by "frqInvalid_bestfit"
lastcol <- which(leftstr(names(chk), 10) == "frqInvalid")[1]
if (rightstr(names(chk)[lastcol], 7) == "bestfit") {
lastcol <- lastcol - 2
} else {
lastcol <- lastcol - 1
}
if (length(firstcol) != 1 || is.na(lastcol) ||
lastcol - firstcol < 3 + ploidyF1) {
stop("chk2integer: column names of chk incorrect")
}
for (i in firstcol:lastcol) {
suppressWarnings(chk[[i]] <- as.integer(as.character(chk[[i]])))
#warnings caused by NAs
}
chk
} #chk2integer
getPargeno <- function(P1consensus, P2consensus,
segtype, matchParents=NULL, seginfo,
allparentscores=FALSE) {
#P1 consensus, P2consensus: the consensus scores of the two parents
#segtype: the name of the selected segtype (selfit or (in the new polyploid
# version) bestParentfit from checkF1;
# should occur in names(seginfo); OR
# the number of the segtype in seginfo
#matchParents: "Yes" if both consensus parental dosages match segtype
# (bestParentfit.matchParents from checkF1())
#seginfo: a list as returned by calcSegtypeInfo(), with the par.dosages
# limited to those fitting the auto, allo and mixed parameters used to
# calculate checkF1
#allparentscores: if FALSE (default) and more than one combination would be
# possible, c(NA, NA) is returned. If TRUE then a matrix with
# one row for each possibility and two columns for the 2 parental
# dosages is returned.
#return value: an integer vector of two elements: the inferred dosages
# of parent 1 and 2; or (if allparentscores is TRUE and there are
# multiple combinations) a 2-column matrix with one row per
# parental combination
# we try to fill in the expected parental dosage
# based on the segtype and the consensus of the observed parental scores:
parent.sc <- c(P1consensus, P2consensus)
if (is.character(segtype)) {
segtype <- which(names(seginfo) == segtype)
if (length(segtype) != 1) stop("internal error in getPargeno")
} #else segtype is already the number of the segtype
exp.par <- seginfo[[segtype]]$pardosage #already selected to match
# auto/allo/mixed
# now get the final parental dosage from consensus and segtype:
if (nrow(exp.par) == 1) {
# both parents must have the same genotype so we can fill these in,
# irrespective of the observed parental genotype and the matchParent status:
parent.sc <- exp.par[1,]
} else {
if (is.null(matchParents)) {
matchParents <- getMatchParents(parGeno=parent.sc,
seginfoItem=seginfo[[segtype]])
}
if (matchParents != "Yes") {
# there is more than one possible parental combination.
# if matchParents=="Yes" we don't need to do anything;
# else (matchParents= Unknown, OneOK or No) we can only fill in the parents
# if one parent matches one combination and the other matches none
pmatch <- rep(NA, 2)
for (p in 1:2) pmatch[p] <- match(parent.sc[p], exp.par[, p])
if (sum(!is.na(pmatch)) == 1) {
#one parent matches one expected parental combination and the other
#doesn't match any
r <- min(pmatch, na.rm=TRUE)
parent.sc <- exp.par[r,]
}
else {
# two possibilities:
# - both parents each match a different row of exp.par
# - none of the parents matches a row of exp.par
# (if both parents would match the same row, matchParents would be Yes)
# In both cases we cannot assign the parental genotypes:
if (allparentscores) {
parent.sc <- exp.par #a 2-column matrix with 2 or more rows
} else parent.sc <- rep(NA, 2)
}
} #matchParents!="Yes"
} #nrow(exp.par) == 1 else
parent.sc
} #getPargeno
trim_shf <- function(mrknames) {
#trim _shf from the marker names:
if (is.factor(mrknames)) mrknames <- as.character(mrknames)
if (is.character(mrknames)) {
shf <- rightstr(mrknames, 4) == "_shf"
mrknames[shf] <- leftstr(mrknames[shf], -4)
}
mrknames
} #trim_shf
makeCombscoresDf <- function(parent1, parent2, F1, ancestors=character(0),
other=character(0), nrow){
#create an empty data frame with the correct size and column names and types
#parent1 and parent2: character vectors with sampleIDs for parents 1 and 2
# (0, 1 or multiple samples per parent allowed, 0 samples are
# specified as character(0))
#F1: character vector with sampleIDs for F1 individuals (one sample per
# F1 individual)
#ancestors: character vector of other ancestor samples listed in the
# checkF1 output; character(0) if none
#other: character vector with sampleIDs of any other samples; character(0) if none
#nrow: the number of rows to create
ncol <- length(parent1) + length(parent2) + length(ancestors) + 2 +
length(F1) + length(other)
scores <- matrix(rep(NA_integer_, nrow * ncol), nrow=nrow)
df <- data.frame(
MarkerName=rep("", nrow),
segtype=rep("", nrow),
scores,
stringsAsFactors=FALSE
)
names(df) <- c("MarkerName", "segtype", parent1, parent2, ancestors,
"parent1", "parent2", F1, other)
df
} #makeCombscoresDf
# writeDosagefile ***********************************************************
#'@title Write a file with segregation types and dosage scores
#'@description Write a file with for each marker the segregation type and the
#'dosage scores of the parental and ancestor samples, the parental consensus
#'dosages and the F1 samples.
#'@usage writeDosagefile(chk, scores, parent1, parent2, F1,
#'ancestors=character(0), other=character(0),
#'polysomic=TRUE, disomic=FALSE, mixed=FALSE,
#'ploidy, ploidy2, shiftParents, scorefile)
#'
#'@param chk data frame as returned by checkF1
#'@param scores data frame as read from the scores file produced by function
#'fitMarkers of package fitPoly (or a subset with at least columns
#'MarkerName, SampleName and geno)
#'@param parent1 character vector with the sample names of parent 1
#'@param parent2 character vector with the sample names of parent 2
#'@param F1 character vector with the sample names of the F1 individuals
#'@param ancestors character vector with the sample names of any other
#'ancestors
#'@param other other samples that should be treated like the F1
#'@param polysomic TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param disomic TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param mixed TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param ploidy the ploidy of parent 1 (must be even, 2 (diploid) or larger)
#'and the same as used by checkF1 to calculate the chk data frame
#'@param ploidy2 the ploidy of parent 2. If omitted it is assumed to be equal
#'to ploidy. Should be the same as used by checkF1 to calculate the chk
#'data frame
#'@param shiftParents if there is a column shift in chk the F1 dosages will be
#'shifted. If shiftParents is TRUE the parents and ancestors will be shifted
#'together with the F1, if FALSE only the F1 will be shifted in that case.\cr
#'If shiftParents is missing
#'or NA it will be set to TRUE except if ploidy2 != ploidy:
#'in that case this will result in an error (because it
#'may be that the parents are not genotyped or scored together with the
#'F1, the user should specify explicitly what to do)
#'@param scorefile filename for tab-separated text file with the dosages,;
#'if NA no file is written. For details of the contents see the return value
#'
#'@return A data frame with columns:
#'\itemize{
#'\item{MarkerName: the name of the marker}
#'\item{segtype: the segregation type}
#'\item{parental and ancestor samples: the dosages of those samples}
#'\item{parent1: the consensus dosage for parent1 as determined by checkF1}
#'\item{parent2: the consensus dosage for parent2 as determined by checkF1}
#'\item{F1 samples: the dosages for those samples}
#'\item{other samples: the dosages for those samples}
#'}
#'@export
writeDosagefile <- function(chk,
scores,
parent1,
parent2,
F1,
ancestors=character(0),
other=character(0),
polysomic=TRUE, disomic=FALSE, mixed=FALSE,
ploidy, ploidy2,
shiftParents,
scorefile) {
#writeScoresfile uses compareProbes to just write the scores and segtypes
#in the same format as compareProbes. It treats each marker separately
#without comparing or combining two probes for the same SNP, but it takes
#into account any shifts (from column shift in chk, if present).
#It does so by calling compareProbes with probe.suffix=NA.
if (missing(ploidy2)) ploidy2 <- ploidy
if (missing(shiftParents)) shiftParents <- NA
result <- compareProbes(
chk=chk,
scores=scores,
probe.suffix=NA,
fracdiff.threshold=0.04,
parent1=parent1,
parent2=parent2,
F1=F1,
ancestors=ancestors,
other=other,
polysomic=polysomic, disomic=disomic, mixed=mixed,
ploidy=ploidy, ploidy2=ploidy2,
shiftParents=shiftParents,
qall_flavor="qall_mult",
compfile=NA, combscorefile=scorefile)
invisible(result$combscores)
} #writeDosagefile
# compareProbes ***************************************************************
#'@title Compare and combine results from two probes for the same SNP
#'@description On Affymetrix Axiom arrays it is possible to have two probes
#'interrogating the same SNP position. This function compares the dosage scores
#'and checkF1 results of the two probes; if they are sufficiently similar
#'a new marker is generated combining the results of the two probes. A dosage
#'file with the data for the separate probes as well as the combined markers
#'is written with the same format as writeDosagefile, and also a file
#'summarizing the comparison results.
#'
#'@usage compareProbes(chk, scores,
#'probe.suffix=c("P","Q","R"), fracdiff.threshold=0.04,
#'parent1, parent2, F1, ancestors=character(0), other=character(0),
#'polysomic=TRUE, disomic=FALSE, mixed=FALSE,
#'ploidy, ploidy2, qall_flavor="qall_mult", shiftParents,
#'compfile, combscorefile)
#'
#'@param chk data frame as returned by checkF1, or a subset with at least
#'columns markername, parent1, parent2 (the consensus parental genotypes),
#'the columns for the samples specified by parameters parent1, parent2 and
#'ancestors, and bestParentfit, and containing only rows with selected markers.
#'If a column with a name as specified by qall_flavor (see below) is present
#'this will be written to file compfile, but it is not used: any selection of
#'marker based on qall (or other) must have been made beforehand, and the
#'rows for the unwanted markers must have been deleted from the chk data frame.\cr
#'For each marker*probe combination there may be an unshifted version
#'(shift==0), a shifted one (shift!=0), both, or neither.
#'\cr
#'If a column shift is present it will be used to shift the dosages
#'(and their P-values with them).
#'If some markernames end in "_shf" this part will be ignored, but the
#'P and Q suffixes (or alternatives as specified by probe.suffix) are
#'required to distinguish the two probes.
#'@param scores data frame as read from the scores file produced by function
#'fitMarkers of package fitPoly, with at least columns MarkerName,
#'SampleName, P0 .. P<ploidyF1> and geno (where <ploidyF1> is the ploidy of the
#'F1, i.e. the average of parental ploidy and ploidy2).\cr
#'If the F1 parents are scored separately, their rows should be added to the
#'scores data.frame for the F1 samples. If their ploidy is different from the
#'F1, the number of their P columns must be adjusted. The P data of the parents
#'are not used, they may all be set to NA.
#'@param probe.suffix a 3-item character vector specifying the suffixes of the
#'marker names that distinguish the two probes. The first two items identify
#'the two probes; the third item is used to indicate a new marker combining
#'the data from both probes. The three items must be different and have the
#'same number of characters default is c("P","Q","R")
#'@param fracdiff.threshold if more than this fraction of F1 scores differs
#'between probes, don't combine
#'@param parent1 character vector with the sample names of parent 1
#'@param parent2 character vector with the sample names of parent 2
#'@param F1 character vector with the sample names of the F1 individuals
#'@param ancestors character vector with the sample names of any other
#'ancestors
#'@param other other samples that should be treated like the F1
#'@param polysomic TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param disomic TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param mixed TRUE or FALSE; should be the same as used by checkF1
#'to calculate the chk data frame
#'@param ploidy the ploidy of parent 1 (must be even, 2 (diploid) or larger),
#'and the same as used by checkF1 to calculate the chk data frame
#'@param ploidy2 the ploidy of parent 2. If omitted it is assumed to be equal
#'to ploidy. Should be the same as used by checkF1 to calculate the chk
#'data frame
#'@param qall_flavor which quality parameter column must be shown in compfile,
#'default "qall_mult". If no quality data are wanted, specify "".
#'@param shiftParents if there is a column shift in chk the F1 dosages will be
#'shifted. If shiftParents is TRUE the parents and ancestors will be shifted
#'together with the F1, if FALSE only the F1 will be shifted in that case.\cr
#'If shiftParents is missing
#'or NA it will be set to TRUE except if ploidy2 != ploidy:
#'in that case this will result in an error (because it
#'may be that the parents are not genotyped or scored together with the
#'F1, the user should specify explicitly what to do)
#'@param compfile filename for tab-separated text file summarizing the
#'comparison results; if NA no file is written. For details of the contents see
#'the return value, component compstat
#'@param combscorefile filename for tab-separated text file with the dosages;
#'if NA no file is written. For details of the contents see the return value,
#'component combscores
#'
#'@details A combined marker is made in each case that a version of each of the
#'two probe markers is present and they are sufficiently similar. This means
#'that they have been assigned the same bestParentfit segregation type by
#'checkF1, and that the frequency of conflicting scores over all samples is
#'not more than fracdiff.threshold. The combined marker will have NA scores for
#'individuals where both probe markers are missing, the one available score if
#'it is scored for only one of the two probe markers or both scores are equal,
#'and the score with the highest P-value if the scores for both probe markers
#'are unequal.\cr
#'Any single-probe markers in chk that do not have a bestParentfit segregation
#'type are ignored and will not affect or appear in the output.
#'
#'@return A list with two components, compstat and combscores.\cr
#'compstat is a data frame with columns:
#'\itemize{
#'\item{MarkerName: name of the SNP marker. If a column shift is present in
#'data.frame chk, unshifted and shifted markers will get a "n" or "s" suffixed
#'to the MarkerName}
#'\item{segtypeP and segtypeQ: the segtype assigned by checkF1 to the first and
#'second probe}
#'\item{qallP and qallQ: the quality scores specified by parameter qall_flavor,
#'assigned by checkF1 to the two probes}
#'\item{countP and countQ: the number of versions of each of the probes
#'(0, 1, or 2, depending on whether a shifted, unshifted or both versions were
#'present)}
#'\item{countR: the number of combinations made of versions of the two probe
#'markers (one for each combination of a version of each of the two probe
#'markers, if they match well enough - see details)}
#'}
#'If the chk data frame contains a column shift, there are separate columns for
#'the non-shifted and shifted P and Q probe markers (suffix Pn, Ps, Qn, Qs),
#'and four columns for the R markers (suffix Rnn, Rns, Rsn, Rss where the first
#'n/s indicates if the P was non-shifted or shifted and the second n/s for the
#'Q probe.
#'combscores is a data frame with columns:
#'\itemize{
#'\item{MarkerName: the name of the marker. If the chk data frame contains a
#'column shift, the P and Q marker names are suffixed with n or s, and the R
#'marker names with nn, ns, sn, ss as described above}
#'\item{segtype: the segregation type}
#'\item{parental and ancestor samples: the dosages of those samples}
#'\item{parent1: the consensus dosage for parent1 as determined by checkF1}
#'\item{parent2: the consensus dosage for parent2 as determined by checkF1}
#'\item{F1 samples: the dosages for those samples}
#'\item{other samples: the dosages for those samples}
#'}
#'@export
compareProbes <- function(chk,
scores,
probe.suffix=c("P","Q","R"),
fracdiff.threshold=0.04,
parent1,
parent2,
F1,
ancestors=character(0),
other=character(0),
polysomic=TRUE, disomic=FALSE, mixed=FALSE,
ploidy, ploidy2,
qall_flavor="qall_mult",
shiftParents,
compfile, combscorefile) {
#If probe.suffix is not NA but a set of 3 strings,
#this function compares the results of the two probes for the same SNP (from
#an Axiom array).
#If probe.suffix is NA, this function only writes the scores from the
#scores data frame and the segtypes from chk to combscorefile.
#This option is used by function writeScoresfile, to avoid duplication a
#lot of code and to obtain a scores file of the same format.
#first some checks:
noprobes <- length(probe.suffix) == 1 && is.na(probe.suffix)
if (noprobes) funcname <- "writeScoresfile" else funcname <- "compareProbes"
if (!noprobes &&
(length(probe.suffix) != 3 || sum(is.na(probe.suffix)) > 0 ||
max(nchar(probe.suffix)) != min(nchar(probe.suffix)) ||
sum(abs(match(probe.suffix,probe.suffix) - c(1,2,3))) > 0))
stop("compareProbes: probe.suffix invalid")
if (ploidy %% 2 != 0)
stop(paste(funcname, ": odd parental ploidy not allowed", sep=""))
if (missing(ploidy2) || is.na(ploidy2)) ploidy2 <- ploidy else
if (ploidy2 %% 2 != 0)
stop(paste(funcname, ": odd parental ploidy2 not allowed", sep=""))
ploidyF1 <- (ploidy + ploidy2) / 2
segtypecol <- which(tolower(names(chk)) == "segtype")
if (length(segtypecol) != 1)
segtypecol <- which(tolower(names(chk)) == "selfit")
if (length(segtypecol) != 1)
segtypecol <- which(tolower(names(chk)) == "bestparentfit")
if (length(segtypecol) != 1)
stop(paste(funcname,
": chk should have one column segtype or bestParentfit",
sep=""))
if (!is.character(chk[, segtypecol]) )
chk[, segtypecol] <- as.character(chk[, segtypecol])
if (!is.character(chk$MarkerName))
chk$MarkerName <- as.character(chk$MarkerName)
#delete all rows without a segtype (mostly those that are not fitted by fitPoly):
chk <- chk[!is.na(chk[, segtypecol]) & chk[, segtypecol] != "",]
qallcol <- which(tolower(names(chk)) == qall_flavor)
if (length(qallcol) > 1) qallcol <- qallcol[1]
shiftspresent <- FALSE
shiftcol <- which(tolower(names(chk)) == "shift")
if (length(shiftcol) > 1) {#none is also allowed
stop(paste(funcname, ": chk should have at most one column 'shift'"), sep="")
} else if (length(shiftcol) == 0) {
chk$shift <- rep(0, nrow(chk))
shiftcol <- which(tolower(names(chk)) == "shift")
shiftParents <- TRUE #if no shifts, shiftParents doesn't make a difference
} else {
#one existing column shift
if (missing(shiftParents) || is.na(shiftParents)) {
if (ploidy == ploidy2) shiftParents <- TRUE else
stop(paste(funcname,
": if ploidy2 != ploidy, shiftParents must be specified (FALSE/TRUE)",
sep=""))
#note that this is not required if there is no existing column shift
}
if (anyNA(chk[, shiftcol]))
stop(paste(funcname,
": column 'shift' in chk may not contain missing values",
sep=""))
shiftspresent <- TRUE
}
#check if all samples are present in scores and if their numbers match with chk:
progenitors <- c(parent1, parent2, ancestors)
progenitors <- progenitors[!is.na(progenitors)]
if (length(progenitors) == 0) shiftParents <- FALSE # no progenitors to shift
allsamp <- c(progenitors, F1, other)
if (shiftParents) shfsamp <- rep(TRUE, length(allsamp)) else
shfsamp <- c(rep(FALSE, length(progenitors)),
rep(TRUE, length(c(F1, other))))
missamp <- allsamp[!(allsamp %in% unique(as.character(scores$SampleName)))]
if (length(missamp) > 0)
stop(paste(funcname, ": some samples not in scores:",
paste(missamp, collapse=" ")))
missamp <- progenitors[!(progenitors %in% names(chk))]
if (length(missamp) > 0)
stop(paste(funcname, ": some parent or ancestor samples not in chk:",
paste(missamp, collapse=" ")))
if (length(F1) != sum(chk[1, 5:(6+ploidyF1)]))
stop(paste(funcname, ": F1 has", length(F1), "samples but in chk",
sum(chk[1, 5:(6+ploidyF1)]), "values counted"))
chk$SNPname <- trim_shf(chk$MarkerName)
if (anyNA(match(chk$SNPname, unique(as.character(scores$MarkerName)))))
stop(paste(funcname, ": not all markers from chk appear in scores", sep=""))
if (noprobes) {
#writeDosagefile
chk$probe <- rep("", nrow(chk))
} else {
mnl <- nchar(chk$SNPname)
chk$probe <- substr(chk$SNPname, mnl - nchar(probe.suffix[1]) + 1, mnl) #P or Q
chk$SNPname <- substr(chk$SNPname, 1, mnl - nchar(probe.suffix[1])) #without P or Q
}
#now chk has some additional columns:
#chk$shift: if it was not present, now has all 0's - the amount each marker is shifted
#chk$probe: the probe suffix (P or Q by default)
#probchk$SNPname: the original SNPname, without probe suffix or _shf
#Note that the _shf suffix is not used, the shift value (0 vs not 0)
#determines the shift status
if (!noprobes && sum(!(chk$probe %in% probe.suffix[1:2])) > 0)
stop("compareProbes: not all marker names have a specified probe suffix")
if (!noprobes && sum(nchar(chk$SNPname) == 0) > 0)
stop("compareProbes: not all marker names have a specified probe suffix")
#make sure there is no locked file that will prevent writing the output:
if (!is.na(compfile) && !checkFilename(compfile))
stop(paste(funcname,": compfile not valid or not writable", sep=""))
if (!is.na(combscorefile) && !checkFilename(combscorefile))
stop(paste(funcname, ": combscorefile not valid or not writable", sep=""))
if (is.factor(chk$parent1)) chk$parent1 <- as.integer(as.character(chk$parent1))
if (is.factor(chk$parent2)) chk$parent2 <- as.integer(as.character(chk$parent2))
if (is.factor(chk[,shiftcol]))
chk[,shiftcol] <- as.integer(as.character(chk[,shiftcol]))
if (is.factor(parent1)) parent1 <- as.character(parent1)
if (is.factor(parent2)) parent2 <- as.character(parent2)
if (is.factor(F1)) F1 <- as.character(F1)
if (is.factor(ancestors)) ancestors <- as.character(ancestors)
if (is.factor(other)) other <- as.character(other)
snpnames <- sort(unique(chk$SNPname)) #SNPname is character
#note that the two probes don't have the same set of SNPs, as some SNPs
#may be rejected by fitPoly for one probe but not for the other,
#and, if some markers are shifted, again this will not be the case for all.
#we now make some arrays that for all SNPs
#contain some data for each probe/shift combination, or NA where a probe was
#rejected or the shifted or nonshifted situation did not apply
segtype <- array(NA, dim=c(length(snpnames), 2, 2),
dimnames=list(
snpname=snpnames,
probe=c("probe1","probe2"),
shift=c("false","true")))
chkshift <- segtype #for each marker/probe/shift the actual amount of shift
qall <- segtype #for each marker/probe/shift the qall of the selected segtype (if any)
cons.parent1 <- segtype #for each marker/probe/shift the consensus parent1 score (if any)
cons.parent2 <- segtype #for each marker/probe/shift the consensus parent2 score (if any)
Rsegtype <- segtype #for each comb of probe1 shifted/nonshifted and probe2 shifted/nonshifted
# the combined segtype (R=P&Q) or NA
#we fill 5 of these 6 3D arrays with data from chk, except (for the moment) Rsegtype:
for (sh in 1:2) {
if (sh == 1) sel.sh <- chk$shift == 0 else
sel.sh <- chk$shift != 0
if (noprobes)proberange <- 1 else proberange <- 1:2
for (prb in proberange) {
if (noprobes) {
prbdat <- chk[sel.sh, ]
} else {
prbdat <- chk[chk$probe == probe.suffix[prb] & sel.sh, ]
}
rows <- match(prbdat$SNPname, snpnames)
segtype[rows, prb, sh] <- prbdat[, segtypecol]
chkshift[rows, prb, sh] <- prbdat[, shiftcol]
if (length(qallcol) == 1) qall[rows, prb, sh] <- prbdat[, qallcol]
cons.parent1[rows, prb, sh] <- prbdat$parent1
cons.parent2[rows, prb, sh] <- prbdat$parent2
}
} #for sh
seginfo <- selSegtypeInfo(calcSegtypeInfo(ploidy, ploidy2),
polysomic, disomic, mixed,
selfing=checkSelfing(parent1, parent2))
if (sum(is.na(match(unique(chk[, segtypecol]),names(seginfo)))) > 0)
stop("compareProbes: chk contains segtypes not expected with the settings\nof polysomic/disomic/mixed/ploidy/ploidy2")
scores <- scores[scores$SampleName %in% allsamp,]
#create an initial empty data frame to hold the combined scores;
#this will be increased if needed
combscores <- makeCombscoresDf(parent1=parent1, parent2=parent2, F1=F1,
ancestors=ancestors, other=other,
nrow=nrow(chk))
combrow <- 0
#subsetting from a big scores data.frame takes a lot of time
#(approx 2 sec for each marker, in a file of ~26000 markers and 480 samples)
#Therefore we subdivide it in batches of 200 SNPs
batchsize <- 200
batchcount <- (length(snpnames)-1) %/% batchsize + 1
batchnr <- 1
batchscores <- list()
while (batchsize*(batchnr-1) < length(snpnames)) {
cat(paste("batch", batchnr, "of", batchcount, "\n"))
minmrk <- batchsize * (batchnr-1) + 1
maxmrk <- min(length(snpnames), batchsize*batchnr)
batchmarkers <- snpnames[minmrk:maxmrk]
if (!noprobes) {
batchmarkers <- c(paste(batchmarkers, probe.suffix[1], sep=""),
paste(batchmarkers, probe.suffix[2], sep=""))
#note that for all snpnames both a P and a Q markername are created,
#even if only one of the two in scores
}
batchscores <- scores[scores$MarkerName %in% batchmarkers, ]
for (mrk in minmrk:maxmrk) {
#for the current marker, list sc will hold the scores:
#sc[[1]] and sc[[2]] are lists for probes P and Q (and sc[[3]] for R?)
sc <- list()
#the following 3D arrays are similar to the previously defined 6 3D arrays
#except that here the 1st dim is samples while earlier it was markers
progen.sc <- array(NA, dim=c(length(progenitors), 2, 2), #all progenitor samples scores
dimnames=list(
sample=progenitors,
probe=c("probe1","probe2"),
shift=c("FALSE","TRUE")))
F1.sc <- array(NA, dim=c(length(F1)+length(other), 2, 2), #all F1 + other samples scores
dimnames=list(
sample=c(F1, other),
probe=c("probe1","probe2"),
shift=c("FALSE","TRUE")))
parent.sc <- array(NA, dim=c(2, 2, 2), #both consensus parental scores
dimnames=list(
sample=c("parent1", "parent2"),
probe=c("probe1","probe2"),
shift=c("FALSE","TRUE")))
P0col <- which(names(scores) == "P0")
if (length(P0col) != 1)
stop("compareProbes: scores should contain columns P0 .. P<ploidy>")
for (prb in 1:2) if (any(!is.na(segtype[mrk, prb,]))) {
#unshifted and/or shifted version present for this marker/probe
if (noprobes) {
mrkname_scores <- snpnames[mrk]
} else {
mrkname_scores <- paste(snpnames[mrk], probe.suffix[prb], sep="")
}
sc[[prb]] <- list()
#sc[[prb]][[1]] gets the unshifted score data for the population samples
# i.e. all columns from the fitPoly output including
# P0..P<ploidyF1> and geno,
# and all samples in parent1, parent2, ancestors, F1, other
sc[[prb]][[1]] <- batchscores[batchscores$MarkerName == mrkname_scores &
batchscores$SampleName %in% allsamp, ]
sc[[prb]][[1]] <-
sc[[prb]][[1]][match(allsamp, sc[[prb]][[1]]$SampleName),] #sorted
#if this probe is shifted, sc[[prb]][[2]] gets the shifted score data:
if (!is.na(segtype[mrk, prb, 2])) {
sc[[prb]][[2]] <- sc[[prb]][[1]]
#shift the geno values:
sv <- chkshift[mrk, prb, 2] #shift value, positive or negative, not 0
# The shift in the next lines must not be applied to progenitors
# if shiftParents is FALSE; this is done with shfsamp
sc[[prb]][[2]]$geno[shfsamp] <- sc[[prb]][[2]]$geno[shfsamp] + sv
#old: geno shifted out of 0.. ploidyF1 made missing:
#inval <- shfsamp & !(sc[[prb]][[2]]$geno %in% 0:ploidyF1)
#sc[[prb]][[2]]$geno[inval] <- NA
#new: geno shifted out of 0..ploidyF1 made equal to 0 or ploidyF1:
inval <- shfsamp & sc[[prb]][[2]]$geno < 0
sc[[prb]][[2]]$geno[inval] <- 0
inval <- shfsamp & sc[[prb]][[2]]$geno > ploidyF1
sc[[prb]][[2]]$geno[inval] <- ploidyF1
#shift the P-value columns (only to adjust the Pmax)
if (sv > 0) {
#old: discard the P's of classes shifted beyond ploidyF1
#sc[[prb]][[2]][shfsamp, P0col+(sv:ploidyF1)] <-
# sc[[prb]][[2]][shfsamp, P0col+(0:(ploidyF1-sv))]
#new: sum all P's for classes that are merged into dosage==ploidy:
sc[[prb]][[2]][shfsamp, P0col+ploidyF1] <-
sum(sc[[prb]][[2]][shfsamp, P0col+((ploidyF1-sv):ploidyF1)])
#and shift all other existing P values:
if (sv < ploidy)
sc[[prb]][[2]][shfsamp, P0col+(sv:(ploidyF1-1))] <-
sc[[prb]][[2]][shfsamp, P0col+(0:(ploidyF1-1-sv))]
#and set the P's for classes that are new to 0:
sc[[prb]][[2]][shfsamp ,P0col+(0:(sv-1))] <- 0
} else {
#chkshift < 0
#old: discard the P's of classes shifted beyond 0
#sc[[prb]][[2]][shfsamp, P0col+(0:(ploidyF1+sv))] <-
# sc[[prb]][[2]][shfsamp, P0col+((-sv):ploidyF1)]
#new: sum all P's for classes that are merged into dosage==0:
sc[[prb]][[2]][shfsamp, P0col] <-
sum(sc[[prb]][[2]][shfsamp, P0col+(0:(-sv))])
#and shift all other existing P values:
if ((-sv) < ploidyF1)
sc[[prb]][[2]][shfsamp, P0col+(1:(ploidyF1+sv))] <-
sc[[prb]][[2]][shfsamp, P0col+((1-sv):ploidyF1)]
#and set the P's for classes that are new to 0:
sc[[prb]][[2]][shfsamp, P0col+(ploidyF1+sv+1):ploidyF1] <- 0 #NA
}
suppressWarnings(
#the warning occur when a sample is not genotyped for this marker
#(its ratio was set to NA before the fitPoly run,
#perhaps based on a too low signal intensity).
#this results in sc[[prb]][[2]]$maxP being assigned -Inf;
#that is not a problem later on
sc[[prb]][[2]]$maxP <-
apply(sc[[prb]][[2]][P0col+(0:ploidyF1)], 1, max, na.rm=TRUE)
)
}
#Now we have a list sc of data frames for the scores per probe (P,Q=1,2),
#without (1) or with (2) shift.
#Note that only the P-value columns and the geno value are shifted,
#not the maxgeno; and maxP is the max of the remaining P-values
#next we extract geno values for subsets of samples
#and write the markers:
for (sh in 1:2) {
if (!is.na(segtype[mrk, prb, sh])) {
mrkname_out <- mrkname_scores
if (shiftspresent) mrkname_out <-
paste(mrkname_out, c("n", "s")[sh], sep="")
#progen.sc[, prb, sh] <-
# sc[[prb]][[sh]]$geno[sc[[prb]][[sh]]$SampleName %in% progenitors]
#adapted, as (for triploids) parents are not always scored
progensc <- sc[[prb]][[sh]][sc[[prb]][[sh]]$SampleName %in% progenitors,]
progen.sc[match(progensc$SampleName,progenitors), prb, sh] <-
progensc$geno
F1.sc[, prb, sh] <-
sc[[prb]][[sh]]$geno[sc[[prb]][[sh]]$SampleName %in% c(F1, other)]
parent.sc[, prb, sh] <- getPargeno(cons.parent1[mrk, prb, sh],
cons.parent2[mrk, prb, sh],
segtype=segtype[mrk, prb, sh],
seginfo=seginfo)
#now we have all data for this mrk / prb / shift combination,
#we write it to the combscores data frame:
combrow <- combrow+1
if (combrow > nrow(combscores))
combscores <- rbind(combscores,
makeCombscoresDf(parent1=parent1, parent2=parent2,
F1=F1, ancestors=ancestors,
other=other, nrow=5000)
)
combscores$MarkerName[combrow] <- mrkname_out
combscores$segtype[combrow] <- segtype[mrk, prb, sh]
combscores[combrow, 3:length(combscores)] <-
c(progen.sc[, prb, sh],
parent.sc[, prb, sh],
F1.sc[, prb, sh])
}
} #for sh
} #for prb if !is.na(segtype[mrk, prb])
#now we have written the scores per probe/sh to combscorefile for current marker mrk
#finally we check for which markers we can produce combined scores for both probes:
if (!noprobes) {
numcomb <- 0
for (sh1 in 1:2) for (sh2 in 1:2)
if (!is.na(segtype[mrk, 1, sh1]) && !is.na(segtype[mrk, 2, sh2]) &&
segtype[mrk, 1, sh1] == segtype[mrk, 2, sh2]) {
bothscored <- sum(!is.na(sc[[1]][[sh1]]$geno) &
!is.na(sc[[2]][[sh2]]$geno))
diff <- sc[[1]][[sh1]]$geno != sc[[2]][[sh2]]$geno
Ndifferent <- sum(diff, na.rm=TRUE)
fracdiff <- Ndifferent / bothscored
#Note: this takes into account all samples if shiftParents TRUE,
#and only F1 + other (i.e. non-progenitor) samples if shiftParents FALSE
if (!is.na(fracdiff) && fracdiff <= fracdiff.threshold) {
#combine the scores of both probes for this marker
Rsegtype[mrk, sh1, sh2] <- segtype[mrk, 1, sh1]
numcomb <- numcomb + 1 # it is possible that more than one combination
# of shifts for the same SNP is possible
#first fill in missing scores from sc[[2]]:
combgeno <- sc[[1]][[sh1]]$geno
w <- is.na(sc[[1]][[sh1]]$geno) & !is.na(sc[[2]][[sh2]]$geno)
combgeno[w] <- sc[[2]][[sh2]]$geno[w]
#next, for the different scores take the one with highest maxP:
w <- !is.na(sc[[1]][[sh1]]$geno) &
!is.na(sc[[2]][[sh2]]$geno) &
sc[[2]][[sh2]]$geno != sc[[1]][[sh1]]$geno &
sc[[2]][[sh2]]$maxP > sc[[1]][[sh1]]$maxP
combgeno[w] <- sc[[2]][[sh2]]$geno[w]
#combine the parental consensus scores:
parent.na <- matrix(NA, nrow=2, ncol=2)
parent.na[1,] <- is.na(parent.sc[, 1, sh1])
parent.na[2,] <- is.na(parent.sc[, 2, sh2])
#parent.na has rows for probe 1 and 2, and columns for
#parent 1 and 2, for the current set of sh1 and sh2
combparent <- parent.sc[, 1, sh1] #both parental scores of probe 1
#missing in probe 1 filled in from probe 2:
w <- parent.na[1,] & !parent.na[2,]
combparent[w] <- parent.sc[w, 2, sh2]
#different in the two probes made missing (in contrast to sample
#scores, for the consensus scores we don't have P-values to
#select one):
w <- !parent.na[1,] & !parent.na[2,] &
parent.sc[, 1, sh1] != parent.sc[, 2, sh2]
combparent[w] <- NA
#possible problem left: both probes have one (a different) parent
#genotyped; in each is matchParent=OneOK, but the combination does
#not match; then set both parents to NA:
if (xor(parent.na[1,1], parent.na[1,2]) &&
xor(parent.na[2,1], parent.na[2,2]) &&
xor(parent.na[1,1], parent.na[2,1]) &&
getMatchParents(combparent, seginfo[[segtype[mrk, prb]]]) == "No") {
#note that one can index a list (or vector) using the element names!
combparent <- c(NA, NA)
}
if (shiftspresent) {
mrkname_cmb <- paste(snpnames[mrk], "R",
paste(c("n","s")[c(sh1, sh2)], collapse=""),
sep="")
} else {
mrkname_cmb <- paste(snpnames[mrk], "R", sep="")
}
combrow <- combrow+1
if (combrow > nrow(combscores))
combscores <-
rbind(combscores,
makeCombscoresDf(parent1=parent1, parent2=parent2,
F1=F1, ancestors=ancestors,
other=other, nrow=5000)
)
combscores$MarkerName[combrow] <- mrkname_cmb
combscores$segtype[combrow] <- segtype[mrk, 1, sh1]
combscores[combrow, 3:length(combscores)] <-
c(combgeno[1:length(progenitors)],
combparent,
combgeno[(length(progenitors)+1):length(combgeno)])
} #if fracdiff
} #if segtypesame
} #if !noprobes
} #for mrk
batchnr <- batchnr + 1
} #while batch probe.suffix[prb]
if (combrow==0) combscores <- combscores[0,] else
combscores <- combscores[1:combrow,]
if (!is.na(combscorefile))
writeDatfile(combscores, combscorefile)
#finally we write an overview file:
if (noprobes) {
compstat <- NA
} else {
compstat <- data.frame(SNPname=snpnames)
sufx <- matrix(c(paste(probe.suffix[1], "n", sep=""),
paste(probe.suffix[1], "s", sep=""),
paste(probe.suffix[2], "n", sep=""),
paste(probe.suffix[2], "s", sep="")), 2, 2, byrow=TRUE)
max_sh <- 2
Rsufx <- matrix(c("nn", "ns", "sn", "ss"), 2, 2, byrow=TRUE)
if (!shiftspresent) {
max_sh <- 1
sufx <- substr(sufx, 1, 1)
Rsufx <- matrix("")
}
for (prb in 1:2) for (sh in 1:max_sh) {
compstat[[paste("segtype", sufx[prb, sh], sep="")]] <- segtype[, prb, sh]
if (length(qallcol) == 1) {
compstat[[paste("qall", sufx[prb, sh], sep="")]] <- qall[, prb, sh]
}
}
for (sh1 in 1:max_sh) for (sh2 in 1:max_sh) {
compstat[[paste("segtype", probe.suffix[3], Rsufx[sh1, sh2], sep="")]] <-
Rsegtype[, sh1, sh2]
}
for (prb in 1:2) {
compstat[[paste("count", probe.suffix[prb], sep="")]] <-
rowSums(!is.na(segtype[, prb,]))
}
compstat[[paste("count", probe.suffix[3], sep="")]] <-
apply(!is.na(Rsegtype), 1, sum)
rownames(compstat) <- NULL
if (!is.na(compfile))
writeDatfile(compstat, compfile)
} #if !noprobes
invisible(list(combscores=combscores, compstat=compstat))
} # compareProbes
# removeRedundant ***************************************************************
#'@title Remove redundant single-probe markers
#'@description If for a SNP both probes produced very similar results, function
#'compareProbes generated an extra marker combining the results from both
#'probes. In that case the original single-probe markers are redundant and this
#'function removes those data from the compstat and combscores data frames.
#'
#'@usage removeRedundant(compstat, combscores, compfile, combscorefile)
#'
#'@param compstat a data.frame as returned by compareProbes in the
#'compstat item of the return value and in the compfile
#'@param combscores a data.frame as returned by compareProbes in the
#'@param compfile a filename to which the compstat part of the results is
#'written; if NA no file is written
#'@param combscorefile a filename to which the compscores part of the results is
#'written; if NA no file is written
#'
#'@return A list with two components, compstat and combscores. These are identical
#'to the parameters compstat and combscores, with the redundant single-probe
#'markers removed.For their contents see function compareProbes.
#'@export
removeRedundant <- function(compstat, combscores, compfile, combscorefile) {
#make sure there is no locked file that will prevent writing the output:
if (!is.na(compfile) && !checkFilename(compfile))
stop("removeRedundant: compfile not valid or not writable")
if (!is.na(combscorefile) && !checkFilename(combscorefile))
stop("removeRedundant: combscorefile not valid or not writable")
#get the probe.suffix letters:
sw <- which(substr(names(compstat), 1, 7) == "segtype") #column numbers in compstat
#make sure that empty segtype cells are NA:
for (i in sw) {
compstat[,i] <- as.character(compstat[,i])
compstat[!is.na(compstat[,i]) & compstat[,i] == "", i] <- NA
compstat[,i] <- factor(compstat[,i])
}
qw <- which(substr(names(compstat), 1, 4) == "qall")
#note that qall columns may or may not be present
sn <- names(compstat)[sw] #segtypePn ... segtypeRss, or segtypeP, segtypeQ, segtypeR
shiftspresent <- length(sn) == 8 #2P, 2Q, 4R; else length(sn)==3: P, Q, R
if (shiftspresent) {
shift.suffix <- c("n", "s")
probe.suffix <- c(substr(sn[1], 8, nchar(sn[1])-1), #P
substr(sn[3], 8, nchar(sn[3])-1), #Q
substr(sn[5], 8, nchar(sn[5])-2)) #R
scol <- matrix(sw[1:6], 3, 2, byrow=TRUE) #probe(P/Q/R) in row, shift in col,
# cell value is column nr in compstat
#Note: for R there are 4 columns (nn/ns/sn/ss), here only 2 are listed,
#but actually only the entry for the first R column is used
row.names(scol) <- probe.suffix
colnames(scol) <- shift.suffix
if (length(qw) > 0) qcol <- matrix(qw, 2, 2, byrow=TRUE)
max_sh <- 2 #shift 1: not shifted; shift 2: shifted
} else {
shift.suffix <- ""
probe.suffix <- c(substr(sn[1], 8, nchar(sn[1])), #P
substr(sn[2], 8, nchar(sn[2])), #Q
substr(sn[3], 8, nchar(sn[3]))) #R
scol <- matrix(sw, 3, 1) #col nrs of P/Q/R in compstat
row.names(scol) <- probe.suffix
if (length(qw) > 0) qcol <- matrix(qw, 2, 1, byrow=TRUE)
max_sh <- 1
}
omitmrk <- character(0)
for (sh1 in 1:max_sh) for (sh2 in 1:max_sh) {
combsnp <- !is.na(compstat[, scol[3, 1] + 2*(sh1-1) + (sh2-1)]) # column of
# the R markers for this combination of P and Q shifts
#for these markers, both the P and Q (with the same shift) are present and
#must be omitted:
if (sh1 != sh2) {
#if one probe was shifted and the other not, the current combination is
#the only possibly correct one and we omit also the other-shifted
#versions of the single probe markers:
omitmrk <- c(omitmrk,
paste(compstat$SNPname[combsnp],
probe.suffix[1], shift.suffix[sh1], sep=""),
paste(compstat$SNPname[combsnp],
probe.suffix[1], shift.suffix[sh2], sep=""),
paste(compstat$SNPname[combsnp],
probe.suffix[2], shift.suffix[sh1], sep=""),
paste(compstat$SNPname[combsnp],
probe.suffix[2], shift.suffix[sh2], sep=""))
compstat[combsnp, scol[1, ]] <- NA
compstat[combsnp, scol[2, ]] <- NA
if (length(qw) > 0) {
compstat[combsnp, qcol[1, ]] <- NA
compstat[combsnp, qcol[2, ]] <- NA
}
} else {
#the current combination has the same shift for both probes,
#we delete only those versions of the probes because the other-shifted
#version might actually be the correct one (and there the combination
#might not have been made due to a lower or higher number of deleted
#scores):
omitmrk <- c(omitmrk,
paste(compstat$SNPname[combsnp],
probe.suffix[1], shift.suffix[sh1], sep=""),
paste(compstat$SNPname[combsnp],
probe.suffix[2], shift.suffix[sh2], sep=""))
#erase the segtype and qall for the removed markers:
compstat[combsnp, scol[1, sh1]] <- NA
compstat[combsnp, scol[2, sh2]] <- NA
if (length(qw) > 0) {
compstat[combsnp, qcol[1, sh1]] <- NA
compstat[combsnp, qcol[2, sh2]] <- NA
}
}
}
combscores <- combscores[!(combscores$MarkerName %in% omitmrk), ]
if (!is.na(combscorefile))
writeDatfile(combscores, combscorefile)
#adjust the count columns in compstat:
for (prb in 1:2) {
compstat[[paste("count", probe.suffix[prb], sep="")]] <-
rowSums(!is.na(compstat[, scol[prb,,drop=FALSE],drop=FALSE]))
}
if (shiftspresent) Rcols <- scol[3,1]+(0:3) else Rcols <- scol[3,1] #NOT +(0:1) ???
compstat[[paste("count", probe.suffix[3], sep="")]] <-
rowSums(!is.na(compstat[, Rcols, drop=FALSE]))
if (!is.na(compfile))
writeDatfile(compstat, compfile)
invisible(list(combscores=combscores, compstat=compstat))
} #removeRedundant
#'@title Convert the F1 dosage scores data.frame or file to polymapR
#'input matrix
#'@description Functions writeDosagefile, compareProbes and removeRedundant
#'produce score files. This function reads such a file and converts the result
#'to a data.frame as needed by the polymapR package.
#'
#'@usage F1Dosages2Matrix(dosages, outfile=NA, dec=".", sep=",")
#'
#'@param dosages name of an F1 dosage scores file as produced by functions
#'writeDosagefile, compareProbes or removeRedundant, or a data frame
#'read from such a file
#'@param outfile (path and) name of an output file, default NA means that no
#'file is written. The file is written by function writeDatfile; normally using
#'the default parameters but if file extension is 'csv', quote is set to TRUE
#'and the specified dec and sep parameters are used
#'@param dec character to use as decimal separator in the output file,
#'default "."; only used if extension of outfile is 'csv' (otherwise ".")
#'@param sep character to use as field separator in the output file,
#'default ","; only used if extension of outfile is 'csv' (otherwise
#'tab character)
#'
#'@return A matrix as needed by the polymapR package: one row per marker,
#'one column per sample, containing integer scores (0 .. ploidy) or NAs.
#'Row names are marker names, column names are sample names; the first two
#'columns are the (consensus) scores of Parent1 and Parent2.\cr
#'If outfile is not NA also an output file is written: if the extension is
#''csv' a csv file is written with the specified decimal and field separators
#'(default '.' and ',') and with row and column names quoted, else a
#'tab-separated file.
#'@export
F1Dosages2Matrix <- function(dosages, outfile=NA,
dec=".", sep=",") {
if (length(outfile) != 1) outfile <- outfile[1]
if (is.null(outfile) || outfile == "") outfile <- NA
if (!is.na(outfile) & !check.filename(outfile))
stop(paste("F1Dosages2Matrix: cannot write file", outfile))
if (is.character(dosages) && length(dosages) == 1)
dosages <- readDatfile(dosages)
if (!is.data.frame(dosages))
stop("F1Dosages2Matrix: dosages must be a data.frame or a filename")
parcol1 <- which(tolower(names(dosages)) == "parent1")
pstr <- leftstr(names(dosages)[parcol1], -1) #parent in upper/lower case
if (length(parcol1) > 1) {
stop("F1Dosages2Matrix: dosages must have one column 'parent1'")
} else if (length(parcol1) == 0) {
parcol1 <- which(names(dosages) == "P1")
if (length(parcol1) != 1)
stop("F1Dosages2Matrix: dosages must have one column 'parent1'")
pstr <- "P"
}
parcol2 <- which(names(dosages) == paste(pstr, 2, sep=""))
if (length(parcol2) != 1 || parcol2 != parcol1 + 1)
stop("F1Dosages2Matrix: column for parent1 and parent2 must be consecutive")
dosages <- dosages[, -(2:(parcol1-1))]
if (!is.na(outfile)) {
if (tolower(tools::file_ext(outfile)) == "csv") {
writeDatfile(dosages, outfile, quote = TRUE, sep=sep, dec=dec)
} else writeDatfile(dosages, outfile)
}
mat <- as.matrix(dosages[, -1])
rownames(mat) <- dosages[,1]
invisible(mat)
} #F1Dosages2Matrix
# expandUnknownParents *******************************************************
#'@title Generate markers with all combinations of parental scores
#'@description For markers where the segregation type in the F1 is known but the
#'parental consensus scores are missing, this function generates multiple
#'versions of the marker, each with a different combination of parental scores
#'matching the segregation and a different suffix to the marker name.
#'
#'@usage expandUnknownParents(scores, sep="@",
#'polysomic=TRUE, disomic=FALSE, mixed=FALSE,
#'ploidy, ploidy2, scorefile)
#'
#'@param scores a data frame as returned by writeDosagefile, or the combscores
#'item in the return value of compareProbes or removeRedundant
#'@param sep a short string (one or more characters, default "@")
#'to separate the original marker name from the consecutive letters
#'(a, b, c etc) that identify the different versions of the marker. Markers
#'that already have parental scores are not modified
#'@param polysomic TRUE or FALSE; should be the same as used by checkF1
#'@param disomic TRUE or FALSE; should be the same as used by checkF1
#'@param mixed TRUE or FALSE; should be the same as used by checkF1
#'@param ploidy the ploidy of parent 1 (must be even, 2 (diploid) or larger),
#'and the same as used by checkF1
#'@param ploidy2 the ploidy of parent 2. If omitted it is assumed to be equal
#'to ploidy. Should be the same as used by checkF1
#'@param scorefile a filename to which the result is written;
#'if NA no file is written
#'
#'@return A data frame with the same format and contents as parameter scores,
#'with each marker where parental scores were missing expanded to
#'multiple rows, one per parental dosage combination
#'@export
expandUnknownParents <- function(scores,
sep="@",
polysomic=TRUE, disomic=FALSE, mixed=FALSE,
ploidy, ploidy2,
scorefile) {
if (ploidy %% 2 != 0)
stop("expandUnknownParents: odd parental ploidy not allowed")
if (missing(ploidy2) || is.na(ploidy2)) ploidy2 <- ploidy else {
if (ploidy2 %% 2 != 0)
stop("expandUnknownParents: odd parental ploidy2 not allowed")
}
ploidyF1 <- (ploidy + ploidy2) / 2
if (!polysomic && !disomic && !mixed)
stop("expandUnknownParents: at least one of polysomic, disomic and mixed must be TRUE")
if (!is.na(scorefile) && !checkFilename(scorefile))
stop("expandUnknownParents: scorefile not valid or not writable")
seginfo <- calcSegtypeInfo(ploidy, ploidy2)
allsegtypenames <- names(seginfo)
seginfo <- selSegtypeInfo(seginfo, polysomic, disomic, mixed)
unpar <- which(is.na(scores$parent1) & is.na(scores$parent2))
if (length(unpar) > 0) {
#at least one marker needs to be expanded
scores$MarkerName <- as.character(scores$MarkerName)
scores$segtype <- as.character(scores$segtype)
p12 <- which(names(scores) == "parent1")
p12 <- p12:(p12+1)
newscores <- scores[0,] # an empty data frame with same columns
for (m in unpar) {
mname <- scores$MarkerName[m]
parscores <- seginfo[[scores$segtype[m]]]$pardosage
#we put the first version into the old scores data frame:
scores$MarkerName[m] <- paste(mname, "a", sep=sep)
scores[m, p12] <- parscores[1,]
#if there are more versions we add those to newscores:
if (nrow(parscores) > 1) {
mscores <- scores[rep(m, nrow(parscores)-1),] #copy the scores row several times
mscores$MarkerName <- paste(mname, letters[2:nrow(parscores)], sep=sep)
mscores[,p12] <- parscores[-1,]
newscores <- rbind(newscores, mscores)
}
} # for m in unpar
if (nrow(newscores) > 0) {
scores <- rbind(scores, newscores)
scores <- scores[order(scores$MarkerName),]
}
} #length(unpar)>0
if (!is.na(scorefile)) writeDatfile(scores, scorefile)
scores
} #expandUnknownParents
# checkFilename *************************************************************
#'@title Check if a file can be created
#'@description Checks if a file with that name can be created.
#'If successful, any pre-existing file does not exist any more.
#'
#'@usage checkFilename(filename, overwrite=TRUE)
#'
#'@param filename a file name with or without path. If the name contains a
#'path the entire path must already exist, else the file cannot be created
#'and the result is FALSE.
#'@param overwrite if TRUE an attempt is made to delete any pre-existing file
#'of that name; the function returns FALSE if the file is locked or the user
#'has no rights to delete the file. If FALSE the function returns FALSE
#'if any file of that name already exists, and no attempt is made to remove it
#'
#'@return TRUE if a new file can be created, else FALSE. If TRUE, no file of
#'that name exists (any more)
#'@export
checkFilename <- function(filename, overwrite=TRUE) {
filename <- filename[1] # we test only the first filename
if (!overwrite && file.exists(filename)) return(FALSE)
tryCatch(
{ suppressWarnings({
if (file.exists(filename)) file.remove(filename)
if (file.create(filename)) {
file.remove(filename)
TRUE
} else FALSE})
}, error = function(e) { FALSE }
)
} #checkFilename
# scores2wide *************************************************************
#'@title Convert a scores file from long to wide format
#'@description The fitMarkers function of package fitPoly returns a
#'scores file in "long" format: one row for each MarkerName / SampleName
#'combination. This function creates a file in "wide" format (samples in
#'columns, markers in rows) containing the data from the geno column of the
#'scores file. Faster, less memory-expensive and easier to use than reshape.
#'
#'@usage scores2wide(scores, outfile=NA)
#'
#'@param scores a data frame read from the scorefile output of saveMarkeModels
#'in package fitPoly. The order of the markers within each sample does not have
#'to be the same or v.v., there is no requirement for any ordering of scores.
#'However if scores is not ordered by MarkerName and SampleName and/or not all
#'markers are present for all samples (or v.v.) this results in a slower
#'processing.
#'@param outfile a file name to which the result is written. If NA no file
#'is written.
#'
#'@return A data frame with column names marker, MarkerName and all sample
#'names, and one row per marker. Marker is the sequential number of the marker
#'as reported in the scores data frame.
#'@export
scores2wide <- function(scores, outfile=NA) {
outfile <- outfile[1]
if (!is.na(outfile) && !checkFilename(outfile))
stop("scores2wide: outfile is not a valid or writable file")
#convert Markername and sample to factors, eliminating unused levels and
#with levels sorted according to the current locale:
markernrs <- scores$marker
sampnames <- as.character(unique(scores$SampleName))
mrknames <- as.character(unique(scores$MarkerName))
#get the original marker(number) of each marker present:
marknrs <- tapply(markernrs, factor(scores$MarkerName),
FUN=function(x) mean(x, na.rm=TRUE))
rm(markernrs)
#check if scores is complete and sorted:
if (nrow(scores) > length(sampnames) * length(mrknames))
stop("scores2wide: each combination of markername and sample may occur only once")
df <- expand.grid(SampleName=sampnames, MarkerName=mrknames,
stringsAsFactors=FALSE) #needed!
if (nrow(scores) == length(sampnames) * length(mrknames)) {
#sort scores: it is important that within all samples all markers are in the
#same order:
if (sum(df$MarkerName != scores$MarkerName) > 0 ||
sum(df$SampleName != scores$SampleName) > 0) {
scores <- scores[order(scores$MarkerName, scores$SampleName),]
}
} else {
#scores incomplete: not all combinations of markername and sample occur
#(and some might be double, but we don't check for that)
#we make it complete and sort it:
scores <- merge(x=scores, y=df, all.y=TRUE, sort=TRUE)
}
#convert the geno column to wide format:
wide <- matrix(integer(length(mrknames) * length(sampnames)),
nrow=length(mrknames), ncol=length(sampnames))
keepcols <- match(c("MarkerName", "SampleName", "geno"), names(scores))
batchsize <- 10 #there are usually far less samples than markers
maxsamp <- 0
while (maxsamp < length(sampnames)) {
minsamp <- maxsamp + 1
maxsamp <- min(minsamp + batchsize - 1, length(sampnames))
batchscores <- scores[scores$SampleName %in% sampnames[minsamp:maxsamp],
keepcols]
for (samp in minsamp:maxsamp) {
sampscores <- batchscores[batchscores$SampleName == sampnames[samp],]
# if (checksc) {
# missmrk <- setdiff(1:length(mrknames), as.integer(sampscores$MarkerName))
# if (length(missmrk) > 0) {
# sampscores <- sampscores[1:length(mrknames),] #extra rows with NA
# sampscores$MarkerName[(length(mrknames)-length(missmrk)+1):
# length(mrknames)] <- mrknames[missmrk]
# }
# }
# if (samp == 1) {
# #check if markers in order of mrknames and if not, rearrange mrknames:
# marknrs <- sampscores$marker
# if (setequal(sampscores$MarkerName, mrknames)) {
# mrknames <- as.character(sampscores$MarkerName)
# } else stop ("scores2wide: all markers must be present for all samples and v.v.")
# } else {
# #samp > 1, markers must be (put) in same order
# if (!setequal(sampscores$MarkerName, mrknames))
# stop ("scores2wide: all markers must be present for all samples and v.v.")
# if (sum(sampscores$MarkerName != mrknames) > 0)
# sampscores <- sampscores[match(mrknames, sampscores$MarkerName),]
# }
wide[, samp] <- sampscores$geno
}
}
wide <- data.frame(marker = marknrs,
MarkerName = mrknames,
wide)
colnames(wide) <- c("marker", "MarkerName", sampnames)
rownames(wide) <- NULL
if (!is.na(outfile)) {
writeDatfile(wide, outfile)
}
invisible(wide)
} #scores2wide
# samplestats *************************************************************
#'@title Statistics for each F1 sample over all markers
#'@description Tabulate for each F1 sample in how many markers it has a
#'missing, invalid or valid score.
#'
#'@usage samplestats(chk, scores, F1, qall_flavor="qall_mult",
#'qall_threshold=0, ploidy, ploidy2, scores_long=TRUE)
#'
#'@param chk a data frame as returned by checkF1
#'@param scores A data frame as read from the scores file produced by function
#'fitMarkers of package fitPoly (or a subset with at least columns
#'MarkerName, SampleName and geno), or a data frame as returned by function
#'scores2wide. In the first case (default) parameter scores_long must be TRUE,
#'in the second case it must be FALSE.
#'@param F1 character vector with the sample names of the F1 individuals
#'@param qall_flavor which quality parameter column must be shown in compfile,
#'default "qall_mult". If no quality data are wanted, specify "".
#'@param qall_threshold only markers with a qall score > qall.threshold are
#'included in the tabulation
#'@param ploidy the ploidy of parent 1 (must be even, 2 (diploid) or larger),
#'and the same as used by checkF1 to calculate the chk data frame
#'@param ploidy2 the ploidy of parent 2. If omitted it is assumed to be equal
#'to ploidy. Should be the same as used by checkF1 to calculate the chk
#'data frame
#'@param scores_long TRUE if scores is in "long format", FALSE if it is in
#'"wide format" (see parameter scores)
#'
#'@return A matrix with samples in rows and 3 columns: missing, invalid, valid,
#'giving for each sample the number of markers where it has a missing, invalid
#'or valid dosage score
#'@export
samplestats <- function(chk, scores,
F1,
qall_flavor="qall_mult",
qall_threshold=0,
ploidy, ploidy2,
scores_long=TRUE) {
if (ploidy %% 2 != 0) stop("samplestats: odd parental ploidy not allowed")
if (missing(ploidy2) || is.na(ploidy2)) ploidy2 <- ploidy else
if (ploidy2 %% 2 != 0) stop("samplestats: odd parental ploidy2 not allowed")
ploidyF1 <- (ploidy + ploidy2) / 2
if (scores_long) scores <- scores2wide(scores)
#select and sort scores to have only the markers with qall>=threshold in chk,
#with markers sorted as chk and individuals as F1
#delete from chk all rows without a segtype (mostly those that are not fitted by fitPoly):
chk <- chk[!is.na(chk$bestParentfit) & chk$bestParentfit != "",]
qallcol <- which(tolower(names(chk)) == qall_flavor)
if (length(qallcol) > 1) qallcol <- qallcol[1]
chk <- chk[chk[,qallcol] > qall_threshold,]
F1 <- intersect(F1, colnames(scores)) #keeps the order of F1
scores_samp <- names(scores)[3:length(scores)] #ALL samples as ordered in scores
scores <- scores[,c(1,2, match(F1, colnames(scores)))]
mrk <- intersect(chk$MarkerName, scores$MarkerName) #keeps order of markers in chk
scores <- scores[match(mrk, scores$MarkerName),]
#calculate a list of valid dosage scores for each marker in chk:
validscores <- list()
validscores[[nrow(chk)]] <- integer(0)
seginfo <- calcSegtypeInfo(ploidy, ploidy2)
#nested function, to use in lapply
get.expgeno <- function(segtypenr) { seginfo[[segtypenr]]$expgeno }
mrk.segtype <- match(chk$bestParentfit, names(seginfo))
if (sum(is.na(mrk.segtype)) > 0)
stop("samplestats: not all segtypes in chk are valid, ploidy or ploidy2 are incorrect")
mrk.expgeno <- lapply(as.list(mrk.segtype), FUN=get.expgeno)
#calculate the totals:
result <- matrix(integer(length(F1)*3), nrow=length(F1), ncol=3)
rownames(result) <- F1
colnames(result) <- c("missing", "invalid", "valid")
for (i in seq_along(F1)) {
sc <- scores[,i+2]
result[i,1] <- sum(is.na(sc))
result[i,3] <- sum(mapply("%in%", sc, mrk.expgeno), na.rm=TRUE)
}
result[,2] <- nrow(chk) - rowSums(result)
result
} #samplestats
# selMarkers_qall *********************************************************
#'@title Sample markers at several qall levels
#'@description In order to get an idea of the quality of markers with
#'different levels of qall, this function samples marker numbers at a
#'specified set of qall values.
#'
#'@usage selMarkers_qall(chk, qall_levels, mrkperlevel=1,
#'qall_flavor="qall_mult")
#'
#'@param chk a data frame as returned by checkF1
#'@param qall_levels a numeric vector with the levels of the quality
#'parameter (qall_mult or qall_weights, see parameter qall_flavo)
#'at which markers should be selected
#'@param mrkperlevel the number of markers to select per level of qall
#'@param qall_flavor which qall to use: by default qall_mult,
#'also qall_weights is possible
#'
#'@return A subset of data frame chk with selected rows, in ascending order
#'of the qall_flavor column. The first <mrkperlevel> markers with qall >=
#'each qall.level are returned; duplicated selections are removed (so the
#'number of returned markers may be less than
#'length(qall_levels) * mrkperlevel)
#'@export
selMarkers_qall <- function(chk, qall_levels, mrkperlevel=1,
qall_flavor="qall_mult") {
qallcol <- which(names(chk) == qall_flavor)
if (length(qallcol) != 1) stop ("selMarkers_qall: invalid qall_flavor")
chk <- chk[!is.na(chk[,qallcol]),]
if (is.factor(chk[,qallcol])) chk[,qallcol] <- as.character(chk[,qallcol])
if (is.character(chk[,qallcol])) {
chk[,qallcol][chk[,qallcol] %in% c("", "NA")] <- NA
chk[,qallcol] <- as.numeric(chk[,qallcol])
}
chk <- chk[order(chk[,qallcol]),]
qall.levels <- sort(qall_levels)
firstsel <- integer(0)
for (i in seq_along(qall.levels)) {
if (chk[nrow(chk), qallcol] < qall.levels[i]) {
firstsel[i] <- nrow(chk) + 1 #no qall below threshold
} else {
firstsel[i] <- min(which(chk[,qallcol] >= qall.levels[i]))
}
}
sel <- integer(0)
for (i in 1:mrkperlevel) sel <- c(sel, firstsel + i - 1)
sel <- sort(sel[!is.na(sel)])
sel <- sel[sel <= nrow(chk)]
sel <- sel[!duplicated(sel)]
chk[sel,]
} #selMarkers_qall
# combineFiles *********************************************************
#'@title Combine the X and Y intensity scores and the assigned dosage in one file
#'@description This function combines the X and Y intensity values from the
#'fitPoly input with the geno (assigned dosage) from the fitPoly scores
#'output. Useful for producing XY scatterplots with samples colored
#'according to the assigned dosage.
#'
#'@usage combineFiles(XYdata, scores, controls=character(0))
#'
#'@param XYdata data.frame with (at least) columns MarkerName,
#'SampleName, X, Y (if present, R and ratio are also copied)
#'@param scores data.frame with scores produced by the
#'fitMarkers function of package fitPoly. It has columns MarkerName
#'and SampleName that are subsets of MarkerName and SampleName in the XYdata,
#'and at least a column geno
#'@param controls a character vector of sample names. The geno (dosage)
#'value of these samples is set to NA
#'@return a dataframe with columns MarkerName, SampleName, X, Y (and R and
#'ratio if these columns were present in XYdata), and geno, with all markers
#'and samples from XYdata. The value of geno is set to NA for all MarkerName /
#'sample combinations not present in scores, and for all samples in controls.
#'@export
combineFiles <- function(XYdata, scores, controls=character(0)) {
#NOTE: $geno is NOT increased by 1 (as it was in earlier versions)
if (!is.data.frame(XYdata) ||
length(which(names(XYdata) %in% c("MarkerName", "SampleName", "X", "Y")))
!= 4)
stop("XYdata must be a data.frame with at least columns MarkerName, SampleName, X and Y")
XYcolumns <- match(c("MarkerName", "SampleName", "X", "Y", "R", "ratio"),
names(XYdata))
XYcolumns <- XYcolumns[!is.na(XYcolumns)]
XYdata <- XYdata[, XYcolumns]
if (!is.data.frame(scores) ||
length(which(names(scores) %in% c("MarkerName", "SampleName", "geno")))
!= 3)
stop("scores must be a data.frame with at least columns MarkerName, SampleName and geno")
scores <- data.frame(MarkerName=scores$MarkerName,
SampleName=scores$SampleName,
geno=scores$geno)
alldat <- merge(XYdata, scores, by=c("MarkerName", "SampleName"), all=TRUE)
alldat$MarkerName <- factor(alldat$MarkerName)
alldat$geno[alldat$SampleName %in% controls] <- NA
alldat$SampleName <- factor(alldat$SampleName)
alldat
} #combineFiles
# correctDosages *********************************************************
calcPotShifts <- function(segtypes, ploidy) {
#calculate potential shifts in dosages
#segtypes: vector of segregation types
#result: data.frame with columns
# segtype: the segtype code before the underscore i.e. without the
# first dosage class
# mindos: the first dosage class
# shift: -1, 0 or 1: potental shifts to try out
# for each segtype whether a shift of +/- 1 should be
#tried or not (0)
#(only when number of dosages in segtype is <= ploidy-2 and
#min.dosage==1 or max.dosage==ploidy-1)
shift <- rep(0, length(segtypes))
segtypes[is.na(segtypes) | segtypes==""] <- "x_0" #avoid problems with strsplit
segtypes <- do.call(rbind, strsplit(as.character(segtypes), split="_"))
numdos <- nchar(segtypes[,1]) #number of dosage classes in segtype
compl <- substr(segtypes[,1],1,1) == "c"
lencompl <- as.integer(substr(segtypes[compl, 1], 2, numdos[compl]-1))
# nr of dosages in complex segtype
numdos[compl] <- lencompl
mindos <- as.integer(segtypes[,2])
#shift down: numdos <= ploidy-2 and lowest dosage = 1:
shift[numdos <= ploidy-2 & mindos == 1] <- -1
#shift up: numdos <= ploidy-2 and highest dosage = ploidy-1:
maxdos <- mindos + numdos - 1
shift[numdos <= ploidy-2 & maxdos == ploidy-1] <- 1
data.frame(seg=segtypes[,1], mindos=mindos, shift=shift,
stringsAsFactors=FALSE)
}
#'@title Check if dosage scores may have to be shifted
#'@description fitPoly sometimes uses a "shifted" model to assign dosage
#'scores (e.g. all samples are assigned a dosage one higher than the true
#'dosage). This happens mostly when there are only few dosages present
#'among the samples. This function checks if a shift of +/-1 is possible.
#'
#'@usage correctDosages(chk, scores, parent1, parent2, ploidy, ploidy2,
#'polysomic=TRUE, disomic=FALSE, mixed=FALSE, parentsScoredWithF1,
#'absent.threshold=0.04, outfile=NA)
#'
#'@param chk data frame returned by function checkF1 when called without
#'shiftmarkers
#'@param scores data.frame with scores as produced by the fitMarkers
#'function of package fitPoly; at least columns MarkerName, SampleName and geno
#'must be present, any other columns are ignored
#'@param parent1 character vector with names of the samples of parent 1
#'@param parent2 character vector with names of the samples of parent 2
#'@param ploidy The ploidy of parent 1 (must be even, 2 (diploid) or larger).
#'@param ploidy2 The ploidy of parent 2. If omitted it is
#'assumed to be equal to ploidy.
#'@param polysomic if TRUE at least all polysomic segtypes are considered;
#'if FALSE these are not specifically selected (but if e.g. disomic is TRUE,
#'any polysomic segtypes that are also disomic will still be considered);
#'same as used in the call to checkF1 that generated data.frame chk
#'@param disomic if TRUE at least all disomic segtypes are considered (see
#'param polysomic); same as used in the call to checkF1 that generated
#'data.frame chk
#'@param mixed if TRUE at least all mixed segtypes are considered (see
#'param polysomic). A mixed segtype occurs when inheritance in one parent is
#'polysomic (random chromosome pairing) and in the other parent disomic (fully
#'preferential chromosome pairing); same as used in the call to checkF1 that
#'generated data.frame chk
#'@param parentsScoredWithF1 single logical. TRUE means that parents are scored
#'in the same experiment and the same fitPoly run as the F1, else FALSE.
#'If missing and ploidy2==ploidy, TRUE is assumed.
#'If FALSE, parental scores will not be shifted along with the F1 scores.
#'@param absent.threshold the threshold for the fraction of ALL samples
#'that has the dosage that is assumed to be absent due to mis-fitting of
#'fitPoly; should be at least the assumed error rate of the fitPoly scoring
#'assuming the fitted model is correct
#'@param outfile file name to which the result is written. If NA no file
#'is written.
#'
#'@return a data frame with columns
#'\itemize{
#'\item{markername}
#'\item{segtype: the bestfit (not bestParentfit!) segtype from chk}
#'\item{parent1, parent2: the consensus parental dosages; possibly
#'low-confidence, so may be different from those reported in chk}
#'\item{shift: -1, 0 or 1: the amount by which this marker should be shifted}
#'}
#'The next fields are only calculated if shift is not 0:
#'\itemize{
#'\item{fracNotOk: the fraction of ALL samples that are in the dosage
#'(0 or ploidy) that should be empty if the marker is indeed shifted.}
#'\item{parNA: the number of parental dosages that is missing (0, 1 or 2)}
#'}
#'@details A shift of -1 (or +1) is proposed when (1) the fraction of all
#'samples with dosage 0 (or ploidy) is below absent.threshold, (2) the
#'bestfit (not bestParentfit!) segtype in chk has one empty dosage on the
#'low (or high) side and more than one empty dosage at the high (or low) side,
#'and (3) the shifted consensus parental dosages do not conflict with the
#'shifted segregation type.\cr
#'The returned data.frame (or a subset, e.g. based on the values in the
#'fracNotOk and parNA columns) can serve as parameter shiftmarkers in a
#'new call to checkF1.\cr
#'Based on the quality scores assigned by checkF1 to
#'the original and shifted versions of each marker the user can decide if
#'either or both should be kept. A data.frame combining selected rows
#'of the original and shifted versions of the checkF1 output (which may
#'contain both a shifted and an unshifted version of some markers) can then be
#'used as input to compareProbes or writeDosagefile.
#'@export
correctDosages <- function(chk, scores, parent1, parent2, ploidy, ploidy2,
polysomic=TRUE, disomic=FALSE, mixed=FALSE,
parentsScoredWithF1,
absent.threshold=0.04, outfile=NA) {
#This function identifies markers that are probably misscored by fitPoly
#(bestfit in 1_1, 1_3, 11_1, 11_2 (if ploidy==4) and parents in same dosage
#classes;
# note that 1_1 and 1_3 can only occur if disomic=TRUE in check.F1)
#and produce data to suggests a revised scoring for F1 and parent, based on
#the segregation in an F1 population and the scores in all other samples
if (missing(ploidy2) || is.na(ploidy2)) {
ploidy2 <- ploidy
if (missing(parentsScoredWithF1)) parentsScoredWithF1 <- TRUE
} else
if (ploidy2 %% 2 != 0) stop("checkF1: odd ploidy2 not allowed")
ploidyF1 <- (ploidy + ploidy2) / 2
if (is.na(outfile)) outfile <- ""
if (outfile != "" && !checkFilename(outfile)) {
stop("correctDosages: invalid outfile")
}
#if (ploidy %% 2 != 0)
# stop("correctDosages: only even ploidy allowed (parents must have been genotyped together with F1")
if (sum(is.na(match(c(parent1, parent2), names(chk)))) > 0)
stop("correctDosages: not all parental samples in names of chk")
if ("shift" %in% names(chk))
stop("correctDosages: chk must be calculated without shift")
seginfo <- segtypeInfoSummary(calcSegtypeInfo(ploidy, ploidy2))
seginfo <- seginfo[(polysomic & seginfo$par.poly==1) |
(disomic & seginfo$par.di==1) |
(mixed & seginfo$par.mixed==1),]
chk <- chk2integer(chk)
if (is.factor(chk$bestfit)) chk$bestfit <- as.character(chk$bestfit)
NAcol <- which(names(chk) == "F1_NA") # last before parental samples
#calculate a (possibly very low conf) new parental consensus:
#(the one in chk may be based partly on bestParentfit)
for (p in 1:2) {
if (p==1) parent <- parent1 else parent <- parent2
pc <- which(names(chk) == "parent1") - 1 + p
chk[,pc] <- rep(NA, nrow(chk))
if (length(parent) > 0) {
#for calculating the consensus score (pcons) we need to calculate the
#parallel min and max over the columns of the parental samples;
pcons <- apply(chk[,(NAcol+1):(NAcol+length(parent)), drop=FALSE],
1, min) #no na.rm=TRUE !!
rw <- !is.na(pcons) & pcons ==
apply(chk[,(NAcol+1):(NAcol+length(parent)), drop=FALSE], 1, max)
chk[rw, pc] <- pcons[rw] #so the "old" parent1 and parent2 columns now
# have a new type of consensus score
NAcol <- NAcol + length(parent) #startpoint for parent2
}
}
checksamp <- as.character(unique(scores$SampleName))
if (!parentsScoredWithF1)
checksamp <- setdiff(checksamp, c(parent1, parent2))
scores <- scores[scores$MarkerName %in% as.character(chk$MarkerName) &
scores$SampleName %in% checksamp,]
scores$MarkerName <- factor(scores$MarkerName) #drop unused levels
nonNA <- function(x) sum(!is.na(x))
is0 <- function(x) sum(x==0, na.rm=TRUE)
isploidy <- function(x) sum(x==ploidyF1, na.rm=TRUE)
scores.nonNA <- tapply(scores$geno, scores$MarkerName, FUN=nonNA)
scores.frc0 <- tapply(scores$geno, scores$MarkerName, FUN=is0)
scores.frc0 <- scores.frc0 / scores.nonNA
scores.frcploidy <- tapply(scores$geno, scores$MarkerName, FUN=isploidy)
scores.frcploidy <- scores.frcploidy / scores.nonNA
scores.frc0 <- scores.frc0[order(names(scores.frc0))] #needed!
scores.frcploidy <- scores.frcploidy[order(names(scores.frcploidy))] #needed!
#these are now in alphabetical order; do the same with the chk data frame:
ch.ord <- order(chk$MarkerName) #we need this later
chk <- chk[ch.ord,]
potShifts <- calcPotShifts(chk$bestfit, ploidyF1) #potential shift value of
# segtype for each marker
#for which rows do we need to check a potential shift?
mf <- !is.na(potShifts$shift) &
( (potShifts$shift == -1 & scores.frc0 <= absent.threshold) |
(potShifts$shift == 1 & scores.frcploidy <= absent.threshold) )
if (parentsScoredWithF1) {
shfpar <- cbind(chk$parent1 + potShifts$shift,
chk$parent2 + potShifts$shift)[mf,]
} else shfpar <- cbind(chk$parent1, chk$parent2)[mf,]
# (non)shifted parental consensus dosages, only for the mf markers
shf_segtype <- paste(potShifts[mf, 1],
(potShifts[mf, 2] + potShifts[mf, 3]), sep= "_")
# shifted segtype: original first dosage + shift, only for the mf markers
match_p_seg <- function(segtype, p1, p2) {
# for one segtype: do the parental consensus dosages (p12) match it?
if (is.na(p1) && is.na(p2)) return(TRUE)
si <- seginfo[seginfo$segtype==segtype, 3:4] #parental dosage columns
if (is.na(p1)) return(p2 %in% si[,2])
#p1 now not NA
if (is.na(p2)) return(p1 %in% si[,1])
return(any(si[,1]==p1 & si[,2]==p2))
}
goodshifts <- as.logical(mapply(FUN=match_p_seg,
shf_segtype, shfpar[,1], shfpar[,2]))
# as.logical needed because mapply returns an empty list if mf all FALSE
shiftmf <- rep(0, sum(mf))
shiftmf[goodshifts] <- potShifts$shift[mf][goodshifts]
p1namf <- p2namf <- onepokmf <- bothpokmf <- bothpnamf <- rep(NA, sum(mf))
p1namf[goodshifts] <- is.na(shfpar[goodshifts, 1])
p2namf[goodshifts] <- is.na(shfpar[goodshifts, 2])
nmrk <- nrow(chk)
P1na = rep(NA, nmrk)
P2na = rep(NA, nmrk)
result <- data.frame(
MarkerName = chk$MarkerName,
segtype = chk$bestfit,
parent1 = chk$parent1,
parent2 = chk$parent2,
shift = rep(0, nmrk),
fracNotOk = rep(NA, nmrk) #fraction of all samples in the forbidden class
# (0 or ploidyF1)
)
result$shift[mf] <- shiftmf
#the next columns will get non-NA values only for accepted shifts
result$fracNotOk[result$shift == -1] <- scores.frc0[result$shift == -1]
result$fracNotOk[result$shift == 1] <- scores.frcploidy[result$shift == 1]
P1na[mf] <- p1namf
P2na[mf] <- p2namf
result$ParNA <- P1na + P2na #0, 1 or 2
#sort result into same order as original chk:
result <- result[order(ch.ord),]
if (outfile != "")
writeDatfile(result, outfile)
invisible(result)
} #correctDosages
#Functions to recover from a fitMarkers crash when the temporary batchfiles
#(RData files) are still there:
#'@title get the names of all batch files present
#'@description This function gets the names of all batch files already produced
#'by fitMarkers; used to recover data after a crash.
#'@usage getBatchFiles(filePrefix, set)
#'@param filePrefix the same filePrefix as used in the fitMarkers call
#'@param set one of "models" or "scores"
#'@return a vector of batch file names, either the scores set or the models set
#'@export
getBatchFiles <- function(filePrefix, set) {
#filePrefix is the same as that for fitMarkers, including path if needed
#set should be one of "models" or "scores"
#result: a list of all temporary batch-filenames for either scores or models
allfiles <- dir()
fname_start <- paste(filePrefix, "_sMM", set, sep="")
len <- nchar(fname_start)
sort(allfiles[substr(allfiles,1,len) == fname_start])
} #getBatchFiles
#'@title Construct the log, models and scores files from a set of batch files
#'@description Get the data saved by fitMarkers in batch files before it
#'crashed and construct the log, models and score files just as fitMarkers
#'would have done
#'@usage concatbatch(batchfiles)
#'@param batchfiles a vector of batch file names as returned by getBatchFiles
#'@details all batch files are assumed to contain a list with the same number
#'of element, each either a character vector or a data.frame. This function
#'concatenates the elements across the batch files.\cr
#'This may be useful if fitMarkers has already been running a long time
#'and then crashed. The partial (or perhaps complete) data can then be
#'recovered from the saved batch files, so that only the remaining markers
#'(if any) need to be processed afterwards.
#'@return a list with the concatenated elements: character vectors and/or
#'data.frames
#'@export
concatbatch <- function(batchfiles) {
#batchfiles is a vector of filenames as returned by getBatchFiles
#each of the files in batchfiles are assumed to contain a list; in each case
#with the same elements (data.frames and/or vectors)
#The result is a similar list where each element is a concatenation
#(with rbind() or c()) of the list elements from the batchfiles
#These can then be saved to files, after sorting if desired. They contain
#the same information as the output files of fitMarkers, but only for
#the completed batches.
vars <- load(batchfiles[1])
if (length(vars) != 1) stop("batchfiles not supported")
allbat <- get(vars)
if (!is.list(allbat)) stop("batchfiles not supported")
for (f in 2:length(batchfiles)) {
print(paste("f =", f))
vf <- load(batchfiles[f])
if (length(vf) != 1) stop("batchfiles different types")
nwbat <- get(vf)
if (!is.list(nwbat) || !identical(names(allbat), names(nwbat)))
stop("batchfiles different types")
for (i in 1:length(nwbat)) {
if (is.data.frame(nwbat[[i]])) {
allbat[[i]] <- rbind(allbat[[i]], nwbat[[i]])
} else if (is.vector(nwbat[[i]]))
allbat[[i]] <- c(allbat[[i]], nwbat[[i]])
}
}
allbat
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.