R/pop_predict.R

Defines functions pop_predict2 pop.predict2 pop.predict

Documented in pop.predict pop_predict2 pop.predict2

#' A genome-wide procedure for predicting genetic variance and correlated response in bi-parental breeding populations
#' 
#' @description \code{pop.predict} uses phenotypic and genotypic data from a set of individuals known as a training population (TP) and a set of candidate parents, which may or may not be included in the TP, to predict the mean (\eqn{\mu}), genetic variance (\emph{V_G}), and superior progeny values (\eqn{\mu}\emph{_sp}) of the half-diallel, or a defined set of pairwise bi-parental crosses between parents. When multiple traits are provided \code{pop.predict} will also predict the correlated responses and correlation between all pairwise traits. See \cite{Mohammadi, Tiede, and Smith (2015)} for further details.
#' 
#'              NOTE - \code{pop.predict} writes and reads files to disk so it is highly recommended to set your working directory
#' @param G.in \code{Matrix} of genotypic data. First row contains marker names and the first column contains entry (taxa) names. Genotypes should be coded as follows: \itemize{
#'          \item \code{1}: homozygous for minor allele
#'          \item \code{0}: heterozygous
#'          \item \code{-1}: homozygous for major allele
#'          \item \code{NA}: missing data
#'          \item Imputed genotypes can be passed, see \code{impute} below for details
#'          }
#'          TIP - Set header=\code{FALSE} within \code{\link{read.table}} or \code{\link{read.csv}} when importing a tab-delimited file containing data for \code{G.in}.
#' @param y.in \code{Matrix} of phenotypic data. First column contains entry (taxa) names found in \code{G.in}, regardless of whether the entry has a phenotype for any or all traits. Additional columns contain phenotypic data; column names should reflect the trait name(s). TIP - Set header=\code{TRUE} within \code{\link{read.table}} or \code{\link{read.csv}} when importing a tab-delimited file containing data for \code{y.in}.
#' @param map.in \code{Matrix} of genetic map data, three columns total. Column 1 contains marker names, column 2 contains chromosome number, and column 3 contains cM positions. TIP - Set header=\code{TRUE} within \code{read.table} or \code{read.csv} when importing a tab-delimited file contianing data for \code{map.in}.
#' @param crossing.table Optional \code{matrix} specifying which crosses are to be simulated, two columns total. Column 1 contains the first parent of the cross (Par1) and column 2 contains the second parent of the cross (Par2).
#' @param parents Optional \code{character vector}. If \code{parents="TP"} then only the entries (taxa) within the training population (i.e. are phenotyped for the trait) are considered as parents; all pairwise crosses will be simulated for these. User could otherwise provide a character vector of entry names; all pairwise crosses will be simulated for these.
#' @param tail.p Optional \code{numeric} indicating the percentile of the simulated progeny to be included into the calculation of \eqn{\mu}\emph{_sp} and correlated response. Default is \code{0.10}.
#' @param nInd Optional \code{integer} indicating the number of progeny simulated per cross, per iteration, using \code{\link[qtl]{sim.cross}} in R/qtl (\emph{Broman et al., 2003}). Default is \code{200}.
#' @param map.plot Optional \code{logical}. If \code{TRUE} then a plot of the genetic map will be generated by \code{\link[qtl]{plot.map}}. Default is \code{FALSE}.
#' @param impute Options include \code{c("EM", "mean", "pass")}. By default (i.e. \code{"EM"}), after filtering missing genotypic data will be imputed via the EM algorithm implemented in \code{\link{rrBLUP}} (\cite{Endelman, 2011}; \cite{Poland et al., 2012}). If \code{"mean"} missing genotypic data will be imputed via the 'marker mean' method, also implemented in \code{\link{rrBLUP}}. Enter \code{"pass"} if a pre-filtered and imputed genotype matrix is provided to \code{G.in}.
#' @param min.maf Optional \code{numeric} indicating a minimum minor allele frequency (MAF) when filtering \code{G.in}. Markers with an MAF < \code{min.maf} will be removed. Default is \code{0.01} to remove monomorphic markers. Set to \code{0} for no filtering.
#' @param mkr.cutoff Optional \code{numeric} indicating the maximum missing data per marker when filtering \code{G.in}. Markers missing > \code{mkr.cutoff} data will be removed. Default is \code{0.50}. Set to \code{1} for no filtering.
#' @param entry.cutoff Optional \code{numeric} indicating the maximum missing genotypic data per entry allowed when filtering \code{G.in}. Entries missing > \code{entry.cutoff} marker data will be removed. Default is \code{0.50}. Set to \code{1} for no filtering.
#' @param remove.dups Optional \code{logical}. If \code{TRUE} duplicate entries in the genotype matrix, if present, will be removed. This step may be necessary for missing marker imputation (see \code{impute} below). Default is \code{TRUE}.
#' @param nSim Optional \code{integer} indicating the number of iterations a population should be simulated for each pairwise cross. Returned values are reported as means of parameters estimated in each of \code{nSim} simulations. Default is \code{25}.
#' @param frac.train Optional \code{numeric} indicating the fraction of the TP that is used to estimate marker effects (i.e. the prediction set) under cross-validation (CV) method 1 (see \code{Details} in \code{\link{x.val}}). The remaining \eqn{(1-frac.trait)} of the TP will then comprise the prediction set.
#' @param nCV.iter Optional \code{integer} indicating the number of times to iterate \emph{CV method 1} (see \code{Details} in \code{\link{x.val}}). Default is \code{100}.
#' @param nFold Optional \code{integer}. If a number is provided, denoting the number of "folds", then CV will be conducted using \emph{CV method 2} (see \code{Details} in \code{\link{x.val}}). Default is \code{NULL}, resulting in the default use of the \emph{CV method 1}.
#' @param nFold.reps Optional \code{integer} indicating the number of times \emph{CV method 2} is repeated. The CV accuracy returned is the average \emph{r} of each rep. Default is \code{1}.
#' @param nIter,burnIn Optional \code{integer} arguments used by \code{\link[BGLR]{BGLR}} (\cite{de los Compos and Rodriguez, 2014}) when fitting Bayesian models to estimate marker effects. The defaults are \code{12000} and \code{3000}, respectively. These values when conducting CV are fixed \code{1500} and \code{500}, respectively, for computational efficiency.
#' @param models Optional \code{Character vector} of the regression models to be used in CV and to estimate marker effects. Options include \code{rrBLUP, BayesA, BayesB, BayesC, BL, BRR}, one or more may be included at a time. CV will be conducted regardless of how many models are included. By default all models are tested.
#' @param return.raw Optional \code{logical}. If \code{TRUE} then \code{pop.predict} will return the results of each simulation in addition to the summarized dataframe. Default is \code{FALSE}.
#' @details \code{pop.predict} can be used to predict the mean (\eqn{\mu}), genetic variance (\emph{V_G}), superior progeny values (\eqn{\mu}\eqn{_sp}), as well as the predicted correlated response and correlations between all pairwise traits. The methodology and procedure to do so has been described in \cite{Bernardo (2014)} and \cite{Mohammadi, Tiede, and K.P. Smith (2015)}. Users familiar with genome-wide prediction, association mapping, and/or linkage mapping will be familiar with the
#'          required inputs of \code{pop.predict}. \code{G.in} includes all of the entries (taxa) in the TP as well as additional entries to be considered as parent candidates. Entries included in \code{G.in} that do have a phenotype for any or all traits in \code{y.in} are considered TP entries for those respective traits. \code{G.in} is filtered according to \code{min.maf}, \code{mkr.cutoff}, \code{entry.cutoff}, and \code{remove.dups};
#'          remaining missing marker data is imputed using the EM algorithm (\cite{Poland et al., 2012}) when possible, and the marker mean otherwise, both implemented in \code{\link{rrBLUP}}. For each trait, the TP (i.e. entries with phenotype) is used to: \enumerate{
#'          \item Perform CV to select a regression model. NOTE - Using the model with the highest CV accuracy is expected to result in the most accurate marker effect estimates (\cite{Bernardo, 2014}). This expectation, however, is yet to be empirically validated and the user is encouraged to investigate the various models in order to make an educated decision about which one to ultimately use.
#'          \item Estimate marker effects using the model resulting in the highest CV accuracy
#'          }
#'          Models include ridge regression BLUP implemented in \code{\link{rrBLUP}} (\cite{Endelman, 2011}) and BayesA, BayesB, BayesC\eqn{\pi}, Bayesian lasso (BL), and Bayesian ridge regression (BRR) implemented in \code{\link{BGLR}} (\cite{de los Compos and Rodriguez, 2014}).
#'          Information from the \code{map.in} is then used to simulate chromosomal recombination expected in a recombinant inbred line (i.e. \emph{F-infinity}) (\cite{Broman et al., 2003}) population (size=\code{nInd}). A function then converts the recombined chromosomal segments of the generic RIL population to the chromosomal segments of the population's respective parents and GEBVs of the simulated progeny are calculated.
#'          The simulation and conversion process is repeated \emph{s} times, where \emph{s} = \code{nSim}, to calculate dispersion statistics for \eqn{\mu} and \emph{V_G}; the remainder of the values in the \code{predictions} output are means of the \emph{s} simulations.  During each iteration the correlation (\emph{r}) and correlated response of each pairwise combination of traits is also calculated and their mean across \emph{n} simulations is returned.
#'          The correlated response of trait.B when predicting trait.A is the mean of trait.B for the (\eqn{\mu}\eqn{_sp}) of trait.A, and vice-versa; a correlated response for the bottom \code{tail.p} and upper \eqn{1-tail.p} is returned for each trait.
#'          
#'          A dataset \code{\link{think_barley.rda}} is provided as an example of the proper formatting of input files and also for users to become familiar with \code{pop.predict}.
#' @return A \code{list} containing: \itemize{ 
#'            \item \code{predictions} A \code{list} of dataframes containing predictions of (\eqn{\mu}), (\emph{V_G}), and (\eqn{\mu}\emph{_sp}). When multiple traits are provided the correlated responses and correlation between all pairwise traits is also included. More specifically, for a given trait pair the correlated response of the secondary trait with both the high and low superior progeny of the primary trait is returned since the favorable values cannot be known by \code{PopVar}.
#'            \item \code{preds.per.sim} If return.raw is \code{TRUE} then a \code{dataframe} containing the results of each simulation is returned. This is useful for calculating dispersion statistics for traits not provided in the standard \code{predictions} dataframe.
#'            \item \code{CVs} A \code{dataframe} of CV results for each trait/model combination specified.
#'            \item \code{models.chosen} A \code{matrix} listing the statistical model chosen for each trait.
#'            \item \code{markers.removed} A \code{vector} of markers removed during filtering for MAF and missing data.
#'            \item \code{entries.removed} A \code{vector} of entries removed during filtering for missing data and duplicate entries.
#'          }
#' @references 
#'      Bernardo, R. 2014. Genomewide Selection of Parental Inbreds: Classes of Loci and Virtual Biparental Populations. Crop Sci. 55:2586-2595.
#'      
#'      Broman, K. W., H. Wu, S. Sen and G.A. Churchill. 2003. R/qtl: QTL mapping in experimental crosses. Bioinformatics 19:889-890.
#'      
#'      Endelman, J. B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4:250-255. doi: 10.3835/plantgenome2011.08.0024
#'
#'      Gustavo de los Campos and Paulino Perez Rodriguez, (2014). BGLR: Bayesian Generalized Linear Regression. R package version 1.0.3. http://CRAN.R-project.org/package=BGLR
#'      
#'      Mohammadi M., T. Tiede, and K.P. Smith. 2015. PopVar: A genome-wide procedure for predicting genetic variance and correlated response in bi-parental breeding populations. Crop Sci. \emph{Accepted}.
#'      
#'      Munoz-Amatriain, M., M. J. Moscou, P. R. Bhat, J. T. Svensson, J. Bartos, P. Suchankova, H. Simkova, T. R. Endo, R. D. Fenton, S. Lonardi, A. M. Castillo, S. Chao, L. Cistue, A. Cuesta-Marcos, K. L. Forrest, M. J. Hayden, P. M. Hayes, R. D. Horsley, K. Makoto, D. Moody, K. Sato, M. P. Valles, B. B. H. Wulff, G. J. Muehlbauer, J. Dolezel, and T. J. Close. 2011 An improved consensus linkage map of barley based on flow-sorted chromosomes and single nucleotide polymorphism markers. Plant Gen. 4:238-249.
#'      
#'      Poland, J., J. Endelman, J. Dawson, J. Rutkoski, S. Wu, Y. Manes, S. Dreisigacker, J. Crossa, H. Sanches-Villeda, M. Sorrells, and J.-L. Jannink. 2012. Genomic Selection in Wheat Breeding using Genotyping-by-Sequencing. Plant Genome 5:103-113.
#'      
#' @examples 
#' \dontrun{
#' ## View formatting
#' ## Use View() in RStudio or R GUI with X11 forwarding
#' ## Use head() in R GUI without X11 forwarding
#' View(G.in_ex)
#' View(y.in_ex)
#' View(map.in_ex)
#' View(cross.tab_ex)
#' 
#' ## setwd() - pop.predict writes and reads files to disk
#' ##   so it is recommended to set your working directory
#' 
#' ## nSim and nFold are set to low values in the
#' ## examples for sake of computing time
#' 
#' ## Ex. 1 - Predict a defined set of crosses
#' ## This example uses CV method 1 (see Details of x.val() function)
#' ex1.out <- pop.predict(G.in = G.in_ex, y.in = y.in_ex, 
#'    map.in = map.in_ex, crossing.table = cross.tab_ex,
#'    nSim=5, nCV.iter=10)
#' ex1.out$predictions  ## Predicted parameters
#' ex1.out$CVs          ## CV results
#'                
#' ## Ex. 2 - Predict all pairwise crosses between a list of parents
#' ## This example uses CV method 2 (see Details of x.val() function)
#' par.list <- sample(y.in_ex[,1], size = 10, replace = F)
#' ex2.out <- pop.predict(G.in = G.in_ex, y.in = y.in_ex,
#'    map.in = map.in_ex, parents = par.list, 
#'    nSim=5, nFold=5, nFold.reps=2)
#'        
#' ## Ex. 3 - Use only rrBLUP and Bayesian lasso (BL) models
#' ex3.out <- pop.predict(G.in = G.in_ex, y.in = y.in_ex,
#'    map.in = map.in_ex, crossing.table = cross.tab_ex,
#'    models = c("rrBLUP", "BL"), nSim=5, nCV.iter=10)  
#' }
#' 
#' @importFrom utils write.table read.csv capture.output setTxtProgressBar txtProgressBar
#' @importFrom stats var quantile 
#' 
#' @export
#' 
pop.predict <- function(G.in=NULL, y.in=NULL, map.in=NULL, crossing.table=NULL, 
                        parents=NULL, tail.p=0.10, nInd=200, map.plot=F, min.maf=0.01, 
                        mkr.cutoff=0.50, entry.cutoff=0.50, remove.dups=T, impute="EM", 
                        nSim=25, frac.train=0.60, nCV.iter=100, nFold=NULL, nFold.reps=1, 
                        nIter=12000, burnIn=3000, models=c("rrBLUP", "BayesA", "BayesB","BayesC", "BL", "BRR"), 
                        return.raw=F){
  
  ## QC steps
  if(is.null(G.in)) stop("Must provide a genotype (G.in) file.")
  if(is.null(y.in)) stop("Must provide a phenotype (y.in) file.")
  if(is.null(map.in)) stop("Must provide a map (map.in) file.")
  if(!is.null(min.maf) & min.maf >= 1) stop("min.maf must be within the range [0, 1)")
  if(!is.null(entry.cutoff) & entry.cutoff > 1) stop("entry.cutoff must be within the range (0, 1]")
  if(!is.null(mkr.cutoff) & mkr.cutoff > 1) stop("mkr.cutoff must be within the range (0, 1]")
  if(impute == "pass"){min.maf=0; mkr.cutoff=1; entry.cutoff=1}
  
  ### Requird functions found in 'Internal_PopVar_functions_2.20.15.R'  
  
  ###### START HERE ##############
  ## Step 1 - Parse out Geno and Map files
  G.entries <- as.character(G.in[-1, 1])
  entries.removed <- NULL; entries.to.remove <- c() ## This is needed for output, may be replaced by list of entries if filtering is enabled or if duplicate entries found
  G.markers <- as.character(t(G.in[1, -1]))
  map.markers <- as.character(map.in[,1])
  mkrs.removed <- NULL; mkrs.to.remove <- c()  ## This is needed for output, may be replaced by list of markers if filtering is enabled
  
  ## Marker, map, and geno QC
  if(!all(G.markers %in% map.markers) & all(map.markers %in% G.markers)) stop("Markers in Genotype matrix and genetic map do not completely match.")
  map <- map.in[order(map.in[,2], map.in[,3]), ] ## Sort map by chr then cM pos
  G.mat <- as.matrix(G.in[-1, -1]); class(G.mat) <- "numeric"
  G.mat <- G.mat[,order(match(map.markers, G.markers))] ## Sort G.mat so it is in same order as the map
  if(impute != "pass" && !all(unique(G.mat[,1]) %in% c(-1, 0, 1, NA))) stop("\nNon-imputed genotypes need to be coded as -1, 0, 1.\nIf imputed genotypes are passed set impute = 'pass'.")
  
  ### Filter markers first for missing data and MAF
  mkrs.to.remove <- c()
  if(min.maf > 0){maf.list <- apply(G.mat, 2, maf_filt); mkrs.to.remove <- c(mkrs.to.remove, which(maf.list < min.maf))}
  if(mkr.cutoff <1){mkrNA.list <- apply(G.mat, 2, function(M){return(length(which(is.na(M))) / length(M))})
  mkrs.to.remove <- unique(c(mkrs.to.remove, which(mkrNA.list > mkr.cutoff)))}
  
  if(length(mkrs.to.remove > 0)){
    G.mat <- G.mat[, -mkrs.to.remove]
    map <- map[-mkrs.to.remove, ]
    mkrs.removed <- map.markers[mkrs.to.remove]
    map.markers <- map.markers[-mkrs.to.remove]
  }
  
  ### Filter for duplicated entries and missing data
  entries.to.remove <- c()
  if(remove.dups) entries.to.remove <- c(entries.to.remove, which(duplicated.array(G.mat)))
  if(entry.cutoff < 1){entryNA.list <- apply(G.mat, 1, function(E){return(length(which(is.na(E))) / length(E))})
  entries.to.remove <- unique(c(entries.to.remove, which(entryNA.list > entry.cutoff)))}
  
  if(length(entries.to.remove > 0)){
    G.mat <- G.mat[-entries.to.remove, ]
    entries.removed <- G.entries[entries.to.remove]
    G.entries <- G.entries[-entries.to.remove]
    if(!is.null(crossing.table)){ ## Removes crosses from crossing.table that include a parent in entries.removed
      cross.tabl.1col <- rbind(cbind(as.numeric(row.names(crossing.table)), as.character(crossing.table[,1])), cbind(as.numeric(row.names(crossing.table)), as.character(crossing.table[,2])))
      tab.rows.2remove <- as.numeric(unique(unlist(sapply(entries.removed, function(X){return(cross.tabl.1col[grep(X, cross.tabl.1col[,2]), 1])}))))
      if(length(tab.rows.2remove) > 0) crossing.table <- crossing.table[-tab.rows.2remove, ]
    }
    if(!is.null(parents)){parents <- parents[!parents %in% entries.removed]} ## Removes entries that are included in entries.removed from parent list 
  }
  
  y <- as.matrix(y.in[match(G.entries, as.character(y.in[,1])),])
  y.entries <- as.character(y[,1])
  traits <- as.character(colnames(y))[-1]; nTraits <- length(traits)
  
  
  ## Format map and write out
  name4out <- sample(10000:99999, 1)
  t.map <- t(map); rownames(t.map) <- NULL
  map4out <- cbind(c("pheno", "", ""), t.map)
  write.table(map4out, paste("map.tmp_", name4out, ".csv", sep=""), row.names=F, col.names=F, sep=",")
  
  
  ### Read in genetic map for the markers
  options(warn=-1) ## Temporarily turn off warnings
  read.map.out <-capture.output(read.map <- qtl::read.cross(format="csv", crosstype = "riself", file= paste("map.tmp_", name4out, ".csv", sep=""), na.strings="NA"))
  print(paste("Number of Markers Read in: ", unlist(strsplit(read.map.out[3], split = " "), recursive = T)[2], sep = ""), quote=F)
  unlink(paste("map.tmp_", name4out, ".csv", sep=""))
  map_t1 <- qtl::pull.map(read.map)
  options(warn=0)
  if(map.plot==T) qtl::plot.map(map_t1)
  
  
  ## Imput missing markers with EM... will switch to imputing with the mean if nEntries > nMarkers
  ## Will need to use our own MAF filter so that we can keep track of which markers are removed due to MAF and missing data
  if(impute == "EM") G.imp <- rrBLUP::A.mat(G.mat, min.MAF = 0, max.missing = 1, impute.method = "EM", return.imputed = T)$imputed
  if(impute == "mean") G.imp <- rrBLUP::A.mat(G.mat, min.MAF = 0, max.missing = 1, impute.method = "mean", return.imputed = T)$imputed
  if(impute == "pass") G.imp <- G.mat
  
  ### Start simulation
  for(t in 1:nTraits){
    
    trait <- traits[t]
    
    ### the y.noNA and G.noNA define the TP and are used for CV and marker effect estimation
    
    y_notNAs <- !is.na(y[,trait])
    
    y_TP <- as.numeric(y[y_notNAs, trait])
    TP.entries <- y.entries[y_notNAs]
    G_TP <- G.imp[y_notNAs, ] 
    
    if(t==1){
      cat("\n")
      cat("Warnings about 'closing unused connections' AND 'Error in rinvGauss' can be safely disregarded... They are dealt with internally")
      cat("\n")
    }
    
    cat("\n")
    cat(paste("Selecting best model via cross validation for ", trait, " and estimating marker effects", sep=""))
    cat("\n")    
    
    if(is.null(nFold)) junk <- capture.output(xval.out <- XValidate_nonInd(y.CV = y_TP, G.CV = G_TP, models.CV = models, frac.train.CV=frac.train, nCV.iter.CV=nCV.iter, burnIn.CV = 750, nIter.CV = 1500)$CV.summary)
    if(!is.null(nFold)) junk <- capture.output(xval.out <- XValidate_Ind(y.CV = y_TP, G.CV = G_TP, models.CV = models, nFold.CV = nFold, nFold.CV.reps = nFold.reps, burnIn.CV = 750, nIter.CV = 1500)$CV.summary)
    
    if(t == 1){
      CV.results <- list()
      mkr_effs.mat <- matrix(NA, ncol = nTraits, nrow = ncol(G.imp))
      colnames(mkr_effs.mat) <- traits
      beta.list <- c()
      best.models <- c()
    }
    
    CV.results[[t]] <- xval.out
    names(CV.results)[[t]] <- trait
    
    best.model <- as.character(xval.out$Model[which(xval.out$r_avg == max(xval.out$r_avg))])
    best.models[t] <- best.model; names(best.models)[t] <- trait
    
    if(best.model == "rrBLUP"){
      mix.solve.out <- rrBLUP::mixed.solve(y_TP, Z=G_TP, SE=F, return.Hinv=F)
      beta <- as.numeric(mix.solve.out$beta)
      mkr_effects <- mix.solve.out$u
    }else{
      capture.output(bayes.fit <- BGLR::BGLR(y=y_TP, ETA=list(list(X=G_TP, model=best.model)), verbose=F, nIter=nIter, burnIn=burnIn))
      mkr_effects <- as.numeric(bayes.fit$ETA[[1]]$b)
      beta <- bayes.fit$mu  
    }
    
    beta.list[t] <- beta
    mkr_effs.mat[,t] <- mkr_effects
    
  }
  
  cat("\n")
  cat("Cross validation is complete!")
  
  ## Set up which crosses to predict
  if(!is.null(crossing.table)){ ## Used when the user provides a list of specific crosses, parents come from G.in.entries (i.e. not necessarily in TP)
    crossing.table <- as.matrix(crossing.table)
    crossing.mat <- par_position(crossing.table, par.entries = G.entries)$parent.position
    crosses.possible <- par_position(crossing.table, par.entries = G.entries)$crosses.possible
  }
  
  if(is.null(crossing.table) & !is.null(parents) & all(parents == "TP")){ ## Used to just make crosses among the TP -- useful for datasets where different traits have different levels of missing data
    par.G.pos <- match(TP.entries, G.entries)
    crossing.mat <- t(combn(par.G.pos, 2))
    crosses.possible <- par_name(crossing.mat, par.entries = G.entries)  
  }
  
  if(is.null(crossing.table) & all(parents != "TP") & !is.null(parents)){ ## Usd when a list of parent candidates is defined
    par.G.pos <- match(parents, G.entries)
    crossing.mat <- t(combn(par.G.pos, 2))
    crosses.possible <- par_name(crossing.mat, par.entries = G.entries)
  }
  
  if(is.null(crossing.table) & is.null(parents)){ ## If no crosses are specified then all combinations of the parent candidates are considered
    crossing.mat <- t(combn(1:nrow(G.imp), 2))
    crosses.possible <- par_name(crossing.mat, par.entries = G.entries)
  }
  
  
  ## Calculate GEBVs of all entries in G.imp
  par.BVs <- G.imp %*% mkr_effs.mat
  for(b in 1:length(beta.list)){
    beta.tmp <- beta.list[b]
    par.BVs[,b] <- par.BVs[,b] +  beta.tmp
  }
  
  
  ## Generates results dataframe for all traits
  #df.ncol <- (8 + 3*(nTraits-1))
  #if(nTraits == 1) df.ncol <- 11
  df.tmp <- data.frame(cbind(crosses.possible[,1], crosses.possible[,2], matrix(list(rep(NA, times=nSim)), nrow = nrow(crosses.possible), ncol = (8 + 3*(nTraits-1)))))
  names(df.tmp)[1:10] <- c("Par1", "Par2", "midPar.Pheno", "midPar.GEBV", "pred.mu", "pred.mu_sd", "pred.varG", "pred.varG_sd", "mu.sp_low", "mu.sp_high")
  for(n in 1:nTraits){
    if(n == 1) param.dfs <- list()
    param.dfs[[n]] <- as.matrix(df.tmp)
  }; names(param.dfs) <- paste(traits, "_param.df", sep="")
  
  
  ## Start simulation and var.prd process
  cat("\n")
  cat(paste("\nBrewing", nSim, "populations of", nInd, "individuals for each cross... Please be patient", sep=" "))
  cat("\n")
  prog.bar <- txtProgressBar(min=1,  max=(nrow(crossing.mat)*nSim), style=3); p=1
  M <- nInd
  
  for(s in 1:nSim){
    sim.pop <- qtl::sim.cross(map_t1, type="riself", n.ind = M, model=NULL)
    qtl::write.cross(sim.pop, "csv", paste("sim.pop.tmp_", name4out, sep="")) ## NEED to figure out a way to not read it out and in
    pop.mat <- as.matrix(read.csv(paste("sim.pop.tmp_", name4out, ".csv", sep=""), header=T))[3:(M+2), 2:(length(mkr_effects)+1)]
    unlink(paste("sim.pop.tmp_", name4out, ".csv", sep=""))   
    
    for(z in 1:nrow(crossing.mat)){
      setTxtProgressBar(prog.bar, p)
      pop.mat2 <- matrix(NA, nrow=nrow(pop.mat), ncol=ncol(pop.mat))
      
      par1 <- G.imp[crossing.mat[z,1], ]
      par2 <- G.imp[crossing.mat[z,2], ]
      
      ## Assign parent genotypes onto simulated RILs (create a function to do this?)
      for(r in 1:M){
        pop.mat2[r,which(pop.mat[r,]=="A")] <- par1[which(pop.mat[r,]=="A")]
        pop.mat2[r,which(pop.mat[r,]=="B")] <- par2[which(pop.mat[r,]=="B")]
      }
      
      mkr.has.0 <- apply(pop.mat2, 2, function(X){return(length(which(X == 0)))})
      replace.0.mat <- rbind(which(mkr.has.0 != 0), mkr.has.0[which(mkr.has.0 != 0)])
      
      if(ncol(replace.0.mat) > 0){
        for(b in 1:ncol(replace.0.mat)){
          pop.mat2[which(pop.mat2[,replace.0.mat[1,b]] == 0), replace.0.mat[1,b]] <- sample(c(1,-1), size = replace.0.mat[2,b],  replace = T, prob = c(.5,.5))
        }
      }
      
      prog_pred.mat <- pop.mat2 %*% mkr_effs.mat
      for(b in 1:length(beta.list)){
        beta.tmp <- beta.list[b]
        prog_pred.mat[,b] <- prog_pred.mat[,b] +  beta.tmp
      }
      
      for(n in 1:nTraits){
        if(s == 1){
          if(nTraits > 1) colnames(param.dfs[[n]])[11:(10+3*(nTraits-1))]<- c(paste("low.resp_", traits[-n], sep=""), paste("high.resp_", traits[-n], sep=""), paste("cor_w/_", traits[-n], sep=""))
          param.dfs[[n]][[z, "midPar.Pheno"]][s] <- 0.5*(as.numeric(y[crossing.mat[z,1], n+1]) + as.numeric(y[crossing.mat[z,2], n+1]))
          param.dfs[[n]][[z, "midPar.GEBV"]][s] <- 0.5*(as.numeric(par.BVs[crossing.mat[z,1], n]) + as.numeric(par.BVs[crossing.mat[z,2], n]))
        }
        param.dfs[[n]][[z, "pred.mu"]][s] <- mean(prog_pred.mat[, n])
        param.dfs[[n]][[z, "pred.mu_sd"]][s] <- mean(prog_pred.mat[, n])
        param.dfs[[n]][[z, "pred.varG"]][s] <- var(prog_pred.mat[, n])
        param.dfs[[n]][[z, "pred.varG_sd"]][s] <- var(prog_pred.mat[, n])
        param.dfs[[n]][[z, "mu.sp_low"]][s] <- tails(prog_pred.mat[, n], tail.p = tail.p)[2]
        param.dfs[[n]][[z, "mu.sp_high"]][s] <- tails(prog_pred.mat[, n], tail.p = tail.p)[1]
        ## Calculate correlations between traits and correlated response and 
        if(nTraits > 1){
          index <- 1; for(n2 in (1:nTraits)[-n]){
            param.dfs[[n]][[z, 10+index]][s] <- mean(prog_pred.mat[,n2][which(prog_pred.mat[,n] <= quantile(prog_pred.mat[,n], probs = tail.p))], na.rm = T)
            param.dfs[[n]][[z, 10+(nTraits-1)+index]][s] <- mean(prog_pred.mat[,n2][which(prog_pred.mat[,n] >= quantile(prog_pred.mat[,n], probs = 1-tail.p))], na.rm = T)
            param.dfs[[n]][[z, 10+2*(nTraits-1)+index]][s] <- cor(prog_pred.mat[,n], prog_pred.mat[,n2], use = "complete.obs")
            index <- index+1
          }
        }
      }
      
      p <- p+1 # This is the counter for the progress bar
    } ## End of z loop for each cross
  }## End of nSim (s) loop
  
  preds.per.sim <- param.dfs
  
  for(n in 1:nTraits){
    col.names <- colnames(param.dfs[[n]])
    for(c in 3:length(col.names)){
      name.tmp <- col.names[c]
      if(name.tmp %in% c("pred.mu_sd", "pred.varG_sd")) param.dfs[[n]][,c] <- sapply(param.dfs[[n]][,c], FUN = sd, na.rm=T)
      if(!name.tmp %in% c("pred.mu_sd", "pred.varG_sd")) param.dfs[[n]][,c] <- sapply(param.dfs[[n]][,c], FUN = mean, na.rm=T)
    }
  }
  
  if(nTraits == 1) param.dfs <- as.data.frame(param.dfs[[1]])
  
  if(return.raw) return(list(predictions=param.dfs, preds.per.sim=preds.per.sim, CVs=CV.results, models.chosen=best.models, markers.removed=mkrs.removed, entries.removed=entries.removed))
  if(!return.raw) return(list(predictions=param.dfs, CVs=CV.results, models.chosen=best.models, markers.removed=mkrs.removed, entries.removed=entries.removed))
  
} # End of pop.predict



#' Predict genetic variance and genetic correlations in bi-parental populations using a deterministic model
#'
#' @description
#' Generates predictions of the genetic variance and genetic correlation in bi-parental populations using a set of deterministic equations instead of simulations.
#'
#' @param G.in See \code{G.in} in \code{\link[PopVar]{pop.predict}}.
#' @param y.in See \code{y.in} in \code{\link[PopVar]{pop.predict}}.
#' @param map.in See \code{map.in} in \code{\link[PopVar]{pop.predict}}.
#' @param crossing.table See \code{crossing.table} in \code{\link[PopVar]{pop.predict}}.
#' @param parents See \code{parents} in \code{\link[PopVar]{pop.predict}}.
#' @param tail.p See \code{tail.p} in \code{\link[PopVar]{pop.predict}}.
#' @param self.gen The number of selfing generations in the potential cross. Can be an integer or \code{Inf} for
#' recombinant inbreds. Note: \code{self.gen = 1} corresponds to an F2 population.
#' @param DH Indicator if doubled-haploids are to be induced after the number of selfing generations indicated by
#' \code{self.gen}. For example, if \code{self.gen = 0} and \code{DH = TRUE}, then doubled-haploids are assumed
#' to be induced using gametes from F1 plants.
#' @param models See \code{models} in \code{\link[PopVar]{pop.predict}}.
#' @param ... Additional arguments to pass depending on the choice of \code{model}.
#'
#' @details
#'
#' Predictions are based on the deterministic equations specified by Zhong and Jannink (2007), Allier et al. (2019),
#' and Neyhart et al. (2019).
#' 
#' @examples 
#' 
#' \dontrun{
#' 
#' # Load data
#' data("think_barley")
#' 
#' # Use example data to make predictions
#' out <- pop.predict2(G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex, 
#'                     crossing.table = cross.tab_ex)
#'                     
#' # Provide a vector of parents to predict all possible crosses (some parents
#' # have missing phenotypic data)
#' out <- pop.predict2(G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex, 
#'                     parents = y.in_ex$Entry[1:5])
#'                     
#' # Make predictions for 5 crosses with various levels of inbreeding
#' out_list <- lapply(X = 1:10, FUN = function(self.gen) {
#'   out <- pop.predict2(G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex, 
#'                       crossing.table = cross.tab_ex[1:5,], self.gen = self.gen)
#'   out$self.gen <- self.gen
#'   out })
#'                
#' # Plot predictions of grain yield genetic variance over levels of inbreeding
#' dat <- do.call("rbind", lapply(out_list, subset, trait == "Yield"))
#' plot(pred_varG ~ self.gen, data = dat, type = "b", 
#'      subset = parent1 == parent1[1] & parent2 == parent2[1])
#' 
#' }
#'
#' @references
#'
#' Zhong, S., and J.-L. Jannink, 2007 Using quantitative trait loci results to discriminate among crosses on the basis of their
#' progeny mean and variance. Genetics 177: 567–576. https://doi.org/10.1534/ genetics.107.075358
#'
#' Allier, A., L. Moreau, A. Charcosset, S. Teyssèdre, and C. Lehermeier, 2019 Usefulness Criterion and Post-selection Parental
#' Contributions in Multi-parental Crosses: Application to Polygenic Trait Introgression. G3 9: 1469–1479.
#' doi: 10.1534/g3.119.400129
#'
#' Neyhart, J.L., A.J. Lorenz, and K.P. Smith, 2019 Multi-trait Improvement by Predicting Genetic Correlations in Breeding
#' Crosses. G3 9: 3153-3165. doi: 10.1534/g3.119.400406
#'
#'
#' @importFrom qtl mf.h
#' @importFrom rrBLUP mixed.solve
#'
#' @export
#'
pop.predict2 <- function(G.in, y.in, map.in, crossing.table, parents, tail.p = 0.1, self.gen = Inf, DH = FALSE,
                         models = c("rrBLUP", "BayesA", "BayesB","BayesC", "BL", "BRR"), ...) {
  
  
  ## Check classes
  stopifnot(is.data.frame(G.in))
  stopifnot(is.data.frame(y.in))
  stopifnot(is.data.frame(map.in))
  
  # Check the number of markers in the map and the G.in objects
  if (ncol(G.in) - 1 != nrow(map.in))
    stop("The number of markers in G.in does not match the markers in map.in")
  
  ## Make sure there is no missing marker data
  if (any(is.na(G.in))) stop ("There must be no missing marker data.")
  
  # Self gen cannot be negative
  stopifnot(self.gen >= 0)
  
  # DH must be logical
  stopifnot(is.logical(DH))
  
  ## Make sure markers in G.in are in map.in
  markers_Gin <- as.character(unlist(G.in[1,-1, drop = TRUE]))
  markers_mapin <- map.in[,1, drop = TRUE]
  
  if (any(! markers_Gin %in% markers_mapin)) stop("The marker names in G.in are not all in map.in.")
  
  
  ## Filter map.in to remove markers with unknown position
  map.in[[1]] <- as.character(map.in[[1]])
  map.in[[3]] <- as.numeric(map.in[[3]])
  map.in_use <- subset(map.in, !is.na(map.in[[3]]))
  # Reorder based on chrom and then pos
  map.in_use <- map.in_use[order(map.in_use[[2]], map.in_use[[3]]), , drop = FALSE]
  
  # Get the names of the markers in the new map
  markers_mapin <- as.character(map.in_use[[1]])
  
  # Match arguments
  models <- match.arg(models)
  
  ## Subset G.in for markers in map.in_use
  G.in_use <- G.in[, c(1, which(markers_Gin %in% markers_mapin) + 1), drop = FALSE]
  
  
  
  # If the crossing table is not missing, check that the parents are in the G.in input
  if (!missing(crossing.table)) {
    parents <- unique(unlist(crossing.table))
    
    # Make sure the parent names are not factors
    crossing.table <- as.data.frame(sapply(X = crossing.table, as.character), stringsAsFactors = FALSE)
    
  } else {
    if (missing(parents))
      stop("If no crossing.table is provided, a list of parents must be supplied.")
    
    parents <- sort(parents)
    # Create a crossing table with all possible parent combinations
    crossing.table <- as.data.frame(t(combn(x = parents, m = 2)), stringsAsFactors = FALSE)
    names(crossing.table) <- c("parent1", "parent2")
    
  }
  
  if (any(!parents %in% G.in_use[,1,drop = T]))
    stop("Parents are not in G.in.")
  
  ## If self.gen is Inf and DH is T, error
  if (is.infinite(self.gen) & DH) stop("Infinite selfing generations and doubled-haploid production cannot both occur.")
  
  ## Set the factors of line names in y.in to those in the marker df and those in the y.in
  lines_G.in <- as.character(G.in_use[-1,1, drop = TRUE])
  # The levels of the y.in_use geno factor should be the entry names in G.in
  y.in[[1]] <- factor(x = y.in[[1]], levels = sort(lines_G.in))
  # Subset out NAs
  y.in_use <- subset(y.in, !is.na(y.in[[1]]))
  # Reorder lines
  y.in_use <- y.in_use[order(y.in_use[[1]]),, drop = FALSE]
  
  
  
  ## Convert the marker matrix into a useable matrix form
  G.in_pred <- sapply(X = G.in_use[-1, -1], function(x) as.numeric(as.character(x)))
  row.names(G.in_pred) <- as.character(G.in_use[-1,1])
  colnames(G.in_pred) <- as.character(unlist(G.in_use[1,-1]))
  
  # Reorder markers and lines
  M <- G.in_pred[order(row.names(G.in_pred)), markers_mapin]
  
  
  ## Pass data to pop_predict2 ##
  cross_predictions2 <- pop_predict2(M = M, y.in = y.in_use, map.in = map.in_use,
                                     crossing.table = crossing.table, tail.p = tail.p, self.gen = self.gen,
                                     DH = DH, models = models, ... = ...)
  
  
  ## Return the predictions
  return(cross_predictions2)
  
} # Close function




#' @describeIn pop.predict2
#'
#'
#' @param M A Matrix of marker genotypes of dimensions \code{nLine} x \code{nMarker}, coded as
#' -1, 0, and 1.
#' @param marker.effects A data frame of marker effects. The first column should include the marker name and
#' subsequent columns should include the marker effects. Supercedes \code{y.in} if passed.
#'
#' @examples
#' 
#' # Load data
#' data("think_barley")
#' 
#' # Use example data to make predictions
#' out <- pop_predict2(M = G.in_ex_mat, y.in = y.in_ex, map.in = map.in_ex, 
#'                     crossing.table = cross.tab_ex)
#'                     
#' # Provide a vector of parents to predict all possible crosses (some parents
#' # have missing phenotypic data)
#' out <- pop_predict2(M = G.in_ex_mat, y.in = y.in_ex, map.in = map.in_ex, 
#'                     parents = y.in_ex$Entry[1:10])
#'                     
#' # Make predictions for 5 crosses with various levels of inbreeding
#' out_list <- lapply(X = 1:10, FUN = function(self.gen) {
#'   out <- pop_predict2(M = G.in_ex_mat, y.in = y.in_ex, map.in = map.in_ex, 
#'                       crossing.table = cross.tab_ex[1:5,], self.gen = self.gen)
#'   out$self.gen <- self.gen
#'   out })
#'                
#' # Plot predictions of grain yield genetic variance over levels of inbreeding
#' dat <- do.call("rbind", lapply(out_list, subset, trait == "Yield"))
#' plot(pred_varG ~ self.gen, data = dat, type = "b", 
#'      subset = parent1 == parent1[1] & parent2 == parent2[1])
#'
#' @importFrom qtl mf.h
#' @importFrom rrBLUP mixed.solve
#'
#' @export
#'
pop_predict2 <- function(M, y.in, marker.effects, map.in, crossing.table, parents, tail.p = 0.1,
                         self.gen = Inf, DH = FALSE, models = c("rrBLUP", "BayesA", "BayesB","BayesC", "BL", "BRR"),
                         ...) {
  
  
  ###################
  # Error handling
  
  ## Check classes
  stopifnot(is.matrix(M))
  stopifnot(is.data.frame(map.in))
  
  # Check the number of markers in the map and the G.in objects
  if (ncol(M) != nrow(map.in))
    stop("The number of markers in G.in does not match the markers in map.in")
  
  ## Make sure there is no missing marker data
  if (any(is.na(M))) stop ("There must be no missing marker data.")
  
  # Self gen cannot be negative
  stopifnot(self.gen >= 0)
  # DH must be logical
  stopifnot(is.logical(DH))
  
  ## Make sure markers in G.in are in map.in
  stopifnot(!is.null(colnames(M)))
  stopifnot(!is.null(row.names(M)))
  
  markers_M <- colnames(M)
  markers_mapin <- map.in[[1]]
  
  if (any(! markers_M %in% markers_mapin)) stop("The marker names in M are not all in map.in.")
  
  # Get the names of genotyped entries - sort them
  geno_lines <- sort(row.names(M))
  
  # If the crossing table is not missing, check that the parents are in the G.in input
  if (!missing(crossing.table)) {
    parents <- unique(unlist(crossing.table))
    
    # Make sure the parent names are not factors
    crossing.table <- as.data.frame(sapply(X = crossing.table, as.character), stringsAsFactors = FALSE)
    
  } else {
    if (missing(parents))
      stop("If no crossing.table is provided, a list of parents must be supplied.")
    
    parents <- sort(parents)
    # Create a crossing table with all possible parent combinations
    crossing.table <- as.data.frame(t(combn(x = parents, m = 2)), stringsAsFactors = FALSE)
    names(crossing.table) <- c("parent1", "parent2")
    
  }
  
  
  ## Make sure one of y.in or marker.effects are not missing
  if (missing(y.in) & missing(marker.effects))
    stop("You must pass one of 'y.in' or 'marker.effects.'")
  
  ## Error check depending on what is passed
  if (!missing(marker.effects)) {
    
    # Check markers
    markers_maref <- marker.effects[[1]]
    
    if (! all(markers_M %in% markers_maref) ) stop("The marker names in M are not all in marker.effects")
    
    ## Number of traits and trait names
    n_traits <- ncol(marker.effects) - 1
    trait_names <- colnames(marker.effects)[-1]
    
    # Set boolean for later
    calc_marker_eff <- FALSE
    
    # Else check y.in
  } else {
    
    stopifnot(is.data.frame(y.in))
    
    # All phenotyped lines should be genotyped
    if (! all(y.in[[1]] %in% geno_lines) ) stop("All entries in 'y.in' should have marker data in 'M'.")
    
    ## Set the factors of line names in y.in to those in the marker df and those in the y.in
    y.in[[1]] <- factor(x = y.in[[1]], levels = geno_lines)
    # Subset out NAs
    y.in_use <- subset(y.in, !is.na(y.in[[1]]))
    # Reorder lines
    y.in_use <- y.in_use[order(y.in_use[[1]]),, drop = FALSE]
    
    
    ## Number of traits and trait names
    n_traits <- ncol(y.in_use) - 1
    trait_names <- colnames(y.in_use)[-1]
    
    # Set boolean for later
    calc_marker_eff <- TRUE
    
  }
  
  # Match arguments
  models <- match.arg(models)
  
  # Capture additional arguments
  other.args <- list(...)
  
  
  ####################
  
  
  # Reorder map based on chrom and then pos
  map.in_use <- map.in[order(map.in[[2]], map.in[[3]]), , drop = FALSE]
  
  # Get the names of the markers in the new map
  markers_mapin <- as.character(map.in_use[[1]])
  
  
  if (!all(parents %in% row.names(M))) stop("Parents are not in G.in.")
  
  # Reorder markers
  M1 <- M[geno_lines, markers_mapin, drop = FALSE]
  
  ## Fit models to calculate marker effects, if necessary
  if (calc_marker_eff) {
    
    ## Create a model.frame from the phenotypes; extract the column name of genotypes/lines
    geno_colname <- colnames(y.in_use)[1]
    
    # Calculate marker effects
    marker_effect_out <- do.call("calc_marker_effects", 
                                 c(M = quote(M1), y.df = quote(y.in_use[-1]), models = models, 
                                   other.args[names(other.args) %in% formalArgs("calc_marker_effects")]))
    
    ## Create a complete matrix of marker effects for each trait
    mar_eff_mat <- do.call("cbind", lapply(marker_effect_out, "[[", "effects"))
    mar_eff_mat <- structure(mar_eff_mat, dimnames = list(row.names(mar_eff_mat), names(marker_effect_out)))
    
    mar_beta_mat <- do.call("cbind", lapply(marker_effect_out, "[[", "grand_mean"))
    mar_beta_mat <- structure(mar_beta_mat, dimnames = list(row.names(mar_beta_mat), names(marker_effect_out)))
    
    
    # Else create a matrix of marker ordered marker effects
  } else {
    
    # Create matrix
    mar_eff_mat <- as.matrix(marker.effects[,-1,drop = FALSE])
    row.names(mar_eff_mat) <- marker.effects[[1]]
    mar_eff_mat <- mar_eff_mat[markers_mapin,,drop = FALSE]
    
    # Set the grand mean to zero
    mar_beta_mat <- matrix(0, nrow = 1, ncol = ncol(mar_eff_mat),
                           dimnames = list(NULL, colnames(mar_eff_mat)))
    
  }
  
  
  ## Create an empty matrix
  marker_names <- markers_mapin
  covar <- matrix(0, nrow = nrow(map.in_use), ncol = nrow(map.in_use), dimnames = list(marker_names, marker_names))
  
  # Split markers by chromosome
  map.in.chr <- split(map.in_use, map.in_use[,2, drop = FALSE])
  markers_chr <- lapply(map.in.chr, "[[", 1)
  
  # Calculate separate centimorgan distance matrices per chromosome
  chr_cM <- lapply(X = map.in.chr, FUN = function(x) as.matrix(dist(x[,3,drop = FALSE])))
  # Convert to recombination distance
  chr_c <- lapply(X = chr_cM, FUN = mf.h)
  
  
  
  # Calculate the LD covariance
  
  ## This depends on the number of selfing generations
  ## Formulae are taken from Leheremeir et al 2017
  
  # If DH and self.gen = 0, DH's are formed from the F1
  if (DH & self.gen == 0) {
    chr_covar <- lapply(X = chr_c, FUN = function(c) 1 - (2 * c))
    
  } else if (DH & self.gen > 0) {
    covar_selfing <- lapply(X = chr_c, FUN = function(c) {
      base <- 0.5 * (1 - (2 * c))
      Reduce(f = `+`, x = lapply(X = seq(self.gen), FUN = function(k) base ^ k)) })
    
    covar_final <- lapply(X = chr_c, FUN = function(c) (0.5 * (1 - (2 * c))) ^ self.gen)
    
    # Sum
    chr_covar <- mapply(FUN = `+`, covar_selfing, covar_final)
    
  } else if (!DH & is.finite(self.gen)) {
    chr_covar <- lapply(X = chr_c, FUN = function(c) {
      base <- 0.5 * (1 - (2 * c))
      Reduce(f = `+`, x = lapply(X = seq(self.gen), FUN = function(k) base ^ k)) })
    
  } else if (!DH & is.infinite(self.gen)) {
    chr_covar <- lapply(X = chr_c, FUN = function(c) (1 - (2 * c)) / (1 + (2 * c)))
    
  }
  
  
  ## Predicted genotypic value of all genotypes + grand mean
  pgvs <- (M %*% mar_eff_mat) + matrix(mar_beta_mat, ncol = ncol(mar_eff_mat), nrow = nrow(M), byrow = TRUE)
  
  
  ## Calculate the pairwise product of marker effects for each trait, separated by chromosome
  ## Then multiply by the LD covariance
  intra_trait_covar <- list()
  
  # Iterate over traits
  for (i in seq(ncol(mar_eff_mat))) {
    
    # Split the marker effect matrix by chromosome and calculate pairwise product
    mar_eff_prod_chr <- lapply(X = markers_chr, function(marks) tcrossprod(mar_eff_mat[marks, i, drop = F]))
    
    # The covariance is the QTL effect product multiplied by the expected D
    intra_trait_covar[[trait_names[i]]] <- mapply(mar_eff_prod_chr, chr_covar, FUN = `*`, SIMPLIFY = FALSE)
    
  }
  
  
  if (n_traits > 1) {
    
    ## Calculate the pairwise product of marker effects for each pair of traits
    # First create a vector of trait index combinations
    trait_ind_combn <- t(combn(x = seq(n_traits), m = 2, simplify = TRUE))
    trait_combn_name <- combn(x = trait_names, m = 2, paste0, collapse = ":")
    
    # Empty matrix for correlations
    trait_corG_mat <- matrix(data = NA, nrow = n_traits, ncol = n_traits, dimnames = list(trait_names, paste0("cor_w_", trait_names)))
    # Convert to a distance version for easy addition
    trait_corG_dist <- as.dist(trait_corG_mat)
    
    # List of inter-trait covariances
    inter_trait_covar <- list()
    
    # Iterate over trait indices combinations
    for (i in  seq(nrow(trait_ind_combn))) {
      
      # Get the marker effects for that combination
      mar_eff_combn <- mar_eff_mat[,trait_ind_combn[i,], drop = FALSE]
      
      # Calculate the pairwise product by chromosome
      mar_eff_prod_chr <- lapply(X = markers_chr, function(marks) tcrossprod(mar_eff_combn[marks,1,drop = FALSE], mar_eff_combn[marks,2,drop = FALSE]))
      
      # The covariance is the QTL effect product multiplied by the expected D
      inter_trait_covar[[trait_combn_name[i]]] <- mapply(mar_eff_prod_chr, chr_covar, FUN = `*`, SIMPLIFY = FALSE)
      
    }
    
  } else {
    inter_trait_covar <- NULL
    
  }
  
  
  # Determine the k_sp from the the tail.p
  k_sp <- dnorm(x = qnorm(p = 1 - tail.p)) / tail.p
  
  
  ## Verify that all parents are homozygous for all markers
  parents_M <- M1[parents,,drop = FALSE]
  
  # Make sure the markers are coded correctly
  if (!all(parents_M %in% c(-1, 1)))
    stop("This method assumes fully inbred parents. Therefore, marker genotypes other than -1 and 1 are not supported. Please
    edit the marker data and try again.")
  
  ## Create a list to store dfs
  cross_predictions <- vector("list", length = nrow(crossing.table))
  
  ## Iterate over the pairs of parents
  for (j in seq(nrow(crossing.table))) {
    
    # Character vector of the two parents
    pars <- as.character(crossing.table[j,1:2])
    # Cross mean prediction
    pred_mu_j <- colMeans(pgvs[pars,,drop = FALSE])
    # Genetic variance - use the pred_mu_j vector as a template
    pred_varG_j <- pred_mu_j
    
    ## Subset the genotype matrix using the parents
    par_geno <- M1[pars,,drop = FALSE]
    
    # Which markers are segregating?
    mar_seg <- names(which(colMeans(par_geno) == 0))
    # If no markers are segregating, the variance is 0
    if (length(mar_seg) == 0) {
      pred_varG_j[] <- 0
      pred_corG_mat <- if (n_traits > 1) trait_corG_mat else NULL
      pred_cor_musp_low <- pred_cor_musp_high <- if (n_traits > 1) pred_mu_j else NULL
      
    } else {
      
      # Split by chromsome
      mar_seg_chr <- lapply(markers_chr, intersect, mar_seg)
      
      # Find the parent 1 genotype of those markers and take the crossproduct for multiplication
      par1_mar_seg <- crossprod(par_geno[1,mar_seg, drop = FALSE])
      # Split by chromosome
      par1_mar_seg_chr <- lapply(mar_seg_chr, function(marks) par1_mar_seg[marks, marks, drop = FALSE])
      
      
      ## Predictions
      
      # Iterate over traits
      # Note that the QTL covariance matrix includes the variance of each QTL on the diagonal, so the sum of the matrix
      # is the variance + 2 * covariance
      for (i in seq(n_traits)) {
        pred_varG_j[i] <- sum(mapply(par1_mar_seg_chr, intra_trait_covar[[i]], FUN = function(x, y)
          sum(x * y[colnames(x), colnames(x)])))
      }
      
      
      # Genetic correlations between traits, if more than one trait
      if (!is.null(inter_trait_covar)) {
        
        ## Iterate over trait combinations
        for (i in seq_along(inter_trait_covar)) {
          
          ## Calculate the covariance between the pair of traits
          trait_pair_cov <- sum(mapply(par1_mar_seg_chr, inter_trait_covar[[i]], FUN = function(x, y) sum(x * y[colnames(x), colnames(x)])))
          
          # Subset predicted variance
          trait_pair_varG <- pred_varG_j[trait_ind_combn[i,]]
          
          # Calculate correlation and save
          trait_corG_dist[i] <- trait_pair_cov / prod(sqrt(trait_pair_varG))
          
        }
        
        ## Convert distance object to matrix
        pred_corG_mat <- as.matrix(trait_corG_dist)
        dimnames(pred_corG_mat) <- dimnames(trait_corG_mat)
        diag(pred_corG_mat) <- NA
        
        ## Calculate correlated progeny mean
        response_trait_varG <- matrix(pred_varG_j, nrow = length(pred_varG_j), ncol = length(pred_varG_j), byrow = TRUE)
        correlated_response <- k_sp * pred_corG_mat * sqrt(response_trait_varG)
        pred_mu_j_mat <- matrix(pred_mu_j, nrow = length(pred_mu_j), ncol = length(pred_mu_j), byrow = TRUE)
        pred_cor_musp_low <- pred_mu_j_mat - correlated_response
        pred_cor_musp_high <- pred_mu_j_mat + correlated_response
        
        # Change names
        colnames(pred_cor_musp_low) <- paste0("pred_cor_musp_low_", trait_names)
        colnames(pred_cor_musp_high) <- paste0("pred_cor_musp_high_", trait_names)
        
      } else {
        pred_corG_mat <- pred_cor_musp_low <- pred_cor_musp_high <- NULL
        
      }
      
    }
    
    
    ## Save the results as a data.frame
    cross_predictions[[j]] <- data.frame(parent1 = pars[1], parent2 = pars[2], trait = trait_names,
                                         cbind(pred_mu = pred_mu_j, pred_varG = pred_varG_j, pred_corG_mat, pred_cor_musp_low, 
                                               pred_cor_musp_high),
                                         stringsAsFactors = FALSE, row.names = NULL)
    
  } # End loop
  
  ## Bind rows
  cross_predictions1 <- do.call("rbind", cross_predictions)
  
  ## Calculate response predictions (superior progeny, correlated response, etc.)
  pred_response <- (k_sp * sqrt(cross_predictions1$pred_varG))
  
  # Superior progeny mean
  cross_predictions1[["pred_musp_low"]] <- cross_predictions1$pred_mu - pred_response
  cross_predictions1[["pred_musp_high"]] <- cross_predictions1$pred_mu + pred_response
  
  ## Re-order columns
  cor_W_cols <- grep(pattern = paste0(trait_names, collapse = "|"), x = names(cross_predictions1))
  cross_predictions2 <- cross_predictions1[, c(setdiff(seq(ncol(cross_predictions1)), cor_W_cols), cor_W_cols)]
  
  ## Return the predictions
  return(cross_predictions2)
  
} # Close function

Try the PopVar package in your browser

Any scripts or data that you put into this service are public.

PopVar documentation built on Feb. 8, 2021, 1:06 a.m.