R/auto_locus.R
In HaploCatcher: A Predictive Haplotyping Package

Documented in auto_locus

#' Auto Locus - An Automated Pipeline for Locus Prediction
#'
#' This function takes all other functions of the HaploCatcher package and weaves them into one pipeline that performs locus prediction with minimal user intervention.
#'
#' @param geno_mat An imputed, number-coded, genotypic matrix which has n rows of individuals and m columns of markers. Row names of the matrix should be representative of genotypic IDs and column names should be representative of marker IDs. Missing data is not allowed. Numeric coding of genotypes can vary as long as it remains consistant among markers.
#' @param gene_file A dataframe containing at least three columns labeled as follows: 'Gene', 'FullSampleName', and 'Call'. The 'Gene' column contains the name of the gene for which the observation belongs to. The 'FullSampleName' column contains the genotypic ID which corresponds exactly to the column name in the genotypic matrix. The 'Call' column contains the marker call which corresponds to the gene for that genotype. Other information may be present in this dataframe beyond these columns, but the three listed columns above are obligatory.
#' @param gene_name A character string which matches the name of the gene which you are trying to perform cross validation for. This character string must be present in your gene_file 'Gene' column.
#' @param marker_info A dataframe containing the following three columns: 'Marker', 'Chromosome', and 'BP_Position'. The 'Marker' column contains the names of the marker which are present in the genotypic matrix. The 'Chromosome' column contains the corresponding chromosome/linkage group to which the marker belongs. The 'Position' column contains the physical or centimorgan position of the marker. All markers present in the genotypic matrix must be listed in this dataframe. If physical or centimorgan positions are unavailable for the listed markers, a numeric dummy variable ranging from one to n number of markers may be provided instead.
#' @param chromosome  A character string which matches the name of the chromosome upon which the gene resides. This chromosome name must be present in the marker_info file.
#' @param training_genotypes A character vector of fullsamplenames found in the geno_mat and gene_file. These are lines which will be used in cross validation and to train the model which will predict the lines which have no locus call available.
#' @param testing_genotypes A character vector of fullsamplenames found in the geno_mat. These are lines which will have predictions made for them.
#' @param ncor_markers A numeric variable which represents the number of markers the user want to use in model training. Correlation among markers to the gene call is calculated and the top n markers specified are retained for training. The default setting is 50 markers.
#' @param n_neighbors A numeric variable which represents the number of neighbors to use in KNN. Default is 50.
#' @param cv_percent_testing A numeric variable which ranges such that x|0<x<1. This means that this number can be neither zero nor one. This number represents the percent of the total data available the user wants to retain to validate the model. The default setting is 0.20.
#' @param cv_percent_training A numeric variable which ranges such that x|0<x<1. This means that the number can be neither zero nor one. This number represents the percent of the total data available the user wants to retain for training of the model.The default setting is 0.80.
#' @param n_perms A numeric variable defining the number of permutations to perform. This value may range from one to infinity. Default is 30.
#' @param model_selection_parameter A character string which defines which model performance parameter the user wants to use to select the best performing model. Only two strings are accepted either "kappa" for selection on Cohen's kappa value or "accuracy" for selection on total accuracy. Default is "kappa".
#' @param n_votes A numeric variable defining the number of models to train and predict with. This method will take n number of predictions over multiple iterations, sum across those multiple iterations and provide a prediction which is based on majority rule. The default is 30.
#' @param set_seed A numeric variable that is used to set a seed for reproducible results if the user is running the "locus_trait" function once for use in the "locus_pred" function. If the user wishes to run the function many times with a random seed and decide the outcome by voting, set 'predict_by_vote' to TRUE instead. The default setting is NULL.
#' @param predict_by_vote A logical variable which defines if the user wants predictions generated by voting over several runs of machine models summarized into a majority rule call. Default is FALSE.
#' @param include_hets A logical variable which determines if the user wishes to include heterozygous calls or not. The default setting is FALSE.
#' @param include_models A logical variable which determines if the user wishes to include the trained models in the results object for further testing. Warning: the models are quite large and running this will result in a very large results object. The default setting is FALSE.
#' @param verbose A logical variable which determines if the user wants plots displayed and text feedback from each permutation. Regardless of this parameter, the function will display the name of the gene which is being cross validated and the current progress of the permutations. Default setting is TRUE.
#' @param parallel A logical variable which determines if the user wants the cross validation and voting predictions to be performed in parallel. Default is FALSE. If the user defines that parallel is TRUE, all visual and textual feedback will not be rendered.
#' @param n_cores A numeric variable that is used to set a seed for reproducible results if the user is running the function once for use in the "locus_pred" function. If the user wishes to run the function many times with a random seed and decide the outcome by voting, set 'predict_by_vote' to true instead. The default setting is NULL.
#' @param plot_cv_results A logical variable which determines if the user want the cross validation results to be printed as a 'ggplot' image. Default is TRUE
#'
#' @return This function returns two types of list-of-list objects. If 'predict_by_vote' is set to FALSE (the default setting), then the list will contain four objects: method, cross_validation_results, prediction_model, and predictions. The method object is a character which denotes what method was used. The cross_validation_results object contains the results of the cross-validation. The prediction model contains the information about the prediction model, and predictions contains the predictions for the testing_genotypes. If 'predict_by_vote' is set to TRUE, then the list will contain four objects: method, cross_validation_results, predictions, and consensus_predictions. Both method and cross_validation_results are the same for this option. Predictions contains the prediction of each model ran to vote. The object consensus_predictions contains a consensus prediction of the alleleic state of a genotype based on majority rule and a tablular summary of how many votes were taken per catigory.
#'
#' @export
#'
#' @examples
#'
#' #refer to vignette for an in depth look at the auto_locus function
#' vignette("An_Intro_to_HaploCatcher", package = "HaploCatcher")
#'
#' @importFrom randomForest randomForest
#' @importFrom lattice qq
#' @importFrom ggplot2 ggplot
#' @importFrom caret train
#' @importFrom caret knn3
#' @importFrom foreach %dopar%


auto_locus<-function(geno_mat, #the genotypic matrix
                     gene_file, #the gene compendium file
                     gene_name, #the name of the gene
                     marker_info, #the marker information file
                     chromosome, #name of the chromosome
                     training_genotypes, #vector of genotypes for training
                     testing_genotypes, #vector of genotypes for testing
                     ncor_markers=50,#number of markers to retain
                     n_neighbors=50, #number of neighbors
                     cv_percent_testing=0.2, #percentage of genotypes in the validation set
                     cv_percent_training=0.8, #percentage of genotypes in the training set
                     n_perms=30, #number of permutation
                     model_selection_parameter="kappa", #method for CV model selection
                     n_votes=30,#number of votes
                     set_seed=NULL, #set the seed for reproduction of results
                     predict_by_vote=FALSE, #give predictions by voting
                     include_hets=FALSE, #include hets in the model
                     include_models=FALSE, #include models of cross validation set
                     verbose=TRUE, #produce text output
                     parallel=FALSE, #run cv in parallel
                     n_cores=NULL, #number of cores for parallel
                     plot_cv_results=TRUE){ #plot the results of the cross-validation

  #announce
  if(verbose==TRUE){
    #print
    base::print(base::paste("Performing cross validation analysis for ", gene_name, "...", sep = ""))
  }

  #check and stop
  if(predict_by_vote==FALSE & is.null(set_seed)){
    stop("User must define a seed to proceed with pipeline without voting! Either set 'predict_by_vote' to TRUE or set 'set_seed' to a random intiger!" )
  }

  #pull training population out of the data
  train_geno_mat<-geno_mat[rownames(geno_mat) %in% training_genotypes,]
  train_gene_file<-gene_file[gene_file$FullSampleName %in% training_genotypes,]

  #first perform CV
  fit_cv<-HaploCatcher::locus_perm_cv(n_perms = n_perms, #the number of permutations
                                      geno_mat = train_geno_mat, #the genotypic matrix
                                      gene_file = train_gene_file, #the gene compendium file
                                      gene_name = gene_name, #the name of the gene
                                      marker_info = marker_info, #the marker information file
                                      chromosome = chromosome, #name of the chromosome
                                      ncor_markers = ncor_markers, #number of markers to retain
                                      n_neighbors = n_neighbors, #number of neighbors
                                      percent_testing = cv_percent_testing, #percentage of genotypes in the validation set
                                      percent_training = cv_percent_training, #percentage of genotypes in the training set
                                      include_hets = include_hets, #excludes hets in the model
                                      include_models = include_models, #excludes models in results object
                                      verbose = verbose, #excludes text
                                      parallel = parallel, #should it be in parallel
                                      n_cores = n_cores) #if so, how many cores

  #announce
  if(verbose==TRUE){
    #print
    base::print("Done!")
  }

  #plot results
  if(plot_cv_results==TRUE){
    suppressWarnings(HaploCatcher::plot_locus_perm_cv(fit_cv))
  }

  #check the model selection parameter
  if(!model_selection_parameter=="kappa" & !model_selection_parameter=="accuracy"){
    warning("Neither kappa nor accuracy were set as model selection parameters; defaulting to kappa.")
    model_selection_parameter="kappa"
  }

  #pull out the optimal model
  if(model_selection_parameter=="kappa"){

    #pull model
    model_selected<-fit_cv$Overall_Summary[fit_cv$Overall_Summary$Mean_Kappa==max(fit_cv$Overall_Summary$Mean_Kappa),"Model"]

    if(length(model_selected)>1){
      message("Note: There was a tie between models in terms of kappa, selecting model at random...")
      model_selected=data.frame(a=c("knn", "rf"),
                                rand=stats::rnorm(2, mean = 10000000, sd=30000))
      model_selected=as.character(model_selected[model_selected$rand==max(model_selected$rand), 1])
    }else if(model_selected=="Random Forest"){
      model_selected="rf"
    }else if(model_selected=="K-Nearest Neighbors"){
      model_selected="knn"
    }else{
      message("Note: There was a tie between models in terms of kappa, selecting model at random...")
      model_selected=data.frame(a=c("knn", "rf"),
                                rand=stats::rnorm(2, mean = 10000000, sd=30000))
      model_selected=as.character(model_selected[model_selected$rand==max(model_selected$rand), 1])
    }

    }else if (model_selection_parameter=="accuracy"){

    #pull model
    model_selected<-fit_cv$Overall_Summary[fit_cv$Overall_Summary$Mean_Accuracy==max(fit_cv$Overall_Summary$Mean_Accuracy),"Model"]

    #change name
    if(length(model_selected)>1){
      message("Note: There was a tie between models in terms of kappa, selecting model at random...")
      model_selected=data.frame(a=c("knn", "rf"),
                                rand=stats::rnorm(2, mean = 10000000, sd=30000))
      model_selected=as.character(model_selected[model_selected$rand==max(model_selected$rand), 1])
    }else if (model_selected=="Random Forest"){
      model_selected="rf"
    }else if(model_selected=="K-Nearest Neighbors"){
      model_selected="knn"
    }else{
      message("Note: There was a tie between models in terms of accuracy, selecting model at random...")
      model_selected=data.frame(a=c("knn", "rf"),
                                rand=stats::rnorm(2, mean = 10000000, sd=30000))
      model_selected=as.character(model_selected[model_selected$rand==max(model_selected$rand), 1])
    }
  }

  if(verbose==TRUE){print(paste("The best performing model in cross-validation in terms of ",
                                model_selection_parameter,
                                " was ",
                                ifelse(model_selected=="knn",
                                       "k-nearest neighbors",
                                       "random forest"),
                                ". Moving onto forward prediction!",
                                sep=""))}

  if(predict_by_vote==TRUE){

    if(parallel==TRUE){

      #set cores
      if(is.null(n_cores)){
        n_cores=parallel::detectCores()-1
      }

      #assign cluster
      cluster<-parallel::makeCluster(n_cores)
      doParallel::registerDoParallel(cluster)

      #announce
      if(verbose==TRUE){
        print("Running permutations in parallel! All text output suppressed!")
      }

      #run parallel loop
      results<-foreach::foreach(i=1:n_votes) %dopar% {

        #train model
        fit<-HaploCatcher::locus_train(geno_mat = train_geno_mat, #the genotypic matrix
                                       gene_file = train_gene_file, #the gene compendium file
                                       gene_name = gene_name, #the name of the gene
                                       marker_info = marker_info, #the marker information file
                                       chromosome = chromosome, #name of the chromosome
                                       ncor_markers = ncor_markers,#number of markers to retain
                                       n_neighbors = n_neighbors, #number of neighbors
                                       include_hets = include_hets, #include hets in the model
                                       verbose = verbose, #allows for text and graph output
                                       set_seed = set_seed, #sets a seed for reproduction of results
                                       models_request = model_selected) #sets what models are requested
        #make a prediction
        pred<-HaploCatcher::locus_pred(locus_train_results=fit,
                                       geno_mat=geno_mat,
                                       genotypes_to_predict=testing_genotypes)
        return(pred)

      }

      #stop the cluster
      parallel::stopCluster(cluster)

      #re-register sequential
      foreach::registerDoSEQ()

      a<-data.frame(FullSampleName=results[[1]]$FullSampleName)

      for(i in 1:n_votes){

        a[,paste("Vote_",i,sep = "")]=results[[i]][,2]

      }

      b<-c()

      for(i in 1:nrow(a)){
        c<-as.data.frame(table(t(a[i,2:ncol(a)])))
        d<-c
        e<-levels(a[,2])
        e<-data.frame(Var1=e)
        d<-merge(d,e, all=T)
        d$Freq=ifelse(is.na(d$Freq), 0,d$Freq)
        d<-t(d)
        colnames(d)=d[1,]
        d<-d[-1,]
        d<-as.data.frame(t(d))
        rownames(d)=NULL
        c<-data.frame(FullSampleName=a[i,1],
                      Consensus_Call=ifelse(length(c[c$Freq==max(c$Freq), "Var1"])>1, "No_Call", as.character(c[c$Freq==max(c$Freq), "Var1"])),
                      d)
        b<-rbind(b,c)
        remove(c,d,e)
      }

      #make results object
      return_results<-list(method="multiple models - majority rule",
                           cross_validation_results=fit_cv,
                           predictions=a,
                           consensus_predictions=b)
      return(return_results)

    }else if(parallel==FALSE){

      results<-list()

      for(i in 1:n_votes){
        #train model
        fit<-HaploCatcher::locus_train(geno_mat = train_geno_mat, #the genotypic matrix
                                       gene_file = train_gene_file, #the gene compendium file
                                       gene_name = gene_name, #the name of the gene
                                       marker_info = marker_info, #the marker information file
                                       chromosome = chromosome, #name of the chromosome
                                       ncor_markers = ncor_markers, #number of markers to retain
                                       n_neighbors = n_neighbors, #number of neighbors
                                       include_hets = include_hets, #include hets in the model
                                       verbose = verbose, #allows for text and graph output
                                       set_seed = set_seed, #sets a seed for reproduction of results
                                       models_request = model_selected) #sets what models are requested
        #make a prediction
        pred<-HaploCatcher::locus_pred(locus_train_results=fit,
                                       geno_mat=geno_mat,
                                       genotypes_to_predict=testing_genotypes)
        results[[i]]<-pred
      }

      a<-data.frame(FullSampleName=results[[1]]$FullSampleName)

      for(i in 1:n_votes){

        a[,paste("Vote_",i,sep = "")]=results[[i]][,2]

      }

      b<-c()

      for(i in 1:nrow(a)){
        c<-as.data.frame(table(t(a[i,2:ncol(a)])))
        d<-c
        e<-levels(a[,2])
        e<-data.frame(Var1=e)
        d<-merge(d,e, all=T)
        d$Freq=ifelse(is.na(d$Freq), 0,d$Freq)
        d<-t(d)
        colnames(d)=d[1,]
        d<-d[-1,]
        d<-as.data.frame(t(d))
        rownames(d)=NULL
        c<-data.frame(FullSampleName=a[i,1],
                      Consensus_Call=ifelse(length(c[c$Freq==max(c$Freq), "Var1"])>1, "No_Call", as.character(c[c$Freq==max(c$Freq), "Var1"])),
                      d)
        b<-rbind(b,c)
        remove(c,d,e)
      }

      #make results object
      return_results<-list(method="multiple models - majority rule",
                           cross_validation_results=fit_cv,
                           predictions=a,
                           consensus_predictions=b)

      return(return_results)
    }

  }else if(predict_by_vote==FALSE){

    #train model
    fit<-HaploCatcher::locus_train(geno_mat = train_geno_mat, #the genotypic matrix
                                   gene_file = train_gene_file, #the gene compendium file
                                   gene_name = gene_name, #the name of the gene
                                   marker_info = marker_info, #the marker information file
                                   chromosome = chromosome, #name of the chromosome
                                   ncor_markers = ncor_markers, #number of markers to retain
                                   n_neighbors = n_neighbors, #number of neighbors
                                   include_hets = include_hets, #include hets in the model
                                   verbose = verbose, #allows for text and graph output
                                   set_seed = set_seed, #sets a seed for reproduction of results
                                   models_request = model_selected) #sets what models are requested
    #make a prediction
    pred<-HaploCatcher::locus_pred(locus_train_results=fit,
                                   geno_mat=geno_mat,
                                   genotypes_to_predict=testing_genotypes)

    return_results<-list(method="single model - single prediction",
                         cross_validation_results=fit_cv,
                         prediction_model=fit,
                         predictions=pred)

    return(return_results)

  }else{

    stop("The function parameter 'predict_by_vote' must be a logical argument (TRUE/FALSE)!")

  }
}