# R/PStesting.R

#' PS testing
#' @description  Calculate PS (Prediction Strength) scores and classification for a testing data set
#'  with a given PS training object. The selected features with their weights, and the two classes'
#'  PS score distribution parameters, are extracted from the given PS training object.
#' @details This function calculates PS scores and makes classification calls for a new testing data set,
#'  which should be as comparable to the training data set as possible.
#'  Because standardization is a required step in the PS algorithm (Golub et al., 1999),
#'  it is built into this function and cannot be switched off.
#'  However, this standardization only makes the distribution of each selected feature comparable.
#'  Be aware that such feature-wise standardization cannot make the sample-wise distributions comparable.
#'  For example, the training data set must contain two classification groups, but in the testing data set
#'  the proportion of one group might be much smaller than in the training data set, or, even worse,
#'  the testing data set might contain only one of the groups.
#'  This is a common problem in classification, and feature-wise standardization cannot solve it.
#'  To address it, the data should be made as comparable as possible before the classification step,
#'  for example by using the same pre-processing settings and applying suitable batch effect correction.
#'  For classification with the PS approach, we also suggest combining the training and testing data
#'  together as "newdat" for this PStesting function (as illustrated under Examples), to avoid forcing
#'  a two-group classification when the testing data actually contain only one group.
#'  
#'  The PS score calculation formula is:
#'  \eqn{PS = (V_win - V_lose)/(V_win + V_lose)}
#'  where V_win and V_lose are the vote totals for the winning and losing features/traits for a given sample.
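#'  For instance, if a sample's vote totals were V_win = 40 and V_lose = 10, then PS = (40 - 10)/(40 + 10) = 0.6.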
#'  
#' @param PStrainObj a PS training object, which is the output from function PStraining
#' @param newdat a new data matrix or data frame, which should be comparable to the training data set,
#'  with columns for samples and rows for features
#' @param classProbCut a numeric value within (0,1), used as the cutoff on the empirical Bayesian probabilities;
#'  commonly used values are 0.8 and 0.9, and the default is 0.9. The same cutoff is applied to both groups,
#'  and samples not assigned to either group are labelled UNCLASS
#' @param imputeNA a logical variable indicating whether NA imputation is needed; if TRUE, NA imputation is
#'  performed before any other step. The default is FALSE
#' @param byrow a logical variable indicating the direction of imputation; the default is TRUE,
#'  which imputes using the row (feature) data
#' @param imputeValue a character variable indicating which value is used to replace NAs, either "median" or "mean";
#'  the default is "median", taken along the direction specified by "byrow"
#' @return A data frame with the PS score (PS_score), the probability-based classification (PS_class, possibly UNCLASS),
#'  the two empirical Bayesian probabilities (PS_prob_test, PS_prob_ref), and the classification based on the
#'  theoretical 0 cutoff (PS_class0)
#' @keywords PS
#' @author Aixiang Jiang
#' @references
#' Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: 
#' class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–7
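#' @examples
#' # A minimal usage sketch (not run). "trainObj" is assumed to be the output of PStraining
#' # (see that function's documentation for its arguments), and "trainDat"/"testDat" are
#' # hypothetical data matrices with samples in columns, features in rows, and the same feature rows.
#' \dontrun{
#' combinedDat = cbind(trainDat, testDat)  # combine training and testing data, as suggested in Details
#' res = PStesting(PStrainObj = trainObj, newdat = combinedDat, classProbCut = 0.9)
#' head(res)  # PS_score, PS_class, PS_prob_test, PS_prob_ref, PS_class0
#' }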
#' @export

PStesting = function(PStrainObj, newdat, classProbCut = 0.9, imputeNA = FALSE, byrow = TRUE, imputeValue = c("median","mean")){

  if(is.null(PStrainObj)){stop("Please input your PS training object")}
  PS_pars = PStrainObj$PS_pars
  
  imputeValue = imputeValue[1]
  
  ## impute NA if imputeNA is true
  if(imputeNA){
    newdat = imputeNAs(dataIn = newdat, byrow = byrow, imputeValue = imputeValue)
  }
  
  ### standardize data first
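  ### (feature-wise standardization, so each selected feature's distribution is comparable; see Details)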
  newdat = standardize(newdat)
  
  ### 20190905, since I changed PS_pars in the output of PStraining, I should make changes accordingly here
  parin = cbind(PS_pars$traits[,1], PS_pars$weights[,1])
  
  # get PS scores for all samples
  PS_score = apply(newdat[rownames(PS_pars$weights),], 2, getPS1sample, PSpars = parin)
  
  testGroup = PStrainObj$classCompare$positive
  
  ### 20190912, change the following line since training object is changed
  refGroup = setdiff(unique(PStrainObj$PS_train$true_class),testGroup)
  
  # for PS, 0 is a natural cutoff for two group classification
  PS_class0 = ifelse(PS_score >= 0, testGroup, refGroup)
  
  ### 20190905, add more to be consistent with other functions
  #### use the theoretical 0 cutoff to get two provisional groups, which are used for empirical Bayesian prob calculation
  ttmp = PS_score[which(PS_score >= 0)]
  rtmp = PS_score[which(PS_score < 0)]
  testPSmean = mean(ttmp)
  refPSmean = mean(rtmp)
  testPSsd = sd(ttmp)
  refPSsd = sd(rtmp)
  
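  # empirical Bayesian probability of belonging to each group; getProb is assumed to use the
  # provisional group means and SDs above as the parameters of the two PS score distributions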
  PS_prob_test = getProb(PS_score, groupMeans = c(testPSmean, refPSmean), groupSds = c(testPSsd, refPSsd))
  PS_prob_ref = getProb(PS_score, groupMeans = c(refPSmean, testPSmean), groupSds = c(refPSsd, testPSsd))
  
  PS_class = rep("UNCLASS",length(PS_score))
  PS_class[which(PS_prob_test >= classProbCut)] = testGroup
  PS_class[which(PS_prob_ref >= classProbCut)] = refGroup
  
  PS_score = data.frame(PS_score)
  PS_test = cbind(PS_score, PS_class, PS_prob_test, PS_prob_ref, PS_class0, stringsAsFactors = FALSE)
  
  return(PS_test)
  
}
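
## A quick sanity check on the output (sketch): cross-tabulate the probability-based call
## (PS_class, which may contain UNCLASS) against the theoretical 0-cutoff call (PS_class0).
## "res" here is a hypothetical result from PStesting, as in the example above.
# res = PStesting(PStrainObj = trainObj, newdat = combinedDat)
# table(res$PS_class, res$PS_class0)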