R/ldaPlus.R

Defines functions ldaPlus

Documented in ldaPlus

#' Linear discriminant analysis
#'
#' @description The function performs a linear discriminant analysis (by using the \code{MASS::lda} function).
#' Compared to the \code{MASS::lda} function, the \code{ldaPlus}
#' function enable to consider the prior probabilities to predict the values of a categorical variable, it
#' provides with predicted values and with (Jack-knife) classification table and also with statistical test of canonical correlations
#' between the variable that represents groups and numeric variables.
#' @param x A data frame with values of numeric variables.
#' @param grouping Categorical variable that defines groups.
#' @param pred Whether to return the predicted values based on the model. Default is \code{TRUE}.
#' @param CV Whether to do cross-validation in addition to "ordinary" analysis, default is \code{TRUE}.
#' @param usePriorBetweenGroups Whether to use prior probabilities also in estimating the model (compared to only in prediction); default is \code{TRUE}.
#' @param \dots Arguments passed to function \code{MASS::lda}.
#' @return The following objects are also a part of what is returned by the \code{MASS::lda} function.
#' \itemize{
#' \item \code{prior} - Prior probabilities of class membership taken to estimate the model (it can be estimated based on the sample data or it can be provided by a reseacher).
#' \item \code{counts} - Number of units in each category of categorical variable taken to estimate the model.
#' \item \code{means} - Group means.
#' \item \code{scaling} - Matrix that transforms observations to discriminant functions, normalized so that within groups covariance matrix is spherical.
#' \item \code{lev} - Levels (groups) of the categorical variable.
#' \item \code{svd} - Singular values, that give the ratio of the between-group and within-group standard deviations on linear discriminant variables. Their squares are the canonical F-statistics.
#' \item \code{N} - Number of observations used.
#' \item \code{call} - 	the (matched) function call.
#' }
#' The additional following objects are generated by the \code{multiUS::ldaPlus} function.
#' \itemize{
#' \item \code{standCoefWithin} - Standardized coefficients (within groups) of discriminant function.
#' \item \code{standCoefTotal} - Standardized coefficients of discriminant function.
#' \item \code{betweenGroupsWeights} - Proportions/priors used when estimating the model.
#' \item \code{sigTest} - Test of canonical correlations between the variable that represent groups (binary variable) and numeric variables (see function \code{testCC} for more details) (Ho: The current and all the later canonical correlations equal to zero.).
#' \item \code{eigModel} - Table with eigenvalues and canonical correlations (see function \code{testCC} for more details).
#' \item \code{centroids} - Means of discriminant variables by levels of categorical variable (not predicted, but actual).
#' \item \code{corr} - Pooled correlations within groups (correlations between values of numerical variables and values of linear discriminat function(s)).
#' \item \code{pred}
#' \itemize{
#' \item \code{class} - Predicted values of categorical variable
#' \item \code{posterior} - Posterior probabilities (the values of the Fisher's calcification linear discrimination function)
#' \item \code{x} - Estimated values of discriminat function(s) for each unit
#' }
#' \item \code{class} - Classification table:
#' \itemize{
#' \item \code{orgTab} - Frequency table.
#' \item \code{perTab} - Percentages.
#' \item \code{corPer} - Percentage of correctly predicted values (alternatively, percentage of correctly classified units).
#' }
#' \item \code{classCV} - Similar to \code{class} but based on cross validation (Jack-knife).
#' }
#' @examples
#' ldaPlus(x = mtcars[,c(1, 3, 4, 5, 6)], grouping = mtcars[,10])
#' @details The specified \code{prior} is not taken into account when computing eigenvalues and all statistics based on them (everything in components \code{eigModel} and \code{sigTest} of the returned value).
#' @author Aleš Žiberna
#' @references
#' R Data Analysis Examples: Canonical Correlation Analysis, UCLA: Statistical Consulting Group. From http://www.ats.ucla.edu/stat/r/dae/canonical.htm (accessed Decembar 27, 2013).
#' @export

ldaPlus <- function(x,grouping,pred=TRUE, CV=TRUE,usePriorBetweenGroups=TRUE,...){
  call<-match.call()
  #if(!require(MASS)) stop("MASS Package is required but not installed!")
  LDA<-lda(x=x,grouping=grouping, CV=FALSE,usePriorBetweenGroups=usePriorBetweenGroups,...)
  LDA$call<-call
  n<-dim(x)[1]
  p<-dim(x)[2]
  dummy<-stats::model.matrix(~ grouping)[,-1,drop=FALSE]
  colnames(dummy)<-sub(pattern="^grouping",replacement="",x=colnames(dummy))
  q<-dim(dummy)[2]

  if(!is.null(list(...)$prior) & usePriorBetweenGroups) warning("The specified prior is not taken into account when computing eigenvalues and all statistics based on them (everything in components 'eigModel' and 'sigTest' of the returned value).")
  cca<-stats::cancor(x=x,y=dummy)
  tmp<-testCC(cor=cca$cor, n=n, p=p, q=q)

  res<-c(LDA,tmp)
  class(res)<-class(LDA)

  if(pred){
    res$pred<-predict.ldaPlus(LDA)

    res$centroids<-stats::aggregate(res$pred$x,by = list(grouping),FUN=mean)
    rownames(res$centroids)<-res$centroids[,1]
    res$centroids<-res$centroids[,-1]
    tmp<-by(cbind(x,res$pred$x),INDICES=grouping,FUN=stats::cov,simplify=TRUE)
    tmp<-simplify2array(tmp)
    d1<-dim(tmp)[1]

    tmp<-sweep(tmp,MARGIN = 3,STATS = res$counts-1,FUN = "*")
    covw<-apply(tmp,MARGIN = 1:2,FUN = sum)/(res$N - length(res$counts))
    res$corr<-stats::cov2cor(covw)[-((p+1):d1),((p+1):d1),drop=FALSE]


    tab<-table(orig=grouping,pred=res$pred$class)
    res$class<-list(
      orgTab=stats::addmargins(tab,margin = 1:2),
      # % of correct predictions by groups
      perTab =  stats::addmargins(prop.table(tab,1)*100,margin=2),
      # % of correct predictions
      corPer=sum(diag(prop.table(tab)))*100
    )
  }

  if(CV){
    cv<-lda(x=x,grouping=grouping, CV=TRUE,usePriorBetweenGroups=usePriorBetweenGroups,...)
    tab<-table(orig=grouping,pred=cv$class)
    res$classCV<-list(
      orgTab=stats::addmargins(tab,margin = 1:2),
      # % of correct predictions by groups
      perTab =  stats::addmargins(prop.table(tab,1)*100,margin=2),
      # % of correct predictions
      corPer=sum(diag(prop.table(tab)))*100
    )
  }
  return(res)
}

Try the multiUS package in your browser

Any scripts or data that you put into this service are public.

multiUS documentation built on Jan. 23, 2023, 5:40 p.m.