keyperm: Keyword Analysis Using Permutation Tests

Documented in keyperm

#' Calculate the permutation distribution for a keyness measure
#' 
#' Calculate the permutation distributions of a given keyness measure for each term by 
#' shuffling the corpus labels. Number of documents per corpus is kept constant.    
#' 
#' While usually keyness scores are judged by reference to a limiting null distribution
#' under a token-by-token-sampling model, this implementation approximates the null distribution under
#' a document-by-document sampling model. The permutation distributions of a given keyness measure for 
#' each term is calculated by repeatedly shuffling the corpus labels. 
#' Number of documents per corpus is kept constant.
#' 
#' Currently, the following types of scores are supported:
#' \describe{
#'     \item{\code{llr}}{The log-likelihood ratio}
#'     \item{\code{chisq}}{The Chi-Square-Statistic}
#'     \item{\code{diff}}{Difference of relative frequencies}
#'     \item{\code{logratio}}{Binary logarithm of the ratio of the relative frequencies, possibly using a laplace correction to avoid infinite values.}
#'     \item{\code{ratio}}{ratio of the relative frequencies, possibly using a laplace correction to avoid infinite values.}
#'  }
#' 
#'  \code{llr} and \code{chisq} are the test-statistics for a two-by-two contingency table. 
#' \tabular{rccc}{
#' \tab corpus A   \tab corpus B \tab TOTAL\cr
#' term of interest \tab \eqn{o_{11}}{o11}  \tab \eqn{o_{12}}{o12} \tab \eqn{r_{1}}{r1}\cr
#' other tokens \tab \eqn{o_{21}}{o21}    \tab \eqn{o_{22}}{o22} \tab \eqn{r_{2}}{r2}\cr
#' TOTAL \tab \eqn{c_{1}}{c1}    \tab \eqn{c_{2}}{c2} \tab N\cr
#' }
#' Both measure deviations from equal proportions but do not indicate the direction. 
#' For \code{llr}, the correct version using terms for all four fields of the table is used, 
#' not the version using only two terms that is sometimes used in corpus linguistics:
#' \deqn{llr = -2 * (o11 * log(o11/e11) + o12 * log(o12/e12) + 
#' o21 * log(o21/e21) + o22 * log(o22/e22))}
#' where \eqn{oij * log(oij/eij) = 0} if \eqn{oij = 0}.
#' 
#' \code{chisq} is the usual Chi-Square statistic for a test of independence / homogeneity:
#' \deqn{chisq = (o11 - e11)^2/e11 + (o12 - e12)^2/e12 + 
#' (o21 - e21)^2/e21 + (o22 - e22)^2/e22}
#' 
#' Both \code{llr} and \code{chisq} asymptotically follow a Chi-Square-Distribution 
#' with 1 degree of freedom if the null hypothesis of equal frequencies in both
#' populations is true and the corpora are drawn iid token-by-token. 
#' In contrast, the p-values calculated here are obtained based on a document-by-document 
#' sampling model, which is arguably more realistic in many cases. 
#' 
#' Here, \eqn{oij} are the observed counts as given above and \eqn{eij}
#' are the correpsonding expected values under an independence / homogeneity assumption.   
#' 
#' \code{diff} and \code{logratio} are measures of the effect size, 
#' but using the permutation approach implemented here a p-value can
#' be calculated as well. Both indicate the direction of the effect,
#' and can be used for one- or two-sided tests. 
#' \deqn{diff = o11 / c1 - o12 / c2}
#'
#' \code{logratio} is based on a ratio of ratios and would be infinite when a term does not occur in either of the two corpora, irrespective of number of occurences in the other corpus. Hence, we use a laplace correction adding a (not neccesarily integer) number \eqn{k} of ficticious occurences to both corpora: 
#'  \deqn{logratio = log2( ((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k)) )}
#'  where \eqn{o11} and \eqn{o12} are the number of occurences of the term of interest in Corpora A and B 
#'  and \eqn{c1} and \eqn{c2} are the total numbers of tokens in A and B. 
#'  Setting \eqn{k} to zero corresponds to the usual logratio (which may be 
#'  infinite). \eqn{k} is given by the \code{laplace} argument and 
#'  defaults to one, meaning one ficticious occurence is added to 
#'  either corpus. Doing so prevents infinite values but has little 
#'  effect when the number of occurences is large.  
#'  
#'  \code{ratio} is the same as \code{logratio} but omits the logarithm:
#'  \deqn{ratio = ((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k)) }
#'  This leads to the same p-values but is faster to compute. 
#'  
#' @param ifl Indexed frequency list as generated by \code{create_ifl()}.
#' @param observed The vector of observed values of the keyness scores as generarted by \code{keyness_scores()}
#' @param type The type of keyness measure. One of \code{llr}, \code{chisq}, \code{diff}, \code{logratio} or \code{ratio}. See details. 
#' @param laplace Parameter of Laplace correction. Only relevant for \code{type = "ratio"} and \code{type = "logratio"}. See details. 
#' @param output The type of output. For \code{output = "full"} a matrix with all generated scores is returned,
#'     for \code{output = "counts"} a matrix with three columns counting the number of permutations for which the 
#'     score is strictly smaller than, equal to or strictly larger than the observed value. 
#' @param nperm The number of permutations to generate.         
#' @return A numeric matrix with number of rows equal to the number of terms. The columns contain either all permutation values
#'     of the keyness score (\code{output = "full"}) or the number of permutations for which the 
#'     score is strictly smaller than, equal to or strictly larger than the observed value (\code{output = "counts"}).  
#'     
#' @export
keyperm <- function(ifl,
                    observed,
                    type = "llr",
                    laplace = 1.0,
                    output = "counts",
                    nperm) {
  scoretype <- switch(type,
                      llr = 1,
                      chisq = 2,
                      diff = 3,
                      logratio = 4,
                      ratio = 5)
  outtype <- switch(output,
                    full = 1,
                    counts = 2)
  out <-  genPerm(ind = ifl$corp_A,
                  start_vek = ifl$index$start,
                  nterm = ifl$index$nterms,
                  freqs = ifl$freqlist$freq,
                  termlist = ifl$freqlist$term,
                  rowsums = ifl$rowsums,
                  colsums = ifl$colsums,
                  ntotal = ifl$ntotal,
                  nperm = nperm,
                  output = outtype,
                  scoretype = scoretype,
                  observed = observed,
                  laplace = laplace
  )
  class(out) <- paste0("keyperm_results_", output)
  attr(out, "scoretype") <- type
  attr(out, "output") <- output
  if (outtype == 1) {
     attr(out, "dimnames") <- list(ifl$terms, NULL)
   } else 
      attr(out, "dimnames") <- list(ifl$terms, c("less", "equal", "greater"))
  out
}