#' Calculate the permutation distribution for a keyness measure
#'
#' Calculate the permutation distributions of a given keyness measure for each term by
#' shuffling the corpus labels. Number of documents per corpus is kept constant.
#'
#' While usually keyness scores are judged by reference to a limiting null distribution
#' under a token-by-token-sampling model, this implementation approximates the null distribution under
#' a document-by-document sampling model. The permutation distributions of a given keyness measure for
#' each term is calculated by repeatedly shuffling the corpus labels.
#' Number of documents per corpus is kept constant.
#'
#' Currently, the following types of scores are supported:
#' \describe{
#' \item{\code{llr}}{The log-likelihood ratio}
#' \item{\code{chisq}}{The Chi-Square-Statistic}
#' \item{\code{diff}}{Difference of relative frequencies}
#' \item{\code{logratio}}{Binary logarithm of the ratio of the relative frequencies, possibly using a laplace correction to avoid infinite values.}
#' \item{\code{ratio}}{ratio of the relative frequencies, possibly using a laplace correction to avoid infinite values.}
#' }
#'
#' \code{llr} and \code{chisq} are the test-statistics for a two-by-two contingency table.
#' \tabular{rccc}{
#' \tab corpus A \tab corpus B \tab TOTAL\cr
#' term of interest \tab \eqn{o_{11}}{o11} \tab \eqn{o_{12}}{o12} \tab \eqn{r_{1}}{r1}\cr
#' other tokens \tab \eqn{o_{21}}{o21} \tab \eqn{o_{22}}{o22} \tab \eqn{r_{2}}{r2}\cr
#' TOTAL \tab \eqn{c_{1}}{c1} \tab \eqn{c_{2}}{c2} \tab N\cr
#' }
#' Both measure deviations from equal proportions but do not indicate the direction.
#' For \code{llr}, the correct version using terms for all four fields of the table is used,
#' not the version using only two terms that is sometimes used in corpus linguistics:
#' \deqn{llr = -2 * (o11 * log(o11/e11) + o12 * log(o12/e12) +
#' o21 * log(o21/e21) + o22 * log(o22/e22))}
#' where \eqn{oij * log(oij/eij) = 0} if \eqn{oij = 0}.
#'
#' \code{chisq} is the usual Chi-Square statistic for a test of independence / homogeneity:
#' \deqn{chisq = (o11 - e11)^2/e11 + (o12 - e12)^2/e12 +
#' (o21 - e21)^2/e21 + (o22 - e22)^2/e22}
#'
#' Both \code{llr} and \code{chisq} asymptotically follow a Chi-Square-Distribution
#' with 1 degree of freedom if the null hypothesis of equal frequencies in both
#' populations is true and the corpora are drawn iid token-by-token.
#' In contrast, the p-values calculated here are obtained based on a document-by-document
#' sampling model, which is arguably more realistic in many cases.
#'
#' Here, \eqn{oij} are the observed counts as given above and \eqn{eij}
#' are the correpsonding expected values under an independence / homogeneity assumption.
#'
#' \code{diff} and \code{logratio} are measures of the effect size,
#' but using the permutation approach implemented here a p-value can
#' be calculated as well. Both indicate the direction of the effect,
#' and can be used for one- or two-sided tests.
#' \deqn{diff = o11 / c1 - o12 / c2}
#'
#' \code{logratio} is based on a ratio of ratios and would be infinite when a term does not occur in either of the two corpora, irrespective of number of occurences in the other corpus. Hence, we use a laplace correction adding a (not neccesarily integer) number \eqn{k} of ficticious occurences to both corpora:
#' \deqn{logratio = log2( ((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k)) )}
#' where \eqn{o11} and \eqn{o12} are the number of occurences of the term of interest in Corpora A and B
#' and \eqn{c1} and \eqn{c2} are the total numbers of tokens in A and B.
#' Setting \eqn{k} to zero corresponds to the usual logratio (which may be
#' infinite). \eqn{k} is given by the \code{laplace} argument and
#' defaults to one, meaning one ficticious occurence is added to
#' either corpus. Doing so prevents infinite values but has little
#' effect when the number of occurences is large.
#'
#' \code{ratio} is the same as \code{logratio} but omits the logarithm:
#' \deqn{ratio = ((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k)) }
#' This leads to the same p-values but is faster to compute.
#'
#' @param ifl Indexed frequency list as generated by \code{create_ifl()}.
#' @param observed The vector of observed values of the keyness scores as generarted by \code{keyness_scores()}
#' @param type The type of keyness measure. One of \code{llr}, \code{chisq}, \code{diff}, \code{logratio} or \code{ratio}. See details.
#' @param laplace Parameter of Laplace correction. Only relevant for \code{type = "ratio"} and \code{type = "logratio"}. See details.
#' @param output The type of output. For \code{output = "full"} a matrix with all generated scores is returned,
#' for \code{output = "counts"} a matrix with three columns counting the number of permutations for which the
#' score is strictly smaller than, equal to or strictly larger than the observed value.
#' @param nperm The number of permutations to generate.
#' @return A numeric matrix with number of rows equal to the number of terms. The columns contain either all permutation values
#' of the keyness score (\code{output = "full"}) or the number of permutations for which the
#' score is strictly smaller than, equal to or strictly larger than the observed value (\code{output = "counts"}).
#'
#' @export
keyperm <- function(ifl,
observed,
type = "llr",
laplace = 1.0,
output = "counts",
nperm) {
scoretype <- switch(type,
llr = 1,
chisq = 2,
diff = 3,
logratio = 4,
ratio = 5)
outtype <- switch(output,
full = 1,
counts = 2)
out <- genPerm(ind = ifl$corp_A,
start_vek = ifl$index$start,
nterm = ifl$index$nterms,
freqs = ifl$freqlist$freq,
termlist = ifl$freqlist$term,
rowsums = ifl$rowsums,
colsums = ifl$colsums,
ntotal = ifl$ntotal,
nperm = nperm,
output = outtype,
scoretype = scoretype,
observed = observed,
laplace = laplace
)
class(out) <- paste0("keyperm_results_", output)
attr(out, "scoretype") <- type
attr(out, "output") <- output
if (outtype == 1) {
attr(out, "dimnames") <- list(ifl$terms, NULL)
} else
attr(out, "dimnames") <- list(ifl$terms, c("less", "equal", "greater"))
out
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.