#' Calculate observed keyness scores
#'
#' Calculates a vector of observed keyness scores for a given pair of corpora.
#'
#' Scores are calculated from an indexed frequency list for the pair of corpora,
#' as generated by \code{create_ifl()}.
#'
#' Currently, the following types of scores are supported:
#' \describe{
#' \item{\code{llr}}{The log-likelihood ratio.}
#' \item{\code{chisq}}{The chi-square statistic.}
#' \item{\code{diff}}{Difference of the relative frequencies.}
#' \item{\code{logratio}}{Binary logarithm of the ratio of the relative frequencies, possibly using a Laplace correction to avoid infinite values.}
#' \item{\code{ratio}}{Ratio of the relative frequencies, possibly using a Laplace correction to avoid infinite values.}
#' }
#'
#' \code{llr} and \code{chisq} are test statistics for a two-by-two contingency table:
#' \tabular{rccc}{
#' \tab corpus A \tab corpus B \tab TOTAL\cr
#' term of interest \tab \eqn{o_{11}}{o11} \tab \eqn{o_{12}}{o12} \tab \eqn{r_{1}}{r1}\cr
#' other tokens \tab \eqn{o_{21}}{o21} \tab \eqn{o_{22}}{o22} \tab \eqn{r_{2}}{r2}\cr
#' TOTAL \tab \eqn{c_{1}}{c1} \tab \eqn{c_{2}}{c2} \tab N\cr
#' }
#' Both measure deviations from equal proportions but do not indicate the direction.
#' For \code{llr}, the correct version using terms for all four fields of the table is used,
#' not the version using only two terms that is sometimes used in corpus linguistics:
#' \deqn{llr = 2 * (o11 * log(o11/e11) + o12 * log(o12/e12) +
#' o21 * log(o21/e21) + o22 * log(o22/e22))}
#' where \eqn{oij * log(oij/eij) = 0} if \eqn{oij = 0}.
#'
#' \code{chisq} is the usual chi-square statistic for a test of independence / homogeneity:
#' \deqn{chisq = (o11 - e11)^2/e11 + (o12 - e12)^2/e12 +
#' (o21 - e21)^2/e21 + (o22 - e22)^2/e22}
#'
#' Here, \eqn{oij} are the observed counts as given above and \eqn{eij}
#' are the corresponding expected values under an independence / homogeneity assumption.
#'
#' \code{diff} and \code{logratio} are measures of effect size;
#' with the permutation approach implemented here, a p-value can
#' be calculated for them as well. Both indicate the direction of the effect
#' and can be used for one- or two-sided tests.
#' \deqn{diff = o11 / c1 - o12 / c2}
#'
#' \code{logratio} is based on the ratio of the relative frequencies and would be infinite whenever a term is absent from one of the two corpora, irrespective of the number of occurrences in the other corpus. Hence, we use a Laplace correction that adds a (not necessarily integer) number \eqn{k} of fictitious occurrences to both corpora:
#' \deqn{logratio = log2( ((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k)) ) }
#' where \eqn{o11} and \eqn{o12} are the numbers of occurrences of the term of interest in corpora A and B
#' and \eqn{c1} and \eqn{c2} are the total numbers of tokens in A and B.
#' Setting \eqn{k} to zero gives the usual logratio (which may be
#' infinite). \eqn{k} is given by the \code{laplace} argument and
#' defaults to one, meaning one fictitious occurrence is added to
#' each corpus. Doing so prevents infinite values but has little
#' effect when the number of occurrences is large.
#'
#' \code{ratio} is the same as \code{logratio} but omits the logarithm:
#' \deqn{ratio = ((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k)) }
#' This leads to the same p-values but is faster to compute.
#'
#' @param ifl Indexed frequency list as generated by \code{create_ifl()}.
#' @param type The type of keyness measure. One of \code{llr}, \code{chisq}, \code{diff}, \code{logratio} or \code{ratio}. See details.
#' @param laplace Parameter of Laplace correction. Only relevant for \code{type = "ratio"} and \code{type = "logratio"}. See details.
#' @return A numeric vector of scores, one for each term. The terms are stored in the names attribute.
#'
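#' @examples
#' # A minimal base-R sketch of the formulas from the details section for a
#' # single term. The counts below are made up for illustration, and this is
#' # not the package's own implementation; it only mirrors the documented
#' # formulas so the scores can be checked by hand.
#' o <- matrix(c(30, 970, 10, 1990), nrow = 2,
#'             dimnames = list(c("term", "other"), c("A", "B")))
#' c1 <- sum(o[, "A"])  # total tokens in corpus A
#' c2 <- sum(o[, "B"])  # total tokens in corpus B
#' # Expected counts under independence / homogeneity
#' e <- outer(rowSums(o), colSums(o)) / sum(o)
#' # Log-likelihood ratio over all four cells; cells with o == 0 contribute 0
#' llr <- 2 * sum(ifelse(o > 0, o * log(o / e), 0))
#' chisq <- sum((o - e)^2 / e)
#' rel_diff <- o["term", "A"] / c1 - o["term", "B"] / c2
#' k <- 1  # Laplace correction, as in the default laplace = 1
#' ratio <- ((o["term", "A"] + k) / (c1 + k)) / ((o["term", "B"] + k) / (c2 + k))
#' logratio <- log2(ratio)
#' # Cross-check chisq against stats::chisq.test() without continuity correction
#' all.equal(chisq, unname(chisq.test(o, correct = FALSE)$statistic))
#' c(llr = llr, chisq = chisq, diff = rel_diff, logratio = logratio, ratio = ratio)
#'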
#' @useDynLib keyperm, .registration = TRUE
#' @export
keyness_scores <- function(ifl,
                           type = "llr",
                           laplace = 1.0) {
  # Map the score name to the integer code passed to getScores();
  # fail early on an unsupported type instead of passing NULL along.
  scoretype <- switch(type,
                      llr = 1,
                      chisq = 2,
                      diff = 3,
                      logratio = 4,
                      ratio = 5,
                      stop("Unknown type: ", type))
  obs_scores <- getScores(ind = ifl$corp_A,
                          start_vek = ifl$index$start,
                          nterm = ifl$index$nterms,
                          freqs = ifl$freqlist$freq,
                          termlist = ifl$freqlist$term,
                          rowsums = ifl$rowsums,
                          colsums = ifl$colsums,
                          ntotal = ifl$ntotal,
                          scoretype = scoretype,
                          laplace = laplace)
  # Label each score with its term
  names(obs_scores) <- ifl$terms
  obs_scores
}