keyness_measure_calculator: Calculate the keyness measures
In amacanovic/KeynessMeasures: Calculating keyness measures for corpus analysis

View source: R/keyness_measure_calculator.R

keyness_measure_calculator

R Documentation

Calculate the keyness measures

Description

This function takes a data frame with word frequencies in two corpora (target and reference), typically the output of `frequency_table_creator`, and calculates the keyness of each word in the target corpus versus the reference corpus. It supports Log Likelihood and the Bayes Information Criterion as statistical significance measures, and the Effect Size of Log Likelihood (ELL), %DIFF, Relative Risk, Log Ratio and Odds Ratio as effect size measures of keyness. The user can specify which measures are calculated. The output `data.frame` can be sorted by decreasing or increasing value of any calculated measure.

Usage

keyness_measure_calculator(
  df,
  log_likelihood = TRUE,
  ell = TRUE,
  bic = TRUE,
  perc_diff = TRUE,
  relative_risk = TRUE,
  log_ratio = TRUE,
  odds_ratio = TRUE,
  sort = c("none", "decreasing", "increasing"),
  sort_by = c("log_likelihood", "perc_diff", "bic", "ell", "relative_risk",
    "log_ratio", "odds_ratio")
)

Arguments

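`df`	a `data.frame` of word frequencies in the target and reference corpora, such as the output of `frequency_table_creator`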
`log_likelihood`	logical; if `TRUE`, the log likelihood measure is calculated for each word
`ell`	logical; if `TRUE`, the effect size of the log likelihood measure is calculated for each word
`bic`	logical; if `TRUE`, the Bayes Information Criterion is calculated for each word
`perc_diff`	logical; if `TRUE`, the %DIFF measure is calculated for each word
`relative_risk`	logical; if `TRUE`, the Relative Risk measure is calculated for each word
`log_ratio`	logical; if `TRUE`, the Log Ratio measure is calculated for each word
`odds_ratio`	logical; if `TRUE`, the Odds Ratio measure is calculated for each word
`sort`	if `"none"`, the resulting table of measures will not be sorted; if `"decreasing"`, the table will be sorted starting with the highest value of the measure chosen as the reference for sorting; if `"increasing"`, the table will be sorted starting with the lowest value of the measure chosen as the reference for sorting.
`sort_by`	specifies which measure should be used as the reference for sorting the table. If `sort` is set to `"decreasing"` or `"increasing"` and nothing is specified for this parameter, it defaults to `"log_likelihood"`.
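
The interaction between `sort` and `sort_by` can be illustrated with a minimal base R sketch. The helper `sort_measures` and the toy table are hypothetical and are not part of the package; the use of `match.arg()` to pick the first listed choice as the default is an assumption about how such defaults are commonly handled in R, not a statement about the package's internals.

```r
# Hypothetical helper (not the package's code): sorts a table of measures
# according to the documented meaning of `sort` and `sort_by`.
sort_measures <- function(tab,
                          sort = c("none", "decreasing", "increasing"),
                          sort_by = c("log_likelihood", "perc_diff", "bic", "ell",
                                      "relative_risk", "log_ratio", "odds_ratio")) {
  sort <- match.arg(sort)        # falls back to "none" when not supplied
  sort_by <- match.arg(sort_by)  # falls back to "log_likelihood" when not supplied
  if (sort == "none") {
    return(tab)
  }
  # order() sorts ascending by default; flip it for "decreasing"
  tab[order(tab[[sort_by]], decreasing = (sort == "decreasing")), , drop = FALSE]
}

# Toy table with made-up values; the column names are assumptions
toy <- data.frame(
  word = c("the", "corpus", "keyness"),
  log_likelihood = c(1.2, 45.7, 116.2),
  log_ratio = c(0.1, 3.4, 2.6),
  stringsAsFactors = FALSE
)

sorted_ll <- sort_measures(toy, sort = "decreasing")  # sort_by not given
sorted_lr <- sort_measures(toy, sort = "increasing", sort_by = "log_ratio")
print(sorted_ll)
print(sorted_lr)
```

In this sketch, leaving `sort_by` unset while requesting a sort orders the rows by the `log_likelihood` column, which mirrors the default described above.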

Details

Log Likelihood is calculated as specified in Rayson and Garside (2000).
ELL, the Effect Size of Log Likelihood, is calculated as per Johnston et al. (2006). This implementation is based on Paul Rayson's formula as given on his website (Rayson, 2019).
The Bayes Information Criterion is calculated as proposed in Wilson (2013); its interpretation is available in Gabrielatos (2017). This implementation is based on Paul Rayson's formula as given on his website (Rayson, 2019).
The %DIFF measure is calculated as per Gabrielatos and Marchi (2012) and Gabrielatos (2018).
The Relative Risk (also known as the Risk Ratio) is discussed in Kilgarriff (2009). This implementation is based on Paul Rayson's formula as given on his website (Rayson, 2019).
The Log Ratio measure (the binary log of the ratio of relative frequencies) is related to the Relative Risk measure. It is implemented as proposed by Hardie (2014).
The Odds Ratio expresses the occurrence of the word in one corpus relative to its occurrence in the other corpus. This implementation is based on Paul Rayson's formula as given on his website (Rayson, 2019).
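
To make the measures above concrete, here is a minimal base R sketch that computes most of them for a single word using their commonly published definitions. The helper `keyness_one_word` and its argument names are hypothetical, and the code is not taken from the package, whose handling of edge cases (for example zero frequencies) may differ. ELL is left out because its exact formula is not stated here, and the BIC line assumes one degree of freedom.

```r
# Hypothetical helper (not the package's code).
# freq_target, freq_ref: frequency of the word in the target / reference corpus
# size_target, size_ref: total number of tokens in each corpus
keyness_one_word <- function(freq_target, freq_ref, size_target, size_ref) {
  n <- size_target + size_ref
  # Expected frequencies if the word were equally common in both corpora
  exp_target <- size_target * (freq_target + freq_ref) / n
  exp_ref <- size_ref * (freq_target + freq_ref) / n
  # Treat 0 * log(0) as 0 so a zero frequency does not produce NaN
  ll_term <- function(obs, expd) if (obs == 0) 0 else obs * log(obs / expd)
  ll <- 2 * (ll_term(freq_target, exp_target) + ll_term(freq_ref, exp_ref))
  # Relative (normalised) frequencies
  rel_target <- freq_target / size_target
  rel_ref <- freq_ref / size_ref
  # Note: a zero reference frequency yields Inf in the ratio-based measures;
  # this sketch does not apply any zero correction.
  rr <- rel_target / rel_ref
  list(
    log_likelihood = ll,
    bic = ll - log(n),  # assumption: one degree of freedom
    perc_diff = (rel_target - rel_ref) * 100 / rel_ref,
    relative_risk = rr,
    log_ratio = log2(rr),
    odds_ratio = (freq_target / (size_target - freq_target)) /
      (freq_ref / (size_ref - freq_ref))
  )
}

# Made-up counts: 120 hits in a 10,000-token target corpus,
# 40 hits in a 20,000-token reference corpus
res <- keyness_one_word(120, 40, 10000, 20000)
str(res)
```

With these made-up counts the word is six times as frequent (relatively) in the target corpus, so the Relative Risk is 6, %DIFF is 500 and the Log Ratio is `log2(6)`.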

Value

A dataframe with word frequencies in the target and reference corpora and the chosen keyness measures calculated for the target corpus against the reference corpus.

References

Gabrielatos, C. (2017) Clusters of keyness: A principled approach to selecting key items. In: Corpus Linguistics in the South.
Gabrielatos, C. (2018) Keyness Analysis: nature, metrics and techniques. In: Corpus Approaches to Discourse (pp. 225-258). Routledge.
Gabrielatos, C. and Marchi, A. (2012) Keyness: Appropriate metrics and practical issues. CADS International Conference 2012. Corpus-assisted Discourse Studies: More than the sum of Discourse Analysis and computing?, 13-14 September, University of Bologna, Italy.
Hardie, A. (2014) Log ratio – an informal introduction. Post on the website of the ESRC Centre for Corpus Approaches to Social Science (CASS). Retrieved from: http://cass.lancs.ac.uk/?p=1133.
Johnston, J.E., Berry, K.J. and Mielke, P.W. (2006) Measures of effect size for chi-squared and likelihood-ratio goodness-of-fit tests. Perceptual and Motor Skills, Volume 103, pp. 412-414.
Kilgarriff, A. (2009) Simple maths for keywords. In: Mahlberg, M., González-Díaz, V. and Smith, C. (eds.) Proceedings of the Corpus Linguistics Conference, CL2009. Liverpool.
Rayson, P. and Garside, R. (2000) Comparing corpora using frequency profiling. In: Proceedings of the workshop on Comparing Corpora, held in conjunction with the 38th annual meeting of the Association for Computational Linguistics (ACL 2000), pp. 1-6.
Rayson, P. (2019) Log-likelihood and effect size calculator. Website. Retrieved from: http://ucrel.lancs.ac.uk/llwizard.html.