assoc_scores: Association scores used in collocation analysis and keyword...
In wai-wong-reimagine/mclm: Mastering Corpus Linguistics Methods


The functions assoc_scores and assoc_abcd take co-occurrence frequencies for a number of items as their arguments and return a range of association scores used in collocation analysis, collostruction analysis and keyword analysis.

assoc_scores(x, 
             y = NULL, 
             min_freq = 3,
             measures = NULL,
             with_variants = FALSE,
             show_dots = FALSE,
             p_fisher_2 = FALSE,
             small_pos = 0.00001)

assoc_abcd(a, b, c, d,
           measures = NULL,
           with_variants = FALSE,
           show_dots = FALSE,
           p_fisher_2 = FALSE,
           small_pos = 0.00001)

`x`	the argument `x` can either be an object of class `"freqlist"` (i.e. the data type that is returned by the function `freqlist`) or an object of class `"cooc_info"` (i.e. the data type that is returned by the functions `surf_cooc` and `text_cooc`). If `x` is of class `"freqlist"`, it is interpreted as the target frequency list (i.e. the list with the frequencies of items in the target context). If `x` is of class `"cooc_info"`, it is interpreted as containing target frequency information, reference frequency information, and corpus size information.
`y`	if `x` is of class `"freqlist"`, then `y` is expected to also be of class `"freqlist"`, and is interpreted as the reference frequency list (i.e. the list with the frequencies of items in the reference context). If `x` is of class `"cooc_info"`, then `y` is ignored.
`a`	a vector of numbers that express how many times some target item occurs in the target context. For instance, `a[i]` expresses how many times the `i`-th target item occurs in the target context.
`b`	a vector of numbers that express how many times other items than some target item occur in the target context. For instance, `b[i]` expresses how many times other items than the `i`-th target item occur in the target context.
`c`	a vector of numbers that express how many times some target item occurs in the reference context. For instance, `c[i]` expresses how many times the `i`-th target item occurs in the reference context.
`d`	a vector of numbers that express how many times other items than some target item occur in the reference context. For instance, `d[i]` expresses how many times other items than the `i`-th target item occur in the reference context.
`min_freq`	the minimum value for `a[i]` that is needed for item `i` to be included in the output.
`measures`	a character vector containing the association measures (or related quantities) for which scores are requested. Supported measure names (and related quantities) are `"exp_a"`, `"exp_b"`, `"exp_c"`, `"exp_d"`, `"DP_rows"`, `"DP_cols"`, and many others. The argument `measures` can also have the value `NULL`, which is interpreted as short for the default selection `c("exp_a", "DP_rows", "RR_rows", "OR", "MS", "PMI", "DICE", "G", "chi2", "t", "fisher")`. The argument `measures` can also have the value `"ALL"`, in which case all supported measures are calculated.
`with_variants`	a boolean value that expresses whether for the requested `measures` all variants should be included in the output (`TRUE`) or just the main versions (`FALSE`).
`show_dots`	a boolean value that expresses whether or not a dot (`.`) should be output to the console each time calculations for a measure are finished.
`p_fisher_2`	a boolean value that expresses whether, in case `"fisher"` is one of the requested measures, the p-value for a two-sided test (testing for either attraction or repulsion) should also be calculated. By default, only the (computationally less demanding) p-value for a one-sided test (testing only for attraction) is calculated.
`small_pos`	Several of the association measures break down when one or more of the values `a`, `b`, `c`, and `d` are zero (for instance, because this would lead to division by zero or taking the log of zero). In order to avoid this, a small positive value is systematically added to all zero values for `a`, `b`, `c`, and `d`. The argument `small_pos` determines which small positive value is added in such cases. Its default value is `0.00001`. Adding these small positive values is done systematically, not only when measures are used that need this to be done.
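
As a rough illustration of the zero handling described for `small_pos`, the following base R sketch replaces zero counts by a small positive value before taking a logarithm. The helper name `add_small_pos` is hypothetical and is not part of mclm; the package's internal implementation may differ.

```r
# Hypothetical helper (not part of mclm): replace zero counts by small_pos.
add_small_pos <- function(x, small_pos = 0.00001) {
  x[x == 0] <- small_pos  # adding small_pos to zero yields small_pos itself
  x
}

a <- c(6, 0, 12)
a_adj <- add_small_pos(a)

log(a)      # the zero count yields -Inf
log(a_adj)  # every value is finite
```

Because the adjustment is applied to the counts themselves, the columns `a`, `b`, `c` and `d` in the output can contain `small_pos` where the input contained zero.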

The function assoc_scores takes as its arguments a target frequency list and a reference frequency list and returns a number of popular measures that express, for each item in either one of these lists, the extent to which the item is attracted to the target context (when compared to the reference context).

The function assoc_abcd takes as its arguments four vectors a, b, c and d of equal length. Each tuple of values (a[i],b[i],c[i],d[i]), with i some integer number between one and the length of the vectors, is assumed to represent the four numbers a, b, c, d in a contingency table of the type

	target item	other item	total
target context	`a`	`b`	`m`
reference context	`c`	`d`	`n`
total	`k`	`l`	`N`

In the above table m, n, k, l, and N are marginal frequencies. More specifically, m = a + b, n = c + d, k = a + c, l = b + d, and N = m + n.
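
The marginal frequencies can be checked by hand in base R. The sketch below uses the cell counts from the call assoc_abcd(6, 100, 15, 1000) and only illustrates the definitions; it does not call mclm.

```r
# Cell counts for a single item.
# Naming a variable `c` does not break later calls to the function c().
a <- 6; b <- 100; c <- 15; d <- 1000

m <- a + b   # row total: target context
n <- c + d   # row total: reference context
k <- a + c   # column total: target item
l <- b + d   # column total: other items
N <- m + n   # grand total (equal to k + l)

c(m = m, n = n, k = k, l = l, N = N)
# m = 106, n = 1015, k = 21, l = 1100, N = 1121
```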

Returns a data frame with as its rows all items from either the target frequency list or the reference frequency list (or, in case the argument min_freq is non-zero, all items whose frequency in the target frequency list is at least min_freq), and with as its columns a range of measures that express the extent to which the items are attracted to the target context (when compared to the reference context). Some columns don't contain actual measures, but rather additional information that is useful for interpreting certain measures.

The following are (possible) columns in the output:

`a`	The frequency in cell `a`, possibly augmented by `small_pos`. This column is always present.
`b`	The frequency in cell `b`, possibly augmented by `small_pos`. This column is always present.
`c`	The frequency in cell `c`, possibly augmented by `small_pos`. This column is always present.
`d`	The frequency in cell `d`, possibly augmented by `small_pos`. This column is always present.
`dir`	The direction of the association. It contains the value `1` in case of relative attraction between the target item and the target context (i.e. in case a / m ≥ c / n), and it contains the value `-1` in case of relative repulsion between the target item and the target context (i.e. in case a / m < c / n). This column is always present.
`exp_a`	The expected value for the `a` cell, assuming no difference between the contexts. This value is calculated as (m × k) / N. This column is present if `measures` includes either the value `"exp_a"` or the value `"expected"`. It is also present if `measures` is `NULL` or is equal to `"ALL"`.
`exp_b`	The expected value for the `b` cell, assuming no difference between the contexts. This value is calculated as (m × l) / N. This column is present if `measures` includes either the value `"exp_b"` or the value `"expected"`. It is also present if `measures` is equal to `"ALL"`.
`exp_c`	The expected value for the `c` cell, assuming no difference between the contexts. This value is calculated as (n × k) / N. This column is present if `measures` includes either the value `"exp_c"` or the value `"expected"`. It is also present if `measures` is equal to `"ALL"`.
`exp_d`	The expected value for the `d` cell, assuming no difference between the contexts. This value is calculated as (n × l) / N. This column is present if `measures` includes either the value `"exp_d"` or the value `"expected"`. It is also present if `measures` is equal to `"ALL"`.
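
The expected values and the `dir` column follow directly from the formulas given above. The base R sketch below recomputes them for a single item with no zero cells, so the `small_pos` adjustment plays no role; it is a hand calculation for checking purposes and is not the package's own code.

```r
a <- 6; b <- 100; c <- 15; d <- 1000
m <- a + b; n <- c + d; k <- a + c; l <- b + d; N <- m + n

# Expected cell counts, assuming no difference between the contexts.
exp_a <- m * k / N
exp_b <- m * l / N
exp_c <- n * k / N
exp_d <- n * l / N

# The expected counts reproduce the observed margins.
all.equal(exp_a + exp_b, m)
all.equal(exp_a + exp_c, k)

# Direction: 1 for relative attraction (a / m >= c / n), -1 for repulsion.
dir <- ifelse(a / m >= c / n, 1, -1)
dir
```

Here the observed `a` exceeds `exp_a` and `dir` is `1`, so the item is relatively attracted to the target context.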

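library(mclm)

# Scores for a single item, using the default selection of measures.
assoc_abcd(6, 100, 15, 1000)

# Scores for five randomly generated items.
set.seed(1)  # added here only to make the random counts reproducible
a <- sample(0:100, 5, replace = TRUE)   # target item in target context
b <- 100 - a                            # other items in target context
c <- sample(0:1000, 5, replace = TRUE)  # target item in reference context
d <- 1000 - c                           # other items in reference context
scores <- assoc_abcd(a, b, c, d,
                     measures = c("PMI", "t", "fisher"),
                     with_variants = TRUE,
                     p_fisher_2 = TRUE)
round(scores, 3)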