coll_analysis: Association measures for collocation and collostruction...

coll_analysis    R Documentation

Association measures for collocation and collostruction analyses

Description

Calculates common association measures used to perform collocation or collostruction analysis for typical count data.

Usage

coll_analysis(.x, ...)

## S3 method for class 'data.frame'
coll_analysis(
  .x,
  o11 = NULL,
  f1 = NULL,
  f2 = NULL,
  n = NULL,
  fun = "ll",
  flip = NULL,
  ...
)

## S3 method for class 'matrix'
coll_analysis(.x, f2 = NULL, n = NULL, fun = "ll", flip = NULL, ...)

## Default S3 method:
coll_analysis(.x, o11, f1, f2 = NULL, n = NULL, fun = "ll", flip = NULL, ...)

Arguments

.x

data.frame, matrix or list containing the data

...

further arguments to be passed to or from other methods

o11

numeric: joint frequencies

f1

numeric: corpus frequencies of the word

f2

numeric of length 1 or of the same length as o11: corpus frequencies of the co-occurring structure; if omitted, the sum of o11 is used

n

numeric of length 1 or of the same length as o11: corpus or sample size; if omitted, sum(f1 + f2) is used. This default may be undesirable for collostruction analysis, where the corpus size should always be passed explicitly

fun

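character vector or named list containing character, function or expression elements: the association measures to calculate. Character elements refer to built-in measures by name (default "ll"); functions and expressions define custom measures. Names of list elements are used as names of the output columns.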

flip

character: names of measures for which to flip the sign for cases with negative association, intended for two-sided measures

Details

For collocation analysis, f1 and f2 typically represent the corpus frequencies of the word and the collocate, respectively, with the co-occurrence frequencies included in both. For collostruction analysis, f1 represents the corpus frequency of the word and f2 the frequency of the construction. In a contingency table, f1 and f2 are the marginal sums. Both the construction frequency f2 and the corpus size n can be provided as vectors, which allows for efficient calculations over data from multiple constructions or corpora.
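The relation between the four arguments and a 2x2 contingency table can be sketched in base R as follows. This is only an illustration with made-up counts; the cell names o12, o21, o22 and e11 are conventions assumed here, and the code is not taken from the package itself. Note how a single f2 and n are recycled over several words.

```r
# Hypothetical toy counts, not taken from any corpus
o11 <- c(30, 5, 12)      # joint frequencies of word and co-occurring structure
f1  <- c(200, 50, 400)   # corpus frequencies of the words (row marginals)
f2  <- 1000              # frequency of the structure (column marginal), recycled
n   <- 100000            # corpus size, recycled

# The remaining observed cells follow from the marginal sums
o12 <- f1 - o11           # word without the structure
o21 <- f2 - o11           # structure without the word
o22 <- n - f1 - f2 + o11  # neither word nor structure

# Expected joint frequency under independence
e11 <- f1 * f2 / n

cells <- cbind(o11, o12, o21, o22, e11)
cells
```

Passing vectors for f2 or n instead of single values works the same way, as long as they have the same length as o11.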

For data.frame input, the values for "o11", "f1", "f2" and "n" can either be provided explicitly, as an expression or character argument, or be taken implicitly from columns of the same name. Passing the columns explicitly is recommended.

Matrix input currently requires the column names "o11", "f1", "f2" and "n".
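To illustrate what a two-sided measure is and what flipping its sign amounts to, here is a minimal base R sketch of a signed log-likelihood ratio statistic (G2). It assumes that the built-in measure "ll" refers to this statistic; the helper g2_signed is hypothetical and the package's own implementation may differ.

```r
# Sketch of a signed log-likelihood ratio (G2); NOT the package's actual code
g2_signed <- function(o11, f1, f2, n, flip = TRUE) {
  # Observed cells of the 2x2 table, one row per word
  o <- cbind(o11, f1 - o11, f2 - o11, n - f1 - f2 + o11)
  # Expected cells under independence: row total * column total / n
  e <- cbind(f1 * f2, f1 * (n - f2), (n - f1) * f2, (n - f1) * (n - f2)) / n
  # Cells with zero observations contribute 0 to the sum
  terms <- ifelse(o > 0, o * log(o / e), 0)
  g2 <- 2 * rowSums(terms)
  # G2 itself is never negative; flip the sign where the pair
  # occurs less often than expected (negative association)
  if (flip) g2 <- ifelse(o[, 1] < e[, 1], -g2, g2)
  g2
}

# First pair is attracted (30 observed vs. 2 expected),
# second pair is repelled (1 observed vs. 4 expected)
g2_signed(o11 = c(30, 1), f1 = c(200, 400), f2 = 1000, n = 100000)
```

Without the flip, attraction and repulsion of the same strength would receive the same positive score, which is why sign flipping is useful when ranking results.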

Value

an object similar to .x with one result column per association measure specified in fun; row names in matrices and character or factor columns in data.frames are preserved

Examples


data(adjective_cooccurrence)
.x <- subset(adjective_cooccurrence, word != collocate)
n <- attr(adjective_cooccurrence, "corpus_size")
res <- coll_analysis(.x, o11, f1, f2, n, fun = "ll")
res[order(res$ll, decreasing = TRUE), ] |> head()

# if the column names match the argument names, the columns are used implicitly
all(c("o11", "f1", "f2") %in% names(.x)) # TRUE
coll_analysis(.x, n = n, fun = "ll") |>
  head()

# control names of output columns by using a named list
coll_analysis(.x, o11, f1, f2, n, fun = list(logl = "ll")) |>
  head()

# using custom function
mi_base2 <- \(o11, e11) log2(o11 / e11)
coll_analysis(.x, o11, f1, f2, n, fun = mi_base2) |>
  head()

# mix built-in measures with custom functions
coll_analysis(.x, n = n, fun = list(builtin = "ll", custom = mi_base2)) |>
  head()


alex-raw/occurR documentation built on March 10, 2023, 5:08 p.m.