# surface: Calculate Surface Co-occurrence Counts In ravingmantis/CorporaCoCo: Corpora Co-Occurrence Comparison

## Description

Calculates co-occurrence counts for the supplied vector. For each co-occurrence the maximum possible number of co-occurrences is also calculated.

## Usage

```r
surface(x, span, nodes = NULL, collocates = NULL)
```

## Arguments

| Argument | Description |
| --- | --- |
| `x` | A vector. This is the subject of the co-occurrence counting. See details. |
| `span` | A character string defining the co-occurrence span. See details. |
| `nodes` | A `character vector` of nodes or `character string` representing a single node. The nodes essentially act as a filter on the `x` column of the returned data structure. Use of `nodes` will significantly reduce memory usage. |
| `collocates` | A `character vector` of collocates or `character string` representing a single collocate. The collocates essentially act as a filter on the `y` column of the returned data structure. |
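The filtering role of `nodes` and `collocates` can be illustrated without the package by subsetting the full result. The sketch below is only an illustration in base R: it copies the output of `surface(x, span = '2R')` from the Examples section into a plain `data.frame` and assumes that the filters are equivalent to `%in%` tests on the `x` and `y` columns. It does not show how the package applies them internally.

```r
# Full result of surface(x, span = '2R'), copied from the Examples section.
full <- data.frame(
  x = c("a", "a", "a", "a", "a", "canal", "man", "man", "plan", "plan"),
  y = c("a", "canal", "man", "panama", "plan", "panama", "a", "plan", "a", "canal"),
  H = c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
  M = c(4L, 5L, 5L, 5L, 5L, 0L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

nodes      <- c("canal", "man", "plan")
collocates <- c("panama", "a")

# Assumption: `nodes` keeps rows whose x is a node,
# and `collocates` keeps rows whose y is a collocate.
by_nodes <- full[full$x %in% nodes, ]
by_both  <- by_nodes[by_nodes$y %in% collocates, ]

print(by_nodes)
print(by_both)
```

The two subsets contain the same rows as the filtered calls shown in the Examples section.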

## Details

`x` is assumed to be an ordered vector of tokenized text. No processing is applied to `x` before the co-occurrence counts are calculated.

‘surface’ co-occurrence is easiest to describe with an example. The following diagram shows a `span` of `'2LR'`, that is, 2 tokens to the left and 2 tokens to the right of the node “plan”.

```
("a", "man", "a", "plan", "a", "cat", "a", "canal", "panama")
      |___________|____|___________|
```

In this example the term “plan” would co-occur once each with the collocates “man” and “cat”, and twice with the collocate “a”.
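The counts in this example can be checked by hand with a few lines of base R. The helper `window_counts()` below is hypothetical and is not part of CorporaCoCo. It is a minimal sketch that looks `left` tokens before and `right` tokens after each occurrence of a node and tabulates what it finds.

```r
x <- c("a", "man", "a", "plan", "a", "cat", "a", "canal", "panama")

# Hypothetical helper (not part of CorporaCoCo): tabulate the tokens found
# within `left` positions before and `right` positions after each
# occurrence of `node` in `x`.
window_counts <- function(x, node, left, right) {
  offsets <- c(-rev(seq_len(left)), seq_len(right))
  hits <- character(0)
  for (i in which(x == node)) {
    idx <- i + offsets
    idx <- idx[idx >= 1 & idx <= length(x)]  # stay inside the vector
    hits <- c(hits, x[idx])
  }
  table(hits[!is.na(hits)])                  # NA positions never count as hits
}

counts <- window_counts(x, "plan", left = 2, right = 2)
print(counts)
```

For the node “plan” this gives 2 for “a” and 1 each for “cat” and “man”, matching the description above.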

Other examples of `span`:

`span = '1L2R'`

```
("a", "man", "a", "plan", "a", "cat", "a", "canal", "panama")
             |____|____|___________|
```

`span = '2R'`

```
("a", "man", "a", "plan", "a", "cat", "a", "canal", "panama")
                  |____|___________|
```
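One way to read the `span` notation is as a pair of window widths. The function `parse_span()` below is hypothetical: it is a sketch of that reading covering only the three forms shown in this section (`'nLR'`, `'nLmR'` and `'nR'`), and it is not taken from the package source. Whether `surface` accepts other forms is not stated here.

```r
# Hypothetical parser (an assumption, not CorporaCoCo's own code): turn a
# span string into the number of tokens to the left and right of the node.
parse_span <- function(span) {
  if (!grepl("^([0-9]+LR|[0-9]+R|[0-9]+L[0-9]+R)$", span)) {
    stop("unrecognised span: ", span)
  }
  if (grepl("LR$", span)) {
    # 'nLR' means n tokens on both sides
    n <- as.integer(sub("LR$", "", span))
    return(c(left = n, right = n))
  }
  # 'nLmR' or 'mR': the left width is optional and defaults to 0
  left <- if (grepl("L", span, fixed = TRUE)) {
    as.integer(sub("L.*$", "", span))
  } else {
    0L
  }
  right <- as.integer(sub("^.*L", "", sub("R$", "", span)))
  c(left = left, right = right)
}

print(parse_span("2LR"))
print(parse_span("1L2R"))
print(parse_span("2R"))
```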

`NA`s can be used to implement co-occurrence barriers. For example, if two `NA` values are inserted into `x` at each sentence boundary, then with a span of 2 (for example `span = '2LR'`) co-occurrences will not happen across sentences. See Evert (2008) for a detailed description of co-occurrence barriers.
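A minimal sketch of building such a vector is shown below. The helper `add_barriers()` is hypothetical and not part of CorporaCoCo, and it assumes the text has already been split into one token vector per sentence. The call to `surface` runs only if the package is installed.

```r
# Assumed input: one character vector of tokens per sentence.
sentences <- list(
  c("a", "man", "a", "plan"),
  c("a", "canal", "panama")
)

# Hypothetical helper (not part of CorporaCoCo): concatenate the sentences,
# placing `width` NA values between neighbouring sentences.
add_barriers <- function(sentences, width) {
  out <- character(0)
  for (i in seq_along(sentences)) {
    if (i > 1) out <- c(out, rep(NA_character_, width))
    out <- c(out, sentences[[i]])
  }
  out
}

# Use as many NAs as the widest side of the span, here 2.
x <- add_barriers(sentences, width = 2)
print(x)

# Only run the package call when CorporaCoCo is available.
if (requireNamespace("CorporaCoCo", quietly = TRUE)) {
  print(CorporaCoCo::surface(x, span = '2R'))
}
```

The resulting `x` is the same vector used in the co-occurrence barrier example in the Examples section.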

For a detailed description of ‘surface’ co-occurrence and the other types of co-occurrence see Evert (2008).

## Value

Returns a `data.table` containing counts for all co-occurrences in `x`. Note that a `data.table` is also a `data.frame`, so if the `data.table` library is not loaded the returned object will behave exactly as a `data.frame`; for large data sets, however, exploiting `data.table` functionality offers a significant performance gain.

The returned object is of the form:

```
Classes ‘data.table’ and 'data.frame': ...
 $ x: chr
 $ y: chr
 $ H: int
 $ M: int
 - attr(*, "sorted")= chr "x" "y"
 - attr(*, ".internal.selfref")=
```

where `H` is the number of times `x` co-occurs with `y` (think Hits), and `M` is the number of times `x` fails to co-occur with `y` when it could have (think Misses); hence `H + M` is the maximum number of times that `x` could have co-occurred with `y`.
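The relationship between `H`, `M` and the maximum can be made concrete with a few rows taken from the Examples section. In the sketch below, the columns `max_possible` and `hit_rate` are illustrative additions and are not returned by `surface`.

```r
# A few rows copied from the surface(x, span = '2R') output in Examples.
res <- data.frame(
  x = c("a", "a", "canal", "man"),
  y = c("a", "canal", "panama", "a"),
  H = c(2L, 1L, 1L, 1L),
  M = c(4L, 5L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Maximum number of times x could have co-occurred with y.
res$max_possible <- res$H + res$M

# Illustrative derived value: the share of opportunities that were hits.
res$hit_rate <- res$H / res$max_possible

print(res)
```

For example, “a” has six positions in its right-hand windows in total, so every row with `x = "a"` has `H + M = 6`.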

## References

S. Evert (2008) Corpora and collocations. Corpus Linguistics: An International Handbook 1212–1248.

## Examples

```r
library(CorporaCoCo)

# =====================
# surface co-occurrence
# =====================

x <- c("a", "man", "a", "plan", "a", "canal", "panama")
surface(x, span = '2R')
##         x      y H M
##  1:     a      a 2 4
##  2:     a  canal 1 5
##  3:     a    man 1 5
##  4:     a panama 1 5
##  5:     a   plan 1 5
##  6: canal panama 1 0
##  7:   man      a 1 1
##  8:   man   plan 1 1
##  9:  plan      a 1 1
## 10:  plan  canal 1 1

# filter on nodes
surface(x, span = '2R', nodes = c("canal", "man", "plan"))
##        x      y H M
## 1: canal panama 1 0
## 2:   man      a 1 1
## 3:   man   plan 1 1
## 4:  plan      a 1 1
## 5:  plan  canal 1 1

# filter on nodes and collocates
surface(x, span = '2R', nodes = c("canal", "man", "plan"),
        collocates = c("panama", "a"))
##        x      y H M
## 1: canal panama 1 0
## 2:   man      a 1 1
## 3:  plan      a 1 1

# co-occurrence barrier
x <- c("a", "man", "a", "plan", NA, NA, "a", "canal", "panama")
surface(x, span = '2R')
##        x      y H M
## 1:     a      a 1 4
## 2:     a  canal 1 4
## 3:     a    man 1 4
## 4:     a panama 1 4
## 5:     a   plan 1 4
## 6: canal panama 1 0
## 7:   man      a 1 1
## 8:   man   plan 1 1
```

ravingmantis/CorporaCoCo documentation built on March 19, 2018, 9:08 a.m.