surface: Calculate Surface Co-occurrence Counts


View source: R/surface.R

Description

Calculates co-occurrence counts for the supplied vector. For each co-occurrence the maximum possible number of co-occurrences is also calculated.

Usage

  surface(x, span, nodes = NULL, collocates = NULL)

Arguments

x

A vector. This is the subject of the co-occurrence counting. See details.

span

A character string defining the co-occurrence span. See details.

nodes

A character vector of nodes or character string representing a single node. The nodes essentially act as a filter on the x column of the returned data structure. Use of nodes will significantly reduce memory usage.

collocates

A character vector of collocates or character string representing a single collocate. The collocates essentially act as a filter on the y column of the returned data structure.

Details

x is assumed to be an ordered vector of tokenized text. No processing will be applied to x prior to the co-occurrence count calculations.

‘surface’ co-occurrence is easiest to describe with an example. The following illustrates a span of '2LR', that is, 2 tokens to the left and 2 tokens to the right of the node.

    ("a", "man", "a", "plan", "a", "cat", "a", "canal", "panama")
          |___________|____|___________|

In this example the term “plan” would co-occur once each with the collocates “man” and “cat”, and twice with the collocate “a”.
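The counts above can be checked by hand with a small base R sketch. The helper `window_collocates()` is hypothetical and not part of the package; it only illustrates the windowing idea and makes no claim about how surface() is implemented.

```r
# Hypothetical helper (not part of the package): tabulate the tokens found
# within `left` positions before and `right` positions after each occurrence
# of `node` in the token vector `x`.
window_collocates <- function(x, node, left = 2, right = 2) {
  hits <- which(x == node)                 # positions of the node (NAs are skipped)
  collocates <- unlist(lapply(hits, function(i) {
    idx <- c(i - seq_len(left), i + seq_len(right))
    idx <- idx[idx >= 1 & idx <= length(x)]  # drop positions outside the vector
    x[idx]
  }))
  table(collocates)
}

x <- c("a", "man", "a", "plan", "a", "cat", "a", "canal", "panama")
counts <- window_collocates(x, "plan", left = 2, right = 2)
print(counts)
```

For the node “plan” this gives 2 for “a” and 1 each for “cat” and “man”, matching the description above.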

Other examples of span:

span = '1L2R'

    ("a", "man", "a", "plan", "a", "cat", "a", "canal", "panama")
                 |____|____|___________|

span = '2R'

    ("a", "man", "a", "plan", "a", "cat", "a", "canal", "panama")
                      |____|___________|
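One way to read the span strings shown above is as an optional left width followed by an optional right width. The sketch below encodes that reading; `parse_span()` is a hypothetical helper, and the accepted forms ('2LR', '1L2R', '2R', and by assumption '2L') are inferred from the examples on this page rather than taken from the package source.

```r
# Hypothetical parser (an assumption, not the package's own code):
# turn a span string into left and right window widths.
parse_span <- function(span) {
  # Symmetric form, e.g. '2LR': the same width on both sides.
  if (grepl("^[0-9]+LR$", span)) {
    n <- as.integer(sub("LR$", "", span))
    return(c(left = n, right = n))
  }
  # Asymmetric forms, e.g. '1L2R', '2L' or '2R'.
  if (!nzchar(span) || !grepl("^([0-9]+L)?([0-9]+R)?$", span)) {
    stop("unrecognised span: ", span)
  }
  left <- if (grepl("L", span)) as.integer(sub("L.*$", "", span)) else 0L
  right <- if (grepl("R", span)) {
    as.integer(sub("R$", "", sub("^[0-9]+L", "", span)))
  } else {
    0L
  }
  c(left = left, right = right)
}

parse_span("2LR")   # left = 2, right = 2
parse_span("1L2R")  # left = 1, right = 2
parse_span("2R")    # left = 0, right = 2
```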

NAs can be used to implement co-occurrence barriers. For example, if two NA values are inserted into x at each sentence boundary, then with a span of 2 (e.g. span = '2R') no co-occurrences will be counted across sentences. See Evert (2008) for a detailed description of co-occurrence barriers.
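A minimal base R sketch of building such a vector follows. It assumes the text is already split into one token vector per sentence; the names `sentences`, `barrier_width` and `add_barriers()` are hypothetical and used only for illustration.

```r
# Hypothetical input: one character vector of tokens per sentence.
sentences <- list(
  c("a", "man", "a", "plan"),
  c("a", "canal", "panama")
)
barrier_width <- 2  # should be at least as wide as the span on either side

# Append `width` NAs after every sentence except the last, then flatten
# the list into a single ordered token vector.
add_barriers <- function(sentences, width) {
  n <- length(sentences)
  padded <- lapply(seq_len(n), function(i) {
    if (i < n) c(sentences[[i]], rep(NA_character_, width)) else sentences[[i]]
  })
  unlist(padded)
}

x <- add_barriers(sentences, barrier_width)
print(x)
```

The resulting x is the same vector used in the co-occurrence barrier example in the Examples section, so it can be passed to surface() in the same way.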

For a detailed description of ‘surface’ co-occurrence and the other types of co-occurrence see Evert (2008).

Value

Returns a data.table containing counts for all co-occurrences in x. A data.table is also a data.frame, so if the data.table package is not loaded the returned object will behave exactly as a data.frame. For large data sets, however, using data.table functionality offers a significant performance improvement.

The returned object is of the form:

    Classes ‘data.table’ and 'data.frame': ...
     $ x: chr
     $ y: chr
     $ H: int
     $ M: int
     - attr(*, "sorted")= chr  "x" "y"
     - attr(*, ".internal.selfref")=<externalptr> 

where H is the number of times x co-occurs with y (think Hits), and M is the number of times x fails to co-occur with y when it could have (think Misses); hence H + M is the maximum number of times that x could have co-occurred with y.
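The relationship between H and M can be illustrated with plain data.frame operations. The counts below are copied from the first example in the Examples section (span = '2R'); the derived column names `max_cooccurrences` and `hit_rate` are hypothetical and are not produced by surface().

```r
# Counts copied from the first example on this page (span = '2R').
res <- data.frame(
  x = c("a", "a", "a", "a", "a", "canal", "man", "man", "plan", "plan"),
  y = c("a", "canal", "man", "panama", "plan", "panama", "a", "plan", "a", "canal"),
  H = c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
  M = c(4L, 5L, 5L, 5L, 5L, 0L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# H + M is the maximum number of times x could have co-occurred with y.
res$max_cooccurrences <- res$H + res$M

# Illustrative derived measure: the share of available slots that were hits.
res$hit_rate <- res$H / res$max_cooccurrences

print(res)
```

Every row for the node “a” has the same H + M (6 here), because the maximum depends only on how many span positions the node has available, not on the collocate.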

References

S. Evert (2008) Corpora and collocations. Corpus Linguistics: An International Handbook 1212–1248.

Examples

    # =====================
    # surface co-occurrence
    # =====================

    x <- c("a", "man", "a", "plan", "a", "canal", "panama")
    surface(x, span = '2R')

    ##         x      y H M
    ##  1:     a      a 2 4
    ##  2:     a  canal 1 5
    ##  3:     a    man 1 5
    ##  4:     a panama 1 5
    ##  5:     a   plan 1 5
    ##  6: canal panama 1 0
    ##  7:   man      a 1 1
    ##  8:   man   plan 1 1
    ##  9:  plan      a 1 1
    ## 10:  plan  canal 1 1


    # filter on nodes
    surface(x, span = '2R', nodes = c("canal", "man", "plan"))

    ##         x      y H M
    ##  1: canal panama 1 0
    ##  2:   man      a 1 1
    ##  3:   man   plan 1 1
    ##  4:  plan      a 1 1
    ##  5:  plan  canal 1 1

    # filter on nodes and collocates
    surface(x, span = '2R', nodes = c("canal", "man", "plan"), collocates = c("panama", "a"))

    ##         x      y H M
    ##  1: canal panama 1 0
    ##  2:   man      a 1 1
    ##  3:  plan      a 1 1


    # co-occurrence barrier
    x <- c("a", "man", "a", "plan", NA, NA, "a", "canal", "panama")
    surface(x, span = '2R')

    ##         x      y H M
    ##  1:     a      a 1 4
    ##  2:     a  canal 1 4
    ##  3:     a    man 1 4
    ##  4:     a panama 1 4
    ##  5:     a   plan 1 4
    ##  6: canal panama 1 0
    ##  7:   man      a 1 1
    ##  8:   man   plan 1 1

ravingmantis/CorporaCoCo documentation built on March 19, 2018, 9:08 a.m.