coco: Co-occurrence comparison



Description

Tests for statistically significant differences in co-occurrence counts between two data sets.

Usage

  coco(A, B, nodes, fdr = 0.01, collocates = NULL)

Arguments

A

A data.frame of co-occurrence counts. See details.

B

A data.frame of co-occurrence counts. See details.

nodes

A character vector of nodes or character string representing a single node.

fdr

The desired level at which to control the False Discovery Rate. Default value is 0.01.

collocates

A character vector of collocates or character string representing a single collocate. The collocates essentially act as a filter on the y column of the returned data structure. Use collocates to target the testing: running fewer tests reduces the loss of power caused by the multiple-testing correction.

Details

This function implements the method described in Hennessey and Wiegand (2017).

A and B are data.frames of the form

    Classes 'data.frame': ...
     $ x: chr  
     $ y: chr  
     $ H: int  
     $ M: int  

The data.frames encapsulate the co-occurrence counts for the (x, y) term pairs within a corpus. For a description of the columns see the details section of the surface function.

The nodes essentially act as a filter on the A$x and B$x columns. For a description of the use of nodes see Hennessey and Wiegand (2017).
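As a rough illustration of the input shape and of how nodes narrow the x column, the following base R sketch builds a small data.frame with the four columns listed above and filters it. The counts, the terms and the object names are invented for illustration, and the subsetting shown is only an analogy for the filtering described here, not the package's internal code.

```r
# Toy co-occurrence table with the x, y, H, M columns (all values invented).
A <- data.frame(
  x = c("back", "back", "eyes", "hand"),
  y = c("his", "her", "his", "her"),
  H = c(12L, 7L, 20L, 5L),
  M = c(88L, 93L, 180L, 95L),
  stringsAsFactors = FALSE
)

nodes <- c("back", "eyes")  # hypothetical nodes of interest

# Analogy only: keep the rows whose x term is one of the nodes.
A_nodes <- A[A$x %in% nodes, ]
print(A_nodes)
```

A single node can be passed the same way, as a character string such as "back".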

fdr indicates the level at which the False Discovery Rate will be controlled. For a description of the form of FDR used see Benjamini and Hochberg (1995). For a description of the use of FDR in this context see Hennessey and Wiegand (2017). For a description of the p_adjusted column in the returned structure see p.adjust.
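The Benjamini and Hochberg adjustment can be sketched with p.adjust from the stats package. The p-values below are invented, and comparing the adjusted values against fdr with `<=` is an assumption about how the threshold is applied rather than a statement of what coco does internally.

```r
# Invented raw p-values, one per (x, y) pair tested.
p_value <- c(0.001, 0.01, 0.03, 0.04, 0.5)
fdr <- 0.01

# Benjamini-Hochberg adjustment from the stats package.
p_adjusted <- p.adjust(p_value, method = "BH")

# Assumption: a pair counts as significant when its adjusted p-value
# is at or below the chosen FDR level.
significant <- p_adjusted <= fdr

# Fewer tests give a milder correction: the smallest p-value is adjusted
# less when only two tests are run instead of five.
p_adjusted_subset <- p.adjust(p_value[1:2], method = "BH")
```

The last step reflects the advice given for the collocates argument: restricting the set of tests reduces the size of the correction applied to each remaining p-value.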

The returned data structure is a data.table. A data.table is also a data.frame and will behave exactly like one if the data.table package is not loaded.

The returned data.table contains details of all the co-occurrences for which there is evidence of a difference in co-occurrence between the two supplied data sets. The effect size is calculated as the log base 2 of the odds ratio. The effect size and its confidence interval are captured in the effect_size, CI_lower and CI_upper columns. The p_value column contains the non-adjusted p-value from Fisher's Exact Test. For more details see Hennessey and Wiegand (2017).
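For a single (x, y) pair, the quantities named above can be approximated in base R with fisher.test. This is a minimal sketch with invented counts. It assumes that H and M are hit and miss counts, that corpus A forms the first row of the table (which fixes the sign of the effect size), and that the conditional maximum likelihood odds ratio reported by fisher.test is an acceptable stand-in; the package may differ on any of these points.

```r
# Invented hit (H) and miss (M) counts for one (x, y) pair in each corpus.
H_A <- 30L; M_A <- 70L
H_B <- 10L; M_B <- 90L

# 2 x 2 contingency table: rows are corpora, columns are hits and misses.
counts <- matrix(c(H_A, M_A, H_B, M_B), nrow = 2, byrow = TRUE,
                 dimnames = list(corpus = c("A", "B"), outcome = c("H", "M")))

test <- fisher.test(counts)  # two-sided, 95% confidence interval by default

# Express the odds ratio and its interval on the log base 2 scale.
effect_size <- log2(unname(test$estimate))
CI_lower <- log2(test$conf.int[1])
CI_upper <- log2(test$conf.int[2])
p_value <- test$p.value
```

On the log base 2 scale an effect size of 1 corresponds to a doubling of the odds and 0 to no difference, which makes increases and decreases symmetric around zero.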

For an example of usage see the ‘Proof of Concept’ vignette.

Value

A data.table of the form

    Classes ‘data.table’ and 'data.frame': 11 variables:
     $ x           : chr
     $ y           : chr
     $ H_A         : int
     $ M_A         : int
     $ H_B         : int
     $ M_B         : int
     $ effect_size : num
     $ CI_lower    : num
     $ CI_upper    : num
     $ p_value     : num
     $ p_adjusted  : num
     - attr(*, "sorted")= chr  "x" "y"
     - attr(*, ".internal.selfref")=<externalptr> 
     - attr(*, "coco_metadata")=List of 4
      ..$ nodes : chr
      ..$ fdr       : num
      ..$ PACKAGE_VERSION:Classes 'package_version', 'numeric_version'
      .. ..$ : int
      ..$ date      : Date, format: "2016-11-01"
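The coco_metadata attribute shown in the listing can be read with attr. The sketch below uses a stand-in object rather than a real coco result, so every value in it is a placeholder.

```r
# Stand-in for a coco() result: an empty data.frame carrying a metadata
# attribute shaped like the listing above (all values are placeholders).
result <- data.frame(x = character(0), y = character(0))
attr(result, "coco_metadata") <- list(
  nodes = c("back", "eyes"),
  fdr = 0.01,
  PACKAGE_VERSION = package_version("0.0.0"),
  date = as.Date("2016-11-01")
)

# Read back the settings recorded with the result.
metadata <- attr(result, "coco_metadata")
metadata$fdr
metadata$nodes
```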

References

Y. Benjamini and Y. Hochberg (1995) Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57(1), 289–300.

A. Hennessey, V. Wiegand, C. R. Tench and M. Mahlberg (2017) Comparing co-occurrences between corpora. In preparation.


ravingmantis/CorporaCoCo documentation built on March 19, 2018, 9:08 a.m.