**Description:**

This function calculates the co-occurrence of features and returns a network/graph in the igraph format, where nodes are tokens and edges represent the similarity/adjacency of tokens. Co-occurrence is calculated based on how often two tokens co-occur within a given token distance.

**Usage:**

```
## R6 method for class tCorpus. Use as tc$method (where tc is a tCorpus object).

semnet_window(feature, measure = c('con_prob', 'cosine', 'count_directed', 'count_undirected', 'chi2'),
              context_level = c('document','sentence'), window.size = 10, direction = '<>',
              backbone = F, n.batches = 5, set_matrix_mode = c(NA, 'windowXwindow', 'positionXwindow'))
```

**Arguments:**

| Argument | Description |
|---|---|
| `feature` | The name of the feature column. |
| `measure` | The similarity measure. Currently supports: "con_prob" (conditional probability), "cosine" similarity, "count_directed" (i.e., the number of co-occurrences), "count_undirected" (same as count_directed, but returned as an undirected network) and "chi2" (chi-square score). |
| `context_level` | Determines whether features need to co-occur within documents ("document") or sentences ("sentence"). |
| `window.size` | The token distance within which features are considered to co-occur. |
| `direction` | Determines whether co-occurrence is symmetrical ("<>") or takes the order of tokens into account. If direction is '<', then the from/x feature needs to occur before the to/y feature. If direction is '>', then after. |
| `backbone` | If TRUE, add an edge attribute for the backbone alpha. |
| `n.batches` | If a number, perform the calculation in batches. |
| `set_matrix_mode` | Advanced feature. There are two approaches for calculating window co-occurrence. One is to measure how often a feature occurs within a given token window, which can be calculated by taking the inner product of a matrix that contains the exact position of features and a matrix that contains the occurrence window. We refer to this as the "positionXwindow" mode. Alternatively, we can measure how much the windows of features overlap, for which we take the inner product of two window matrices. We refer to this as the "windowXwindow" mode. By default, semnet_window takes the mode that we deem most appropriate for the similarity measure. Substantively, the positionXwindow approach has the advantage of being very easy to interpret (e.g., how likely is feature "Y" to occur within 10 tokens from feature "X"?). The windowXwindow mode, on the other hand, has the interesting property that similarity is stronger if tokens co-occur more closely together (since then their windows overlap more). Currently, we only use the windowXwindow mode for cosine similarity. By using the set_matrix_mode parameter you can override this. |
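
The difference between the two matrix modes can be illustrated with a toy sketch in base R. This is not the package's implementation: it uses small dense matrices on a single hypothetical token sequence, and the names `position`, `window`, `pos_x_win` and `win_x_win` are made up for this illustration. Self-pairs on the diagonal are left in for simplicity.

```r
# Toy illustration only; assumes a single document and dense matrices.
tokens <- c("A", "B", "C", "A", "D", "B")
window.size <- 1

features <- sort(unique(tokens))
n <- length(tokens)

# position[i, f] is 1 if the token at position i is feature f
position <- outer(tokens, features, "==") * 1
colnames(position) <- features

# near[i, j] is 1 if positions i and j are at most window.size tokens apart
near <- (abs(outer(seq_len(n), seq_len(n), "-")) <= window.size) * 1

# window[i, f] is 1 if feature f occurs within window.size tokens of position i
window <- (near %*% position > 0) * 1

# "positionXwindow": how often does feature y fall inside the window
# around an exact occurrence of feature x?
pos_x_win <- crossprod(position, window)

# "windowXwindow": at how many positions do the windows of x and y overlap?
win_x_win <- crossprod(window)

pos_x_win
win_x_win
```

In this sketch, `pos_x_win["A", "B"]` counts the occurrences of A that have a B within one token, whereas `win_x_win["A", "B"]` counts every position that lies in both the window of A and the window of B, so the second value grows as the two features appear closer together.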

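**Examples:**

```
library(corpustools)

# Four short documents; the second one contains two sentences
text = c('A B C', 'D E F. G H I', 'A D', 'GGG')
tc = create_tcorpus(text, doc_id = c('a','b','c','d'), split_sentences = TRUE)

# Co-occurrence network of tokens within a token distance of 1
g = tc$semnet_window('token', window.size = 1)
g
igraph::get.data.frame(g)  # edge list as a data.frame
## Not run: plot_semnet(g)
```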
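
As a further sketch, the arguments described above can be combined to restrict co-occurrence to sentences and to take token order into account. The call below uses only arguments from the usage signature; the variable name `g_dir` is arbitrary, and the exact edges returned may depend on the installed corpustools version.

```r
library(corpustools)

text <- c('A B C', 'D E F. G H I', 'A D', 'GGG')
tc <- create_tcorpus(text, doc_id = c('a','b','c','d'), split_sentences = TRUE)

# Count co-occurrences within sentences, within a token distance of 2,
# where the from/x feature has to occur before the to/y feature
g_dir <- tc$semnet_window('token', measure = 'count_directed',
                          context_level = 'sentence', window.size = 2,
                          direction = '<')

# Inspect the resulting edge list
igraph::get.data.frame(g_dir)
```

Because `context_level` is set to 'sentence', tokens on either side of the sentence break in document 'b' should not be linked, even though they are adjacent in the text.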

kasperwelbers/corpustools documentation built on Sept. 1, 2018, 1:03 p.m.
