node_edge: Create Node-Edge Table


View source: R/node_edge.R

Description

Creates a table of nodes and edges based on a language corpus. Edge weights represent the average distance between two words, corrected by their frequency of appearance.

Usage

node_edge(
  input,
  maxDist = 4,
  removeStopwords = FALSE,
  binaryPenalty = FALSE,
  showProgress = TRUE
)

Arguments

input

Either a vector of strings, or a model generated by the language_model function

maxDist

The maximum distance at which two words are considered co-occurring. Default is 4 (e.g. in "I went to the store," "I" and "store" are 4 words apart); see the distance sketch after this argument list.

removeStopwords

If TRUE, words in quanteda's stopwords() function are excluded from the analysis. Defaults to FALSE.

binaryPenalty

If TRUE, the edge weight for each word pair is penalized by the edge weight of that same word pair in the opposing text dataset. Defaults to FALSE. See Details.

showProgress

If TRUE, progress bars are displayed. Defaults to TRUE.
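
To make the maxDist window concrete, the base R sketch below enumerates the pairwise distances in the sentence used in the maxDist description. It only illustrates the distance convention (positions of words in the sentence); it is not the package's internal pairing code, and the variable names are invented for the example.

sentence = c("I", "went", "to", "the", "store")
maxDist = 4

# All word-position pairs, their distances, and whether they fall within the window
pairs = t(combn(seq_along(sentence), 2))
data.frame(first         = sentence[pairs[, 1]],
           second        = sentence[pairs[, 2]],
           distance      = pairs[, 2] - pairs[, 1],
           within_window = (pairs[, 2] - pairs[, 1]) <= maxDist)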

Details

This function quantifies the relationship between words in the provided text.
It computes a measure of inverse average distance between word pairs, and then multiplies that by a Dice coefficient (to control for the frequency of occurrence and co-occurrence of words).
Specifically, the formula used is:

weight = \frac{1}{\bar{D}} \cdot \frac{2|X \cap Y|}{|X| + |Y|}

where:
\bar{D} = the mean distance between word X and word Y
|X \cap Y| = the number of co-occurrences of word X and word Y
|X| = the number of occurrences of word X across all pairs
|Y| = the number of occurrences of word Y across all pairs
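
As a concrete illustration, the sketch below works through the formula for a single hypothetical word pair; all of the counts are invented for the example and do not come from any real corpus.

# Hypothetical counts for a word pair X = "good", Y = "movie"
mean_dist    = 2.5   # average distance between X and Y when they co-occur
cooc_count   = 10    # |X intersect Y|: co-occurrences of X and Y
first_count  = 40    # |X|: occurrences of X across all pairs
second_count = 30    # |Y|: occurrences of Y across all pairs

inverse_mean_distance = 1 / mean_dist                       # 0.4
dice   = (2 * cooc_count) / (first_count + second_count)    # ~0.286
weight = inverse_mean_distance * dice                       # ~0.114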

If a model predicting a binary outcome variable is provided (and thus two separate word networks will be plotted, one for the text corresponding to each outcome level),
the binaryPenalty argument is available. This penalizes the edge weights in a given network by the strength of the edge weight for the same word pair in the
opposing network. For example, if the word pair "my house" appears in the text for Outcome 0 with a weight of .33 and in the text for Outcome 1 with a weight
of .21, applying the binaryPenalty results in weights of .33*(1 - .21) and .21*(1 - .33), respectively. If a word pair only appears in one outcome's text, its
weight is unmodified.
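
To make the penalty arithmetic explicit, the lines below write out the "my house" example from the paragraph above (the .33 and .21 weights are just the illustrative numbers used there):

w0 = 0.33                      # weight of "my house" in the Outcome 0 network
w1 = 0.21                      # weight of "my house" in the Outcome 1 network

w0_penalized = w0 * (1 - w1)   # 0.33 * 0.79 = 0.2607
w1_penalized = w1 * (1 - w0)   # 0.21 * 0.67 = 0.1407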

The output dataframe will contain the following columns:
- **first** and **second**: the nodes specifying the two words of the pair
- **inverse_mean_distance**: the mean distance between the word pair, expressed as its inverse so that words appearing closer together receive greater weight (\frac{1}{\bar{D}})
- **cooc_count**: the number of co-occurrences of the **first** and **second** words (|X \cap Y|)
- **first_count**: the number of times the **first** word appears in a pair (|X|)
- **second_count**: the number of times the **second** word appears in a pair (|Y|)
- **weight**: the final weight, calculated as above
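
For orientation, the snippet below builds a mock single-row version of the output using the hypothetical counts from the earlier sketch; it only shows the column layout and is not produced by node_edge itself.

# Mock row illustrating the column layout of the returned dataframe
data.frame(first = "good", second = "movie",
           inverse_mean_distance = 1 / 2.5,
           cooc_count = 10, first_count = 40, second_count = 30,
           weight = (1 / 2.5) * (2 * 10) / (40 + 30))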

Value

A dataframe with node and edge weight information, along with occurrence counts. See "Details."

Examples

## Not run: 
movie_review_data1$cleanText = clean_text(movie_review_data1$text)

# Using language to predict "Positive" vs. "Negative" reviews
movie_model_valence = language_model(movie_review_data1,
                                     outcomeVariableColumnName = "valence",
                                     outcomeVariableType = "binary",
                                     textColumnName = "cleanText")

node_edge_table = node_edge(movie_model_valence)

## End(Not run)
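
Because input can also be a plain character vector, a node-edge table can be built without first fitting a model. The call below is a sketch based on the documented arguments; the specific maxDist value and option settings are arbitrary choices for illustration.

## Not run: 
node_edge_table_raw = node_edge(movie_review_data1$cleanText,
                                maxDist = 3,
                                removeStopwords = TRUE,
                                showProgress = FALSE)

## End(Not run)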
