Creates a table of nodes and edges from a text corpus. Edge weights are based on the inverse of the average distance between two words, adjusted for how frequently the words occur.
input |
Either a vector of strings, or a model generated by the |
maxDist |
The maximum distance at which two words are considered co-occurring. Default is 4 (i.e. in "I went to the store," "I" and "store" are 4 words apart). |
removeStopwords |
If TRUE, words in |
binaryPenalty |
If TRUE, edge weights for each word pair will be penalized according to the edge weight of that word pair in the opposing text dataset. See Details. |
showProgress |
If TRUE, progress bars are displayed. Defaults to TRUE. |
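To make the maxDist argument concrete, here is a minimal base-R sketch of how word pairs within a distance window could be enumerated. The helper name `window_pairs` and the lower-cased tokens are assumptions made for illustration; this is not the package's internal implementation, and it expects at least two tokens.

```r
# Hypothetical helper (not part of the package): list every ordered pair of
# tokens whose positions are at most max_dist words apart.
window_pairs <- function(tokens, max_dist = 4) {
  idx <- combn(seq_along(tokens), 2)   # all position pairs (i < j)
  distance <- idx[2, ] - idx[1, ]      # distance in words between positions
  keep <- distance <= max_dist         # apply the maxDist-style cutoff
  data.frame(
    first    = tokens[idx[1, keep]],
    second   = tokens[idx[2, keep]],
    distance = distance[keep],
    stringsAsFactors = FALSE
  )
}

tokens <- c("i", "went", "to", "the", "store")
pairs_4 <- window_pairs(tokens, max_dist = 4)  # "i" and "store" are kept
pairs_2 <- window_pairs(tokens, max_dist = 2)  # "i" and "store" are dropped
```

With a window of 4, the pair ("i", "store") is included at distance 4, matching the example in the maxDist description; a smaller window excludes it.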
This function quantifies the relationship between words in the provided text.
It computes a measure of inverse average distance between word pairs, and then multiplies that by a Dice coefficient (to control for the frequency of occurrence and co-occurrence of words).
Specifically, the formula used is:
weight = \frac{1}{\bar{D}}*\frac{2*|X \cap Y|}{|X| + |Y|}
where:
\bar{D} = mean distance between word X and word Y
|X \cap Y| = number of co-occurrences of word X and word Y
|X| = number of occurrences of word X across all pairs
|Y| = number of occurrences of word Y across all pairs
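As a worked illustration of the formula, the following base-R sketch computes the weight for one word pair from a small, made-up table of pair observations. The data values and variable names are hypothetical, and counting a word's occurrences in either position of a pair is an assumption about how "across all pairs" is meant.

```r
# Hypothetical pair observations: each row is one co-occurrence of two words
# and the distance (in words) between them.
pairs <- data.frame(
  first    = c("my", "my", "my", "big"),
  second   = c("house", "house", "big", "house"),
  distance = c(1, 3, 1, 1),
  stringsAsFactors = FALSE
)

# Focus on the pair ("my", "house").
is_pair <- pairs$first == "my" & pairs$second == "house"

mean_distance <- mean(pairs$distance[is_pair])   # D-bar: (1 + 3) / 2
cooc_count    <- sum(is_pair)                    # |X n Y|

# Assumption: a word is counted whenever it appears in either position.
x_count <- sum(pairs$first == "my"    | pairs$second == "my")     # |X|
y_count <- sum(pairs$first == "house" | pairs$second == "house")  # |Y|

dice   <- 2 * cooc_count / (x_count + y_count)   # Dice coefficient
weight <- (1 / mean_distance) * dice             # inverse mean distance * Dice
```

Here the mean distance is 2, the pair co-occurs twice, and each word appears in three pairs, so the weight is (1/2) * (4/6) = 1/3.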
If a model predicting a binary outcome variable is provided (and thus two separate word networks will be plotted, one for the text corresponding to each outcome), the binaryPenalty argument is available. This penalizes the edge weights for a given network by the strength of the edge weight for the same word pair in the opposing network. So if the word pair "my house" appears in the text for Outcome 0 with a weight of .33, and it also appears in the text for Outcome 1 with a weight of .21, applying the binaryPenalty would result in weights of .33*(1 - .21) and .21*(1 - .33). If a word pair only appears in one Outcome text, its weight is unmodified.
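The penalty arithmetic for the "my house" example can be checked with a short base-R sketch. The function name `apply_binary_penalty` is hypothetical and only mirrors the formula stated in the text; it is not taken from the package.

```r
# Hypothetical helper: penalize a pair's weight by the weight of the same
# pair in the opposing network, as described in the text.
apply_binary_penalty <- function(own_weight, opposing_weight) {
  own_weight * (1 - opposing_weight)
}

weight_outcome0 <- 0.33   # "my house" in the Outcome 0 text
weight_outcome1 <- 0.21   # "my house" in the Outcome 1 text

penalized0 <- apply_binary_penalty(weight_outcome0, weight_outcome1)  # .33 * (1 - .21)
penalized1 <- apply_binary_penalty(weight_outcome1, weight_outcome0)  # .21 * (1 - .33)
```

The penalized weights come out to 0.2607 and 0.1407, so a pair that is strong in both networks is down-weighted in each.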
The output dataframe will contain the following columns:
- **first** and **second** columns: the nodes specifying the two words of the pair
- **inverse_mean_distance**: the inverse of the mean distance between the word pair, so that words that are closer together receive greater weight (\frac{1}{\bar{D}})
- **cooc_count**: the number of co-occurrences of the **first** and **second** words (|X \cap Y|)
- **first_count**: the number of times the **first** word appears in a pair (|X|)
- **second_count**: the number of times the **second** word appears in a pair (|Y|)
- **weight**: the final weight, calculated as above
A dataframe with node and edge weight information, along with occurrence counts. See "Details."
## Not run:
movie_review_data1$cleanText = clean_text(movie_review_data1$text)

# Using language to predict "Positive" vs. "Negative" reviews
movie_model_valence = language_model(movie_review_data1,
                                     outcomeVariableColumnName = "valence",
                                     outcomeVariableType = "binary",
                                     textColumnName = "cleanText")

node_edge_table = node_edge(movie_model_valence)
## End(Not run)