fcm | R Documentation |
Create a sparse feature co-occurrence matrix, measuring co-occurrences of features within a user-defined context. The context can be defined as a document or a window within a collection of documents, with an optional vector of weights applied to the co-occurrence counts.
fcm(
  x,
  context = c("document", "window"),
  count = c("frequency", "boolean", "weighted"),
  window = 5L,
  weights = NULL,
  ordered = FALSE,
  tri = TRUE,
  ...
)
x |
a tokens or dfm object from which to generate the feature co-occurrence matrix |
context |
the context in which to consider term co-occurrence:
|
count |
how to count co-occurrences:
|
window |
positive integer value for the size of a window on either side of the target feature, default is 5, meaning 5 words before and after the target feature |
weights |
a vector of weights applied to each distance from
|
ordered |
if |
tri |
if |
... |
not used here |
The function fcm() provides a very general implementation of a "context-feature" matrix, consisting of a count of feature co-occurrence within a defined context. This context, following Momtazi et al. (2010), can be defined as the document, sentences within documents, syntactic relationships between features (nouns within a sentence, for instance), or according to a window. When the context is a window, a weighting function is typically applied that is a function of distance from the target word (see Jurafsky and Martin 2015, Ch. 16) and ordered co-occurrence of the two features is considered (see Church & Hanks 1990).
fcm provides all of this functionality, returning a V * V matrix (where V is the vocabulary size, returned by nfeat()). The tri = TRUE option returns only the upper triangle of the matrix.
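To make the window context concrete, the following base R sketch counts co-occurrences within a window and weights each pair by its distance. The helper window_cooc() and its defaults are hypothetical and are not part of quanteda; fcm()'s own default weights and handling of edge cases may differ. The sketch assumes that, in the unordered case, a pair of occurrences of the same feature is counted once.

```r
# Hypothetical helper, not part of quanteda: count co-occurrences of tokens
# within `window` positions of each other, weighting a pair at distance d
# by weights[d].
window_cooc <- function(toks, window = 2L, weights = rep(1, window),
                        ordered = FALSE, tri = TRUE) {
  stopifnot(length(weights) == window)
  feats <- unique(toks)
  m <- matrix(0, length(feats), length(feats), dimnames = list(feats, feats))
  n <- length(toks)
  for (i in seq_len(n)) {
    for (d in seq_len(window)) {
      j <- i + d
      if (j > n) break
      # toks[i] precedes toks[j] at distance d: rows hold the earlier feature
      m[toks[i], toks[j]] <- m[toks[i], toks[j]] + weights[d]
    }
  }
  if (!ordered) {
    # Fold "before" and "after" counts together; keep self-pairs counted once
    s <- m + t(m)
    diag(s) <- diag(m)
    m <- s
    # Optionally keep only the upper triangle (diagonal included)
    if (tri) m[lower.tri(m)] <- 0
  }
  m
}

toks <- c("a", "b", "a", "c")
window_cooc(toks, window = 2L, weights = c(1, 0.5))
window_cooc(toks, window = 1L, ordered = TRUE)
```

With ordered = TRUE the matrix is generally asymmetric, so the sketch skips the upper-triangle step in that case.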
Unlike some implementations of co-occurrences, fcm counts feature co-occurrences with themselves, meaning that the diagonal will not be zero.
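A non-zero diagonal can be illustrated with a small base R sketch for a single document. The helper doc_cooc() is hypothetical and not part of quanteda. It assumes one plausible convention: two different features co-occur once for every pairing of their occurrences, and a feature co-occurs with itself once per unordered pair of its own occurrences. Whether fcm() applies exactly this rule is not stated here.

```r
# Hypothetical helper, not part of quanteda: document-level co-occurrence
# counts for one tokenized document.
doc_cooc <- function(toks) {
  n <- table(toks)                 # occurrences of each feature
  counts <- as.numeric(n)
  m <- outer(counts, counts)       # n_i * n_j pairings for features i != j
  diag(m) <- choose(counts, 2)     # unordered pairs of the same feature
  dimnames(m) <- list(names(n), names(n))
  m
}

# "a" occurs three times, so its diagonal entry is choose(3, 2)
doc_cooc(c("a", "a", "a", "b", "b", "c"))
```

Under this convention a feature that occurs only once in the document still has a zero diagonal entry.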
fcm also provides "boolean" counting within the context of "window", which differs from the counting within "document".
is.fcm(x) returns TRUE if and only if x is an object of type fcm.
Kenneth Benoit (R), Haiyan Wang (R, C++), Kohei Watanabe (C++)
Momtazi, S., Khudanpur, S., & Klakow, D. (2010). A comparative study of word co-occurrence for term clustering in language model-based sentence retrieval. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, Los Angeles, California, June 2010, 325-328. https://aclanthology.org/N10-1046/
Jurafsky, D. & Martin, J.H. (2018). From Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of September 23, 2018 (Chapter 6, Vector Semantics). Available at https://web.stanford.edu/~jurafsky/slp3/.
Church, K. W. & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22-29.
library("quanteda")
# see http://bit.ly/29b2zOA
toks1 <- tokens(c("A D A C E A D F E B A C E D"))
# unweighted counts within a window of 2 tokens on either side
fcm(toks1, context = "window", window = 2)
# distance-weighted counts, default weights
fcm(toks1, context = "window", count = "weighted", window = 3)
# custom weights for distances 1, 2, 3; ordered, full matrix
fcm(toks1, context = "window", count = "weighted", window = 3,
    weights = c(3, 2, 1), ordered = TRUE, tri = FALSE)
# with multiple documents
toks2 <- tokens(c("a a a b b c", "a a c e", "a c e f g"))
fcm(toks2, context = "document", count = "frequency")
fcm(toks2, context = "document", count = "boolean")
fcm(toks2, context = "window", window = 2)
# lowercased text with punctuation removed
txt3 <- c("The quick brown fox jumped over the lazy dog.",
          "The dog jumped and ate the fox.")
toks3 <- tokens(char_tolower(txt3), remove_punct = TRUE)
fcm(toks3, context = "document")
fcm(toks3, context = "window", window = 3)