Construct a sparse document-feature matrix from a character, corpus, tokens, or even another dfm object.
x | character, corpus, tokens, or dfm object
tolower | convert all features to lowercase
stem | if
select | a pattern of user-supplied features to keep, while excluding all others. This can be used in lieu of a dictionary if there are only specific features that a user wishes to keep. To extract only Twitter usernames, for example, set
remove | a pattern of user-supplied features to ignore, such as "stop words". To access one possible list (from any list you wish), use
dictionary | a dictionary object to apply to the tokens when creating the dfm
thesaurus | a dictionary object that will be applied as if
valuetype | the type of pattern matching:
case_insensitive | logical; if
groups | either: a character vector containing the names of document variables to be used for grouping; or a factor or object that can be coerced into a factor equal in length or rows to the number of documents.
verbose | display messages if
... | additional arguments passed to tokens; not used when
The default behaviour for remove/select when constructing ngrams using
dfm(x, ngrams > 1) is to remove/select any ngram constructed from a
matching feature. If you wish to remove these features before constructing
ngrams, you will need to first tokenize the texts, then remove the features
to be ignored, then form the ngrams, and finally construct the dfm from this
modified tokens object. See the code examples for an illustration.
To select on and match the features of another dfm, x must also be a dfm.
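As a rough sketch of selecting on the features of another dfm, the snippet below uses the standalone dfm_select() function rather than the select argument described above. The toy texts and the object names dfmat1, dfmat2 and dfmat_sel are made up for illustration, and the call assumes a quanteda version in which dfm_select() takes a pattern argument.

```r
library(quanteda)

# Two small dfms built from tokens (toy data, for illustration only)
dfmat1 <- dfm(tokens(c(d1 = "a b c d", d2 = "b c e")))
dfmat2 <- dfm(tokens(c(d3 = "b c f")))

# Keep only those features of dfmat1 that also occur in dfmat2.
# valuetype = "fixed" makes the feature names match literally,
# so characters such as "*" are not treated as wildcards.
dfmat_sel <- dfm_select(dfmat1, pattern = featnames(dfmat2),
                        valuetype = "fixed")

featnames(dfmat_sel)
```

Note that this only drops features absent from the second dfm; it does not add zero-count columns for features that occur only in the second dfm.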
a dfm object
When x is a dfm, groups provides a convenient and fast method of
combining and refactoring the documents of the dfm according to the groups.
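The grouping behaviour can be sketched with the standalone dfm_group() function, assuming it is available in the installed quanteda version. The documents and the grouping factor grp here are hypothetical; a factor with one entry per document is used because that form is described in the groups argument above.

```r
library(quanteda)

# Three toy documents (illustrative data only)
dfmat <- dfm(tokens(c(d1 = "a b", d2 = "b c", d3 = "c d")))

# One group label per document: d1 and d2 are combined, d3 stays alone
grp <- factor(c("g1", "g1", "g2"))

# Sum the feature counts of the documents within each group
dfmat_grouped <- dfm_group(dfmat, groups = grp)

ndoc(dfmat_grouped)
docnames(dfmat_grouped)
```

The grouped dfm has one row per level of grp, with counts summed over the documents in that group.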
dfm_select(), dfm
## for a corpus
corp <- corpus_subset(data_corpus_inaugural, Year > 1980)
dfm(corp)
dfm(corp, tolower = FALSE)
# grouping documents by docvars in a corpus
dfm(corp, groups = "President", verbose = TRUE)
# with English stopwords and stemming
dfm(corp, remove = stopwords("english"), stem = TRUE, verbose = TRUE)
# works for both words in ngrams too
tokens("Banking industry") %>%
tokens_ngrams(n = 2) %>%
dfm(stem = TRUE)
# with dictionaries
dict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
opposition = c("Opposition", "reject", "notincorpus"),
taxing = "taxing",
taxation = "taxation",
taxregex = "tax*",
country = "states"))
dfm(corpus_subset(data_corpus_inaugural, Year > 1900), dictionary = dict)
# selecting stopwords (select keeps the matches; use remove to drop them instead)
txt <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, with
the newspaper from a boy named Seamus, in his mouth."
corp <- corpus(txt)
# note: "also" is not in the default stopwords("english")
featnames(dfm(corp, select = stopwords("english")))
# for ngrams
featnames(dfm(corp, ngrams = 2, select = stopwords("english"), remove_punct = TRUE))
featnames(dfm(corp, ngrams = 1:2, select = stopwords("english"), remove_punct = TRUE))
# removing stopwords before constructing ngrams
toks1 <- tokens(char_tolower(txt), remove_punct = TRUE)
toks2 <- tokens_remove(toks1, stopwords("english"))
toks3 <- tokens_ngrams(toks2, 2)
featnames(dfm(toks3))
# keep only certain words
dfm(corp, select = "*s") # keep only words ending in "s"
dfm(corp, select = "s$", valuetype = "regex")
# testing Twitter functions
txttweets <- c("My homie @justinbieber #justinbieber shopping in #LA yesterday #beliebers",
"2all the ha8ers including my bro #justinbieber #emabiggestfansjustinbieber",
"Justin Bieber #justinbieber #belieber #fetusjustin #EMABiggestFansJustinBieber")
dfm(txttweets, select = "#*", split_tags = FALSE) # keep only hashtags
dfm(txttweets, select = "^#.*$", valuetype = "regex", split_tags = FALSE)
# for a dfm
dfm(corpus_subset(data_corpus_inaugural, Year > 1980), groups = "Party")