This function selects or removes features from a dfm or fcm, based on
feature name matches with pattern. The most common usages are to eliminate
features from a dfm already constructed, such as stopwords, or to select
only terms of interest from a dictionary.
dfm_select(
  x,
  pattern = NULL,
  selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  min_nchar = NULL,
  max_nchar = NULL,
  verbose = quanteda_options("verbose")
)
dfm_remove(x, ...)
dfm_keep(x, ...)
fcm_select(
  x,
  pattern = NULL,
  selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  verbose = quanteda_options("verbose"),
  ...
)
fcm_remove(x, pattern = NULL, ...)
fcm_keep(x, pattern = NULL, ...)
x | the dfm or fcm object whose features will be selected
pattern | a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.
selection | whether to "keep" or "remove" the features
valuetype | the type of pattern matching: "glob", "regex", or "fixed"
case_insensitive | logical; if TRUE, ignore case when matching pattern
min_nchar, max_nchar | optional numerics specifying the minimum and maximum length in characters for tokens to be removed or kept; defaults are NULL
verbose | if TRUE, print messages about the selection
... | used only for passing arguments from dfm_remove or dfm_keep to dfm_select, and from fcm_remove or fcm_keep to fcm_select
dfm_remove and fcm_remove are simply convenience wrappers for calling
dfm_select and fcm_select with selection = "remove".
dfm_keep and fcm_keep are simply convenience wrappers for calling
dfm_select and fcm_select with selection = "keep".
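The wrapper relationship described above can be checked with a short sketch. The example text and the object names (dfmat_demo, removed_wrapper, and so on) are made up for illustration and are not part of the package documentation; the sketch assumes a quanteda version in which dfm() is applied to a tokens object.

```r
library(quanteda)

# Illustrative input only: a single short document
toks_demo <- tokens("the cat sat on the mat")
dfmat_demo <- dfm(toks_demo)

# Removing features: wrapper call versus explicit selection = "remove"
removed_wrapper <- dfm_remove(dfmat_demo, pattern = c("the", "on"))
removed_select <- dfm_select(dfmat_demo, pattern = c("the", "on"),
                             selection = "remove")

# Keeping features: wrapper call versus explicit selection = "keep"
kept_wrapper <- dfm_keep(dfmat_demo, pattern = c("cat", "mat"))
kept_select <- dfm_select(dfmat_demo, pattern = c("cat", "mat"),
                          selection = "keep")

# Each pair should retain the same set of features
setequal(featnames(removed_wrapper), featnames(removed_select))
setequal(featnames(kept_wrapper), featnames(kept_select))
```

The same comparison should carry over to fcm_remove and fcm_keep against fcm_select, since they are described as wrappers in the same way.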
A dfm or fcm object, after the feature selection has been applied.
For compatibility with earlier versions, when pattern is a dfm object and
selection = "keep", then this will be equivalent to calling dfm_match(). In
this case, the following settings are always used: case_insensitive = FALSE,
and valuetype = "fixed". This functionality is deprecated, however, and you
should use dfm_match() instead.
This function selects features based on their labels. To select features
based on the values of the document-feature matrix, use dfm_trim().
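As a rough illustration of the difference between selecting by label and selecting by value, the sketch below applies both functions to the same small dfm. The input text and object names are invented for this example, and min_termfreq is used here as one of several dfm_trim() thresholds; treat it as a sketch rather than a recommended workflow.

```r
library(quanteda)

# Illustrative input only: "tax" occurs three times, other terms once
toks_demo <- tokens(c("tax tax plan", "tax reform"))
dfmat_demo <- dfm(toks_demo)

# Selection by label: keep features whose names match the glob pattern "re*"
by_label <- dfm_select(dfmat_demo, pattern = "re*", selection = "keep")

# Selection by value: keep features that occur at least twice overall
by_value <- dfm_trim(dfmat_demo, min_termfreq = 2)

featnames(by_label)
featnames(by_value)
```

Here the first call depends only on the feature names, while the second depends only on the counts stored in the matrix.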
library(quanteda)

# dfm() is applied to tokens here, since recent quanteda releases no longer accept raw character input
dfmat <- dfm(tokens(c("My Christmas was ruined by your opposition tax plan.",
                      "Does the United_States or Sweden have more progressive taxation?")),
             tolower = FALSE)
dict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                        wordsEndingInY = c("by", "my"),
                        notintext = "blahblah"))
dfm_select(dfmat, pattern = dict)
dfm_select(dfmat, pattern = dict, case_insensitive = FALSE)
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "keep", valuetype = "regex")
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "remove", valuetype = "regex")
dfm_select(dfmat, pattern = stopwords("english"), selection = "keep", valuetype = "fixed")
dfm_select(dfmat, pattern = stopwords("english"), selection = "remove", valuetype = "fixed")
# select based on character length
dfm_select(dfmat, min_nchar = 5)
dfmat <- dfm(tokens(c("This is a document with lots of stopwords.",
                      "No if, and, or but about it: lots of stopwords.")))
dfmat
dfm_remove(dfmat, stopwords("english"))
toks <- tokens(c("this contains lots of stopwords",
                 "no if, and, or but about it: lots"),
               remove_punct = TRUE)
fcmat <- fcm(toks)
fcmat
fcm_remove(fcmat, stopwords("english"))