View source: R/dictionary_dtm.R
A dictionary contains several groups of words. Often what we want is not the frequency of a single word but the total frequency of all words that belong to the same group. Given a dictionary, this function sums the frequencies of each group of words, so you do not need to do it manually.
dictionary_dtm(
  x,
  dictionary,
  type = "dtm",
  simple_sum = FALSE,
  return_dictionary = FALSE,
  checks = TRUE
)
x: an object of class DocumentTermMatrix or TermDocumentMatrix created by
dictionary: a dictionary telling the function how the words are grouped. It can be a list, matrix, data.frame or character vector. See Details for how to set this argument.
type: if x is a matrix, specify whether it represents a document term matrix or a term document matrix: a character string starting with "D" or "d" for a document term matrix, or with "T" or "t" for a term document matrix. The default is "dtm".
simple_sum: if it is
return_dictionary: if
checks: The default is
The argument dictionary can be set in different ways:
(1) list: each element represents a group of words. Each element should be a character vector; if it is not, the function will try to convert it. However, each element must have length > 0 and contain at least one non-NA word.
(2) matrix or data.frame: each entry should be a character string; if it is not, the function will try to convert it. At least one of the entries should not be NA. Each column (not row) represents a group of words.
(3) character vector: it represents a single group of words.
(4) Note: you do not need to worry about the same word appearing twice in a group, because the function counts it only once. Nor do you need to worry about words in a group that do not exist in the DTM/TDM, because the function simply ignores them. If none of the words in a group exists, the group still appears in the final result, with all of its frequencies equal to 0. Setting return_dictionary = TRUE lets you see which words do exist.
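The rules above can be illustrated with a small base R sketch. It is not the package's implementation: sum_by_group is a hypothetical helper written only for this illustration, and it assumes a plain numeric matrix in DTM layout (documents in rows, terms in columns).

```r
# Hypothetical helper (NOT the package's code): sums term frequencies by group.
sum_by_group <- function(m, groups) {
  # Accept a matrix, a data.frame, a list, or a single vector of words.
  if (is.matrix(groups)) groups <- as.data.frame(groups)  # columns are groups
  if (!is.list(groups)) groups <- list(groups)            # one group of words
  # Convert to character, count duplicated words once, and drop NA.
  groups <- lapply(as.list(groups), function(g) {
    g <- unique(as.character(g))
    g[!is.na(g)]
  })
  # Keep only the words that really exist as terms in the matrix.
  found <- lapply(groups, intersect, colnames(m))
  # One column per group; a group with no existing word stays all 0.
  sums <- matrix(0, nrow = nrow(m), ncol = length(found),
                 dimnames = list(rownames(m), names(found)))
  for (i in seq_along(found)) {
    sums[, i] <- rowSums(m[, found[[i]], drop = FALSE])
  }
  list(sums = sums, found = found)
}

m <- matrix(c(1, 1, 0,
              1, 0, 0,
              0, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"), c("drink", "eat", "cake")))
groups <- list(
  verbs  = c("drink", "eat", "drink", NA),  # duplicate and NA are ignored
  food   = factor(c("cake", "pizza")),      # factor is converted; "pizza" is absent
  absent = c("tiger", "dream")              # no word exists: all zeros
)
res <- sum_by_group(m, groups)
print(res$sums)
print(res$found)
```

The second element of the result plays the role that the named list of words plays when return_dictionary = TRUE: it shows which words of each group were actually found.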
If return_dictionary = FALSE, an object of class DocumentTermMatrix or TermDocumentMatrix is returned; if TRUE, a list is returned whose first element is the DTM/TDM and whose second element is a named list of words. However, if simple_sum = TRUE, the DTM/TDM in both cases is replaced by a vector.
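As a rough illustration of the difference between the matrix result and the vector result, the base R sketch below collapses a small group-by-document matrix into one total per group. It assumes that the vector returned under simple_sum = TRUE holds one corpus-wide total per group; the text above only says that a vector is returned, so treat this as an assumption. The matrix group_sums is made-up illustrative data.

```r
# Made-up group frequencies in DTM layout: documents in rows, groups in columns.
group_sums <- matrix(c(2, 1, 1,    # column "verbs"
                       0, 0, 1),   # column "food"
                     nrow = 3,
                     dimnames = list(c("d1", "d2", "d3"), c("verbs", "food")))

# Assumed meaning of the vector result: one total per group over all documents.
totals_dtm <- colSums(group_sums)

# In TDM layout (groups in rows) the same totals come from rowSums().
totals_tdm <- rowSums(t(group_sums))

print(totals_dtm)
```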
# Assumes the chinese.misc package, which provides corp_or_dtm()
# and dictionary_dtm(), is installed.
library(chinese.misc)

x <- c(
  "Hello, what do you want to drink and eat?",
  "drink a bottle of milk",
  "drink a cup of coffee",
  "drink some water",
  "eat a cake",
  "eat a piece of pizza"
)
dtm <- corp_or_dtm(x, from = "v", type = "dtm")

# Use `=` (not `<-`) inside list() and data.frame() so that the groups
# are named and no stray variables are created in the workspace.
D1 <- list(
  aa = c("drink", "eat"),
  bb = c("cake", "pizza"),
  cc = c("cup", "bottle")
)
y1 <- dictionary_dtm(dtm, D1, return_dictionary = TRUE)

# NA, duplicated words, non-existent words and
# non-character elements do not affect the result.
D2 <- list(
  has_na = c("drink", "eat", NA),
  this_is_factor = factor(c("cake", "pizza")),
  this_is_duplicated = c("cup", "bottle", "cup", "bottle"),
  do_not_exist = c("tiger", "dream")
)
y2 <- dictionary_dtm(dtm, D2, return_dictionary = TRUE)

# You can read a data.frame dictionary from a csv file.
# Each column represents a group.
D3 <- data.frame(
  aa = c("drink", "eat", NA, NA),
  bb = c("cake", "pizza", NA, NA),
  cc = c("cup", "bottle", NA, NA),
  dd = c("do", "to", "of", "and")
)
y3 <- dictionary_dtm(dtm, D3, simple_sum = TRUE)

# If x is a plain matrix, state its type explicitly:
mt <- t(as.matrix(dtm))
y4 <- dictionary_dtm(mt, D3, type = "t", return_dictionary = TRUE)
CHECKING ARGUMENTS
PROCESSING CHARACTER VECTOR
GENERATING CORPUS
PROCESSING CORPUS
MAKING DTM/TDM
DONE
Warning messages:
1: In Sys.setlocale(category = "LC_COLLATE", s_right_locale) :
OS reports request to set locale to "zh_CN.UTF-8" cannot be honored
2: In Sys.setlocale(category = "LC_CTYPE", s_right_locale) :
OS reports request to set locale to "zh_CN.UTF-8" cannot be honored
3: In tm_map.SimpleCorpus(corp, tm::removePunctuation) :
transformation drops documents
4: In tm_map.SimpleCorpus(corp, tm::removeNumbers) :
transformation drops documents
5: In tm_map.SimpleCorpus(corp, tm::content_transformer(tolower)) :
transformation drops documents
6: In tm_map.SimpleCorpus(corp, tm::stripWhitespace) :
transformation drops documents
CHECKING ARGUMENTS
COMPUTING
MAKING DTM
DONE
Warning messages:
1: In Sys.setlocale(category = "LC_COLLATE", s_right_locale) :
OS reports request to set locale to "zh_CN.UTF-8" cannot be honored
2: In Sys.setlocale(category = "LC_CTYPE", s_right_locale) :
OS reports request to set locale to "zh_CN.UTF-8" cannot be honored
CHECKING ARGUMENTS
---found NA in group 1
COMPUTING
MAKING DTM
DONE
Warning messages:
1: In Sys.setlocale(category = "LC_COLLATE", s_right_locale) :
OS reports request to set locale to "zh_CN.UTF-8" cannot be honored
2: In Sys.setlocale(category = "LC_CTYPE", s_right_locale) :
OS reports request to set locale to "zh_CN.UTF-8" cannot be honored
CHECKING ARGUMENTS
---found NA in group 1
---found NA in group 2
---found NA in group 3
COMPUTING
MAKING DTM
DONE
Warning messages:
1: In Sys.setlocale(category = "LC_COLLATE", s_right_locale) :
OS reports request to set locale to "zh_CN.UTF-8" cannot be honored
2: In Sys.setlocale(category = "LC_CTYPE", s_right_locale) :
OS reports request to set locale to "zh_CN.UTF-8" cannot be honored
CHECKING ARGUMENTS
---found NA in group 1
---found NA in group 2
---found NA in group 3
COMPUTING
MAKING TDM
DONE
Warning messages:
1: In Sys.setlocale(category = "LC_COLLATE", s_right_locale) :
OS reports request to set locale to "zh_CN.UTF-8" cannot be honored
2: In Sys.setlocale(category = "LC_CTYPE", s_right_locale) :
OS reports request to set locale to "zh_CN.UTF-8" cannot be honored