textcat | R Documentation |
Categorize texts by computing their n-gram profiles, and finding the closest category n-gram profile.
textcat(x, p = textcat::TC_char_profiles, method = "CT", ..., options = list())
x | a character vector of texts, or an R object which can be coerced to this using |
p | a textcat profile db. By default, the TextCat character profiles are used (see |
method | a character string specifying a built-in method, or a user-defined function for computing distances between n-gram profiles. See |
... | options to be passed to the method for computing distances between profiles. |
options | a list of such options. |
For each given text, its n-gram profile is computed using the options in the category profile db. Then, the distance between this profile and the category profiles is computed, and the text is categorized into the category of the closest profile (if this is not unique, NA is obtained).
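The procedure described above can be illustrated with a small base R sketch. This is not the textcat implementation: the helpers `ngram_profile()`, `out_of_place()` and `categorize()` are hypothetical names introduced here, the profile is a plain frequency-ranked vector of character n-grams (n = 1 to 3), and the distance is a simplified rank-based "out-of-place" measure in the spirit of Cavnar and Trenkle. It is meant only to show the three steps (profile, distance, closest category with NA on ties).

```r
# Simplified sketch in base R; NOT how textcat computes profiles or distances.

# Hypothetical helper: character n-grams (n = 1..n_max) ranked by frequency.
ngram_profile <- function(text, n_max = 3L, size = 300L) {
  text <- tolower(text)
  len <- nchar(text)
  grams <- unlist(lapply(seq_len(n_max), function(n) {
    if (len < n) return(character())
    starts <- seq_len(len - n + 1L)
    substring(text, starts, starts + n - 1L)
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  head(names(counts), size)
}

# Hypothetical helper: sum of rank differences; n-grams missing from the
# category profile receive a fixed maximum penalty.
out_of_place <- function(doc, cat) {
  pos <- match(doc, cat)
  penalty <- ifelse(is.na(pos), length(cat), abs(pos - seq_along(doc)))
  as.numeric(sum(penalty))
}

# Hypothetical helper: pick the closest category, NA if the minimum is tied.
categorize <- function(text, profiles) {
  doc <- ngram_profile(text)
  d <- vapply(profiles, function(p) out_of_place(doc, p), numeric(1L))
  best <- names(d)[d == min(d)]
  if (length(best) == 1L) best else NA_character_
}

# Toy category profiles built from one sentence each (assumed sample data).
en_text <- "the quick brown fox jumps over the lazy dog and then the cat"
de_text <- "der schnelle braune fuchs springt und der faule hund schaut zu"
profiles <- list(english = ngram_profile(en_text),
                 german  = ngram_profile(de_text))

categorize(en_text, profiles)
```

Real profile dbs are built from much larger samples per category, so a toy profile like this one is only useful for following the mechanics.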
Unless the profile db uses bytes rather than characters, the texts in x should be encoded in UTF-8.
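As a sketch of how input in another encoding might be prepared, the following base R snippet converts a Latin-1 string to UTF-8 with `iconv()` and checks the result with `validUTF8()`. The sample bytes and the variable names are assumptions made for illustration; the source encoding of real data has to be known or determined separately.

```r
# Build a Latin-1 encoded string from raw bytes ("Gr", u-umlaut, sharp s, "e"),
# so that the example does not depend on the encoding of this file.
x_latin1 <- rawToChar(as.raw(c(0x47, 0x72, 0xfc, 0xdf, 0x65)))

# These bytes are not valid UTF-8 as they stand.
validUTF8(x_latin1)

# Convert to UTF-8, stating the source encoding explicitly.
x_utf8 <- iconv(x_latin1, from = "latin1", to = "UTF-8")

# The converted string is valid UTF-8 and has five characters.
validUTF8(x_utf8)
nchar(x_utf8)

# x_utf8 could now be passed as the x argument of textcat().
```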
W. B. Cavnar and J. M. Trenkle (1994), N-Gram-Based Text Categorization. In “Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval”, 161–175. https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.9367
K. Hornik, P. Mair, J. Rauch, W. Geiger, C. Buchta and I. Feinerer (2013). The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52/6, 1–17. doi: 10.18637/jss.v052.i06.
textcat(c("This is an english sentence.", "Das ist ein deutscher satz."))