N-Gram Based Text Categorization

Description

Categorize texts by computing their n-gram profiles and assigning each text to the category whose n-gram profile is closest.

Usage

textcat(x, p = textcat::TC_char_profiles, method = "CT", ...,
        options = list())

Arguments

x

a character vector of texts, or an R object which can be coerced to this using as.character, or a textcat profile db (see textcat_profile_db) created using the same method and options as p.

p

a textcat profile db. By default, the TextCat character profiles are used (see TC_char_profiles).

method

a character string specifying a built-in method, or a user-defined function for computing distances between n-gram profiles. See textcat_xdist for details.

...

options to be passed to the method for computing distances between profiles.

options

a list of such options.

Details

For each given text, its n-gram profile is computed using the options stored in the category profile db. The distances between this profile and each of the category profiles are then computed, and the text is assigned to the category of the closest profile. If the closest profile is not unique, the result is NA.
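The procedure described above can be illustrated with a small base R sketch. It is not the package's implementation: it is a simplified version of the rank-order ("out-of-place") measure of Cavnar and Trenkle, and the helper names `ngram_profile`, `out_of_place` and `categorize`, as well as the tiny training texts, are hypothetical and chosen only for illustration.

```r
# Simplified character n-gram profile: the `size` most frequent n-grams
# (for n = 1, ..., max_n) of a text, ordered from most to least frequent.
ngram_profile <- function(text, max_n = 3L, size = 300L) {
  chars <- strsplit(tolower(text), "")[[1L]]
  grams <- character()
  for (n in seq_len(max_n)) {
    if (length(chars) < n) break
    starts <- seq_len(length(chars) - n + 1L)
    grams <- c(grams, vapply(starts, function(i) {
      paste(chars[i:(i + n - 1L)], collapse = "")
    }, character(1L)))
  }
  counts <- sort(table(grams), decreasing = TRUE)
  names(counts)[seq_len(min(size, length(counts)))]
}

# Out-of-place distance: sum of rank differences between the two profiles;
# n-grams missing from the category profile get a maximal penalty.
out_of_place <- function(text_profile, category_profile) {
  ranks <- match(text_profile, category_profile)
  penalty <- length(category_profile)
  diffs <- ifelse(is.na(ranks), penalty,
                  abs(ranks - seq_along(text_profile)))
  as.numeric(sum(diffs))
}

# Assign the category of the closest profile; NA if the minimum is not unique.
categorize <- function(text, category_profiles) {
  prof <- ngram_profile(text)
  d <- vapply(category_profiles, out_of_place, numeric(1L),
              text_profile = prof)
  best <- names(d)[d == min(d)]
  if (length(best) == 1L) best else NA_character_
}

# Hypothetical, deliberately tiny category profiles.
categories <- list(
  english = ngram_profile(paste("the quick brown fox jumps over the lazy dog",
                                "and this is the way that they were there")),
  german  = ngram_profile(paste("der schnelle braune fuchs springt über den",
                                "faulen hund und das ist ein weg der nicht",
                                "sein kann"))
)
categorize("this is the way", categories)
```

Real profile dbs are built from much larger corpora; with training texts this short the sketch only demonstrates the mechanics, including the NA result when two categories are equally close.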

Unless the profile db uses bytes rather than characters, the texts in x should be encoded in UTF-8.
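As a minimal sketch of preparing input for a character-based profile db, a text whose encoding is known can be converted to UTF-8 with base R before categorization. The Latin-1 sample string is made up for illustration, and the final call only runs if the textcat package is installed.

```r
# A string stored as Latin-1 bytes ("\xf6" is o-umlaut in Latin-1).
latin1_text <- "Das ist ein sch\xf6ner Satz."
Encoding(latin1_text) <- "latin1"   # declare the actual encoding

# Convert to UTF-8 and check that the result is valid UTF-8.
utf8_text <- enc2utf8(latin1_text)
validUTF8(utf8_text)

# Categorize the converted text if textcat is available.
if (requireNamespace("textcat", quietly = TRUE)) {
  print(textcat::textcat(utf8_text))
}
```

For text in other known encodings, `iconv(x, from = ..., to = "UTF-8")` serves the same purpose.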

References

W. B. Cavnar and J. M. Trenkle (1994), N-Gram-Based Text Categorization. In “Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval”, 161–175. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.9367

K. Hornik, P. Mair, J. Rauch, W. Geiger, C. Buchta and I. Feinerer (2013). The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52/6, 1–17. http://www.jstatsoft.org/v52/i06/.

Examples

textcat(c("This is an english sentence.",
          "Das ist ein deutscher satz."))
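A further sketch, showing the p argument: a profile db other than the default can be passed explicitly. This assumes the ECIMCI_profiles data set shipped with textcat; the variable names `texts` and `langs` are illustrative, and the call is guarded so the snippet still runs when the package is not installed.

```r
texts <- c("This is an english sentence.",
           "Das ist ein deutscher satz.")

# Use the ECI/MCI-based profiles instead of the default TextCat ones.
langs <- if (requireNamespace("textcat", quietly = TRUE)) {
  textcat::textcat(texts, p = textcat::ECIMCI_profiles)
} else {
  rep(NA_character_, length(texts))   # placeholder when textcat is missing
}
langs
```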