textcat_profile_db                R Documentation
Create n-gram profile dbs for text categorization.
Usage:

textcat_profile_db(x, id = NULL, method = NULL, ...,
                   options = list(), profiles = NULL)
Arguments:

x: a character vector of text documents, or an R object of text
documents extractable via as.character.

id: a character vector giving the categories of the texts, to be
recycled to the length of x.

method: a character string specifying a built-in method, or a
user-defined function for computing n-gram profiles; see Details.

...: options to be passed to the method for creating profiles.

options: a list of such options.

profiles: a textcat profile db object.
Details:

The text documents are split according to the given categories, and
n-gram profiles are computed using the specified method. If
profiles is not NULL, the options used for creating it are reused;
otherwise, the options given in ... and options are combined and
merged with the default profile options specified by the textcat
option profile_options, using exact name matching. The method and
options employed for building the db are stored in the db as
attributes "method" and "options", respectively.
There is a c method for combining profile dbs, provided that these
have identical options. There is also a [ method for subscripting,
and as.matrix and as.simple_triplet_matrix methods to
“export” the profiles to a dense matrix or to the sparse simple
triplet matrix representation provided by package slam,
respectively.
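Conceptually, the dense-matrix export aligns the n-gram counts of all profiles on the union of their n-grams, with zeros for n-grams absent from a profile. A base-R sketch of that alignment on two toy profiles (hypothetical data; not the actual as.matrix method):

```r
## Two toy profiles as named n-gram count vectors.
p1 <- c("_t" = 5, "th" = 4, "he" = 3)
p2 <- c("_t" = 2, "er" = 6)

## Align both profiles on the union of their n-grams.
grams <- union(names(p1), names(p2))
tab <- rbind(p1 = p1[grams], p2 = p2[grams])
colnames(tab) <- grams
tab[is.na(tab)] <- 0   # n-grams missing from a profile count as zero
tab
```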
Currently, the only available built-in method is "textcnt",
which has the following options:

n: a numeric vector giving the numbers of characters or bytes in the
n-gram profiles. Default: 1:5.

split: the regular expression pattern to be used in word splitting.
Default: "[[:space:][:punct:][:digit:]]+".

perl: a logical indicating whether to use Perl-compatible regular
expressions in word splitting. Default: FALSE.

tolower: a logical indicating whether to transform texts to lower
case (after word splitting). Default: TRUE.

reduce: a logical indicating whether a representation of n-grams
more efficient than the one used by Cavnar and Trenkle should be
employed. Default: TRUE.

useBytes: a logical indicating whether to use byte n-grams rather
than character n-grams. Default: FALSE.

ignore: a character vector of n-grams to be ignored when computing
n-gram profiles. Default: "_" (corresponding to a word boundary).

size: the maximal number of n-grams used for a profile.
Default: 1000L.
This method uses textcnt in package tau for
computing n-gram profiles, with n, split,
perl and useBytes corresponding to the respective
textcnt arguments, and option reduce setting argument
marker as needed. N-grams listed in option ignore
are removed, and only the most frequent remaining ones retained, with
the maximal number given by option size.
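The overall pipeline (split into words, mark word boundaries, count n-grams, drop ignored ones, keep the most frequent) can be approximated in base R. The following is only a sketch of the idea with the defaults above, not the tau::textcnt implementation:

```r
## Toy re-implementation of the profile-building pipeline (sketch only).
toy_profile <- function(x, n = 1:5,
                        split = "[[:space:][:punct:][:digit:]]+",
                        ignore = "_", size = 1000L) {
  words <- tolower(unlist(strsplit(x, split)))
  words <- paste0("_", words[nzchar(words)], "_")  # mark word boundaries
  grams <- unlist(lapply(words, function(w)
    unlist(lapply(n, function(k) {
      if (nchar(w) < k) return(character())
      substring(w, 1:(nchar(w) - k + 1), k:nchar(w))
    }))))
  counts <- sort(table(grams), decreasing = TRUE)
  counts <- counts[setdiff(names(counts), ignore)]  # drop ignored n-grams
  head(counts, size)                                # keep the most frequent
}

head(toy_profile("The quick brown fox"), 5)
```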
Unless the profile db uses bytes rather than characters (i.e., option
useBytes is TRUE), text documents in x containing
non-ASCII characters must declare their encoding (see
Encoding), and will be re-encoded to UTF-8.
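Declaring an encoding is done with Encoding<-; the subsequent re-encoding to UTF-8 corresponds to what base R's enc2utf8 does, as a small illustration:

```r
x <- "caf\xe9 cr\xe8me"   # bytes in Latin-1
Encoding(x) <- "latin1"   # declare the encoding before profiling
y <- enc2utf8(x)          # re-encode to UTF-8
Encoding(y)               # "UTF-8"
nchar(y)                  # counted in characters, not bytes
```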
Note that option n specifies all numbers of characters
or bytes to be used in the profiles, and not just the maximal number:
e.g., taking n = 3 will create profiles only containing
tri-grams.
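A quick base-R check of the tri-gram extraction idea, on a single boundary-marked word:

```r
w <- "_fox_"
n <- 3
grams <- substring(w, 1:(nchar(w) - n + 1), n:nchar(w))
grams                 # "_fo" "fox" "ox_"
unique(nchar(grams))  # 3: only tri-grams, no shorter n-grams
```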
Examples:

## Obtain the texts of the standard licenses shipped with R.
files <- dir(file.path(R.home("share"), "licenses"), "^[A-Z]",
             full.names = TRUE)
texts <- sapply(files,
                function(f) paste(readLines(f), collapse = "\n"))
names(texts) <- basename(files)
## Build a profile db using the same method and options as for building
## the ECIMCI character profiles.
profiles <- textcat_profile_db(texts, profiles = ECIMCI_profiles)
## Inspect the 10 most frequent n-grams in each profile.
lapply(profiles, head, 10L)
## Combine into one frequency table.
tab <- as.matrix(profiles)
tab[, 1:10]
## Determine languages.
textcat(profiles, ECIMCI_profiles)