textcat_profile_db {textcat}    R Documentation
Create n-gram profile dbs for text categorization.

Usage:

     textcat_profile_db(x, id = NULL, method = NULL, ..., options = list(),
                        profiles = NULL)
Arguments:

x: a character vector of text documents, or an R object of text
   documents extractable via as.character().

id: a character vector giving the categories of the texts, recycled to
   the length of x.

method: a character string specifying a built-in method, or a
   user-defined function for computing n-gram profiles, or NULL.

...: options to be passed to the method for creating profiles.

options: a list of such options.

profiles: a textcat profile db object.
Details:

The text documents are split according to the given categories, and
n-gram profiles are computed using the specified method.  If profiles
is not NULL, the options used for creating profiles are reused;
otherwise, the options given in ... and options are combined and
merged with the default profile options specified by the textcat
option "profile_options", using exact name matching.  The method and
options employed for building the db are stored in the db as
attributes "method" and "options", respectively.
There is a c() method for combining profile dbs, provided that these
have identical options.  There is also a [ method for subscripting,
and as.matrix() and as.simple_triplet_matrix() methods to "export" the
profiles to a dense matrix or to the sparse simple triplet matrix
representation provided by package slam, respectively.
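As an illustration of these methods, a minimal sketch (assuming the
textcat package is installed; the two example texts are invented):

```r
## Combine, subscript and export profile dbs (sketch; texts invented).
library("textcat")
db1 <- textcat_profile_db(c(en = "The quick brown fox jumps over the lazy dog."))
db2 <- textcat_profile_db(c(de = "Der schnelle braune Fuchs springt."))
db <- c(db1, db2)    # allowed: both dbs were built with identical (default) options
db["en"]             # subscript to the English profile only
m <- as.matrix(db)   # dense matrix of n-gram frequencies, one row per category
dim(m)
```

Combining fails if the two dbs were built with different options, since
the resulting frequencies would not be comparable.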
Currently, the only available built-in method is "textcnt", which has
the following options:
n: A numeric vector giving the numbers of characters or bytes in the
   n-gram profiles.  Default: 1 : 5.

split: The regular expression pattern to be used in word splitting.
   Default: "[[:space:][:punct:][:digit:]]+".

perl: A logical indicating whether to use Perl-compatible regular
   expressions in word splitting.  Default: FALSE.

tolower: A logical indicating whether to transform texts to lower case
   (after word splitting).  Default: TRUE.

reduce: A logical indicating whether to employ a representation of
   n-grams more efficient than the one used by Cavnar and Trenkle.
   Default: TRUE.

useBytes: A logical indicating whether to use byte n-grams rather than
   character n-grams.  Default: FALSE.

ignore: A character vector of n-grams to be ignored when computing
   n-gram profiles.  Default: "_" (corresponding to a word boundary).

size: The maximal number of n-grams used for a profile.
   Default: 1000L.
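These options are passed via ... (or collected in options); a hedged
sketch, assuming the textcat package is installed and using an
invented sample text:

```r
## Pass "textcnt" options when building a profile db (sketch).
library("textcat")
db <- textcat_profile_db(c(sample = "Profiles are built from n-grams."),
                         n = 1:3,      # use uni-, bi- and tri-grams
                         size = 100L)  # keep at most 100 n-grams per profile
## The options employed are stored as an attribute of the db.
attr(db, "options")$n
```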
This method uses textcnt() in package tau for computing n-gram
profiles, with n, split, perl and useBytes corresponding to the
respective textcnt() arguments, and with option reduce setting
argument marker as needed.  N-grams listed in option ignore are
removed, and only the most frequent remaining ones are retained, with
the maximal number given by option size.
Unless the profile db uses bytes rather than characters (i.e., option
useBytes is TRUE), text documents in x containing non-ASCII characters
must declare their encoding (see Encoding()), and will be re-encoded
to UTF-8.
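A sketch of declaring a non-ASCII encoding before profiling (assuming
the textcat package is installed; the string is invented):

```r
## Declare the encoding of a latin1 string so it can be re-encoded to UTF-8.
library("textcat")
x <- "caf\xe9 cr\xe8me"   # latin1 bytes
Encoding(x) <- "latin1"   # declare the encoding (see ?Encoding)
db <- textcat_profile_db(c(fr = x))
```

Without the Encoding() declaration, the non-ASCII bytes could not be
re-encoded correctly.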
Note that option n specifies all numbers of characters or bytes to be
used in the profiles, and not just the maximal number: e.g., taking
n = 3 will create profiles containing only tri-grams.
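For instance (a sketch assuming the textcat package is installed; the
text is invented), a profile built with n = 3 contains nothing but
tri-grams:

```r
## With n = 3, every n-gram in the profile has exactly three characters.
library("textcat")
db3 <- textcat_profile_db(c(x = "abcd efgh"), n = 3)
names(db3[["x"]])   # only three-character n-grams
```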
Examples:

## Obtain the texts of the standard licenses shipped with R.
files <- dir(file.path(R.home("share"), "licenses"), "^[A-Z]",
             full.names = TRUE)
texts <- sapply(files, function(f) paste(readLines(f), collapse = "\n"))
names(texts) <- basename(files)
## Build a profile db using the same method and options as for building
## the ECIMCI character profiles.
profiles <- textcat_profile_db(texts, profiles = ECIMCI_profiles)
## Inspect the 10 most frequent n-grams in each profile.
lapply(profiles, head, 10L)
## Combine into one frequency table.
tab <- as.matrix(profiles)
tab[, 1 : 10]
## Determine languages.
textcat(profiles, ECIMCI_profiles)