p_attribute_encode | R Documentation |
Pure R implementation to generate a positional attribute from a character vector of tokens (the token stream).
p_attribute_encode(
  token_stream,
  p_attribute = "word",
  registry_dir,
  corpus,
  data_dir,
  method = c("R", "CWB"),
  verbose = TRUE,
  encoding = get_encoding(token_stream),
  compress = FALSE
)

p_attribute_recode(
  data_dir,
  p_attribute,
  from = c("UTF-8", "latin1"),
  to = c("UTF-8", "latin1")
)

p_attribute_rename(
  corpus,
  old,
  new,
  registry_dir,
  verbose = TRUE,
  dryrun = FALSE
)
token_stream |
A character vector with the tokens of the corpus. |
p_attribute |
The positional attribute. |
registry_dir |
Registry directory (needed by |
corpus |
The CWB corpus (needed by |
data_dir |
The data directory for the corpus with the binary files. |
method |
Either 'CWB' or 'R'. |
verbose |
Logical. |
encoding |
Encoding as defined in the charset corpus property of the registry file for the corpus ('latin1' to 'latin9', and 'utf8'). |
compress |
Logical. |
from |
Character string describing the current encoding of the attribute. |
to |
Character string describing the target encoding of the attribute. |
old |
A |
new |
A |
dryrun |
A |
Four steps generate the binary CWB corpus data format for positional attributes: First, encode a character vector (the token stream) using p_attribute_encode. Second, create the reverse index using p_attribute_makeall. Third, compress the token stream using p_attribute_huffcode. Fourth, compress the index files using p_attribute_compress_rdx.
The first two steps (p_attribute_encode and p_attribute_makeall) are implemented in pure R (so far). These two steps are enough to use the CQP functionality. To run p_attribute_huffcode and p_attribute_compress_rdx, an installation of the CWB may be necessary.
See the CQP Corpus Encoding Tutorial (https://cwb.sourceforge.io/files/CWB_Encoding_Tutorial.pdf) for an explanation of the procedure (section 3, “Indexing and compression without CWB/Perl”).
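To make the first two steps more concrete, here is a minimal base R sketch of the data they conceptually produce: a lexicon, a stream of token ids, and a reverse index. It is an illustration only and writes no CWB binary files. It assumes ids are zero-based and assigned in order of first occurrence, and all object names (lexicon, ids, reverse_index, freqs) are chosen for this example rather than taken from the package.

```r
# Toy token stream (illustrative data, not from the package)
tokens <- c("oil", "prices", "rise", "as", "oil", "supply", "falls")

# Step 1 (encode): lexicon of unique types, in order of first occurrence
lexicon <- unique(tokens)

# Token stream as zero-based integer ids (assumption: ids start at 0)
ids <- match(tokens, lexicon) - 1L

# Step 2 (reverse index): for each id, the zero-based corpus positions
# at which that type occurs
cpos <- seq_along(tokens) - 1L
reverse_index <- split(cpos, factor(ids, levels = seq_along(lexicon) - 1L))

# Frequency of each lexicon entry
freqs <- lengths(reverse_index)

# Decoding the id stream recovers the original tokens
decoded <- lexicon[ids + 1L]
```

Looking up reverse_index[["0"]] returns every corpus position of the first lexicon entry without scanning the token stream, which is the kind of lookup a query engine needs.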
p_attribute_recode will recode the values in the avs-file and change the attribute value index in the avx file. The rng-file remains unchanged. The registry file remains unchanged, and it is highly recommended to consider s_attribute_recode as a helper for corpus_recode that will recode all s-attributes, all p-attributes, and will reset the encoding in the registry file.
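The following base R sketch suggests why recoding values also requires rewriting an index. It assumes the values are stored back to back as null-terminated strings and that the index holds byte offsets into that file. The helper byte_offsets() is hypothetical and not part of RcppCWB. It uses iconv() to move between latin1 and UTF-8.

```r
# Values with non-ASCII characters, written with escapes so the example
# does not depend on the encoding of the source file
values_utf8 <- c("Gr\u00f6\u00dfe", "Stra\u00dfe", "oil")

# Simulate data that is currently stored as latin1
values_latin1 <- iconv(values_utf8, from = "UTF-8", to = "latin1")

# Recode from latin1 to UTF-8
values_recoded <- iconv(values_latin1, from = "latin1", to = "UTF-8")

# Hypothetical helper: byte offset of each value if all values are
# written one after another, each followed by a null terminator
byte_offsets <- function(x) {
  n <- nchar(x, type = "bytes") + 1L  # +1 for the null terminator
  c(0L, cumsum(n)[-length(n)])
}

offsets_latin1 <- byte_offsets(values_latin1)
offsets_utf8 <- byte_offsets(values_recoded)
```

Because characters outside ASCII take one byte in latin1 but two in UTF-8, the offsets differ between the two encodings, so an offset-based index written for one encoding no longer points at the right bytes after recoding.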
Function p_attribute_rename can be used to rename a positional attribute. Note that the corpus is not refreshed (unloaded, re-loaded), so it may be necessary to restart R for changes to become effective.
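As a rough illustration of what renaming involves on the file-system side, the base R sketch below renames the data files that carry an attribute's name as their prefix. It is not the implementation of p_attribute_rename: the helper rename_attribute_files() is hypothetical, the meaning given to dryrun here (report the planned renames without touching any files) is an assumption, and the registry file is not handled at all.

```r
# Hypothetical helper, not part of RcppCWB: rename files "<old>.*" to "<new>.*"
rename_attribute_files <- function(data_dir, old, new, dryrun = FALSE) {
  pattern <- sprintf("^%s\\.", old)
  old_files <- list.files(data_dir, pattern = pattern)
  new_files <- sub(pattern, paste0(new, "."), old_files)
  if (!dryrun) {
    file.rename(file.path(data_dir, old_files), file.path(data_dir, new_files))
  }
  # Return the rename plan so it can be inspected
  data.frame(old = old_files, new = new_files, stringsAsFactors = FALSE)
}

# Set up a throwaway directory with empty placeholder files
demo_dir <- file.path(tempdir(), "rename_demo")
unlink(demo_dir, recursive = TRUE)
dir.create(demo_dir)
invisible(file.create(
  file.path(demo_dir, c("word.corpus", "word.lexicon", "word.lexicon.idx"))
))

# Dry run: inspect the plan, files stay as they are
plan <- rename_attribute_files(demo_dir, old = "word", new = "token", dryrun = TRUE)
files_after_dryrun <- list.files(demo_dir)

# Actual rename
invisible(rename_attribute_files(demo_dir, old = "word", new = "token"))
files_after_rename <- list.files(demo_dir)
```

Inspecting the plan from a dry run before renaming is a cheap safeguard, since a running session may still hold references to the old file names.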
Christoph Leonhardt, Andreas Blaette
library(RcppCWB)

# In this example, we pursue a "pure R" approach. To rely on the "CWB"
# method, you can use the cwb_install() function, which will download and
# install the CWB command line tools within the package.

tokens <- readLines(
  system.file(package = "RcppCWB", "extdata", "examples", "reuters.txt")
)

# Create new (and empty) directory structure

tmpdir <- normalizePath(tempdir(), winslash = "/")
registry_tmp <- file.path(tmpdir, "registry", fsep = "/")
data_dir_tmp <- file.path(tmpdir, "data_dir", "reuters", fsep = "/")
if (file.exists(file.path(data_dir_tmp, "word.corpus"))){
  file.remove(file.path(data_dir_tmp, "word.corpus"))
}
if (dir.exists(registry_tmp)) unlink(registry_tmp, recursive = TRUE)
if (dir.exists(data_dir_tmp)) unlink(data_dir_tmp, recursive = TRUE)
dir.create(registry_tmp)
dir.create(data_dir_tmp, recursive = TRUE)

# Now encode token stream

p_attribute_encode(
  corpus = "reuters",
  token_stream = tokens,
  p_attribute = "word",
  data_dir = data_dir_tmp,
  method = "R",
  registry_dir = registry_tmp,
  compress = FALSE,
  encoding = "utf8"
)

# Create minimal registry file

regdata <- registry_data(
  id = "REUTERS",
  name = "Reuters Sample Corpus",
  home = data_dir_tmp,
  properties = c(encoding = "utf-8", language = "en"),
  p_attributes = "word"
)

regfile <- registry_file_write(
  data = regdata,
  corpus = "REUTERS",
  registry_dir = registry_tmp,
  data_dir = data_dir_tmp
)

# Reload corpus and run query as a test

if (cqp_is_initialized()) {
  cqp_reset_registry(registry_tmp)
} else {
  cqp_initialize(registry_tmp)
}

cqp_query(corpus = "REUTERS", query = '[]{3} "oil" []{3};')
regions <- cqp_dump_subcorpus(corpus = "REUTERS")
kwic <- apply(
  regions, 1,
  function(region){
    ids <- cl_cpos2id("REUTERS", "word", registry_tmp, cpos = region[1]:region[2])
    words <- cl_id2str(
      corpus = "REUTERS", p_attribute = "word",
      registry = registry_tmp, id = ids
    )
    paste0(words, collapse = " ")
  }
)
kwic[1:10]