s_attribute_encode | R Documentation |
Read, process and write data on structural attributes.
s_attribute_encode(
  values,
  data_dir,
  s_attribute,
  corpus,
  region_matrix,
  method = c("R", "CWB"),
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  encoding,
  delete = FALSE,
  verbose = TRUE
)

s_attribute_recode(
  data_dir,
  s_attribute,
  from = c("UTF-8", "latin1"),
  to = c("UTF-8", "latin1")
)

s_attribute_files(s_attribute, data_dir)

s_attribute_get_values(s_attribute, data_dir)

s_attribute_get_regions(s_attribute, data_dir)

s_attribute_merge(x, y)

s_attribute_delete(corpus, s_attribute)

s_attribute_rename(corpus, old, new, registry_dir, verbose = TRUE)
values |
A character vector with the values of the structural attribute. |
data_dir |
The data directory where to write the files. |
s_attribute |
Atomic character vector, the name of the structural attribute. |
corpus |
A CWB corpus. |
region_matrix |
A two-column matrix with corpus positions (left positions in the first column, right positions in the second). |
method |
Either 'R' or 'CWB'. |
registry_dir |
Path name of the registry directory. |
encoding |
Encoding of the data. |
delete |
Logical, whether a call to RcppCWB::cl_delete_corpus is performed. |
verbose |
Logical. |
from |
Character string describing the current encoding of the attribute. |
to |
Character string describing the target encoding of the attribute. |
x |
Data defining a first s-attribute, a table (for instance a data.frame) with the columns cpos_left, cpos_right and value. |
y |
Data defining a second s-attribute, a table (for instance a data.frame) with the columns cpos_left, cpos_right and value. |
old |
A character string, the current name of the structural attribute. |
new |
A character string, the new name of the structural attribute. |
In addition to using CWB functionality, the s_attribute_encode
function includes a pure R implementation to add or modify structural attributes
of an existing CWB corpus.
If the corpus has been loaded or used before, a new s-attribute may not be
available until RcppCWB::cl_delete_corpus has been called. Set the argument
delete to TRUE to have s_attribute_encode call this function.
s_attribute_recode will recode the values in the avs file and update the
attribute value index in the avx file. The rng file remains unchanged. The
registry file also remains unchanged, so it is highly recommended to treat
s_attribute_recode as a helper for corpus_recode, which recodes all
s-attributes and all p-attributes and resets the encoding in the registry file.
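The following base R sketch illustrates why recoding affects the avx file but not the rng file. It assumes that the avs file stores null-terminated strings and that the avx file records byte offsets into it. The values and the helper offsets() are made up for illustration, and this is not the package's implementation.

```r
# Hypothetical attribute values; \u escapes yield UTF-8 strings.
values_utf8 <- c("Z\u00fcrich", "K\u00f6ln", "Bonn")
values_latin1 <- iconv(values_utf8, from = "UTF-8", to = "latin1")

# Byte offset of each value in a file of null-terminated strings
# (assumed layout of the avs file).
offsets <- function(x) {
  bytes <- nchar(x, type = "bytes") + 1L  # +1 for the terminating null byte
  c(0L, cumsum(bytes)[-length(bytes)])
}

offsets_latin1 <- offsets(values_latin1)  # 0, 7, 12
offsets_utf8 <- offsets(values_utf8)      # 0, 8, 14
```

The umlauts take one byte in latin1 and two bytes in UTF-8, so the offsets shift after recoding, while the corpus positions of the regions are unaffected.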
s_attribute_files will return a named character vector with the data files
(extensions: "avs", "avx", "rng") in the directory indicated by data_dir for
the structural attribute s_attribute.
s_attribute_get_values is equivalent to calling the CL function cl_struc2str
for all strucs of a structural attribute. It is a "pure R" operation that is
faster than using CL, as it processes the entire files for the s-attribute
directly. The return value is a character vector with all string values for
the s-attribute.
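As a rough sketch of what reading the files directly can look like, the code below writes toy avs and avx files and reads them back with readBin(). It assumes the avs file holds null-terminated strings and the avx file holds pairs of big-endian 4-byte integers (struc number, offset into the avs file). The attribute name "foo" and all data are hypothetical, and the package's actual code may differ.

```r
# Toy data in a temporary directory (made up for illustration).
data_dir <- tempfile()
dir.create(data_dir)
values <- c("a1", "b22", "c333")
avs_file <- file.path(data_dir, "foo.avs")
avx_file <- file.path(data_dir, "foo.avx")

# avs: writeBin() writes each string followed by a null byte.
writeBin(values, avs_file)

# avx: interleaved pairs (struc number, byte offset into avs), big-endian.
offsets <- c(0L, cumsum(nchar(values, type = "bytes") + 1L)[-length(values)])
avx <- as.vector(rbind(seq_along(values) - 1L, offsets))
writeBin(avx, avx_file, size = 4L, endian = "big")

# Read both files in one go each.
avx_in <- readBin(
  avx_file, what = integer(), size = 4L,
  n = file.info(avx_file)$size / 4, endian = "big"
)
avx_matrix <- matrix(avx_in, ncol = 2L, byrow = TRUE)
avs_in <- readBin(avs_file, what = character(), n = file.info(avs_file)$size)

# Map the offset of every struc to the string starting at that offset.
avs_offsets <- c(0L, cumsum(nchar(avs_in, type = "bytes") + 1L)[-length(avs_in)])
result <- avs_in[match(avx_matrix[, 2], avs_offsets)]
```

Matching by offset rather than by position keeps the sketch valid even if a file stores a repeated value only once.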
s_attribute_get_regions will return a two-column integer matrix with regions
for the strucs of a given s-attribute. Left corpus positions are in the first
column, right corpus positions in the second column. The result is equivalent
to calling RcppCWB::get_region_matrix for all strucs of an s-attribute, but may
be somewhat faster: it is a "pure R" function that reads the entire file
directly.
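A minimal sketch of this kind of direct file access, assuming the rng file is a flat sequence of big-endian 4-byte integers in which each pair holds the left and right corpus position of one region. The file and the regions are made up, and the sketch is not taken from the package.

```r
# Write a toy rng file with three hypothetical regions.
rng_file <- tempfile(fileext = ".rng")
regions <- matrix(c(0L, 4L, 5L, 9L, 10L, 12L), ncol = 2L, byrow = TRUE)
writeBin(as.vector(t(regions)), rng_file, size = 4L, endian = "big")

# Read the whole file at once and reshape it into a two-column matrix.
n_int <- file.info(rng_file)$size / 4
rng <- readBin(rng_file, what = integer(), size = 4L, n = n_int, endian = "big")
region_matrix <- matrix(rng, ncol = 2L, byrow = TRUE)
```

Reading the file in a single readBin() call avoids one function call per struc, which is where the speed advantage over a struc-by-struc lookup would come from.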
s_attribute_merge combines two tables with regions for s-attributes, checking
for intersections that may cause problems. The heuristic is to keep all
non-intersecting annotations and those annotations that define the same region
in object x and object y. Annotations of x and y which overlap uncleanly, i.e.
without identical left and right corpus positions ("cpos_left" /
"cpos_right"), are dropped. The scenario for using the function is to decode
an s-attribute (using s_attribute_decode), mix in an additional annotation,
and re-encode the enhanced s-attribute (using s_attribute_encode).
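The heuristic can be approximated in base R as follows. This is only a sketch under stated assumptions: merge_regions() is a hypothetical helper, not the package's function, and for regions that are identical in both tables it keeps the row from x, which may differ from what s_attribute_merge actually returns.

```r
# Hypothetical helper sketching the merge heuristic described above.
merge_regions <- function(x, y) {
  ix <- seq_len(nrow(x))
  iy <- seq_len(nrow(y))
  # TRUE where region i of x and region j of y share a corpus position.
  overlaps <- outer(ix, iy, function(i, j) {
    x$cpos_left[i] <= y$cpos_right[j] & y$cpos_left[j] <= x$cpos_right[i]
  })
  # TRUE where both regions have the same left and right boundary.
  same <- outer(ix, iy, function(i, j) {
    x$cpos_left[i] == y$cpos_left[j] & x$cpos_right[i] == y$cpos_right[j]
  })
  unclean <- overlaps & !same
  keep_x <- rowSums(unclean) == 0
  # Assumption: rows of y that duplicate a region of x are not repeated.
  keep_y <- colSums(unclean) == 0 & colSums(same) == 0
  merged <- rbind(x[keep_x, ], y[keep_y, ])
  merged <- merged[order(merged$cpos_left), ]
  rownames(merged) <- NULL
  merged
}

x <- data.frame(
  cpos_left = c(1L, 5L, 10L, 20L, 25L),
  cpos_right = c(2L, 5L, 12L, 21L, 27L),
  value = c("ORG", "LOC", "ORG", "PERS", "ORG"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  cpos_left = c(5L, 11L, 20L, 25L, 30L),
  cpos_right = c(5L, 12L, 22L, 27L, 33L),
  value = c("LOC", "ORG", "ORG", "ORG", "ORG"),
  stringsAsFactors = FALSE
)
merged <- merge_regions(x, y)
```

With these inputs the regions 10-12 / 11-12 and 20-21 / 20-22 overlap uncleanly and are dropped from both tables, leaving the regions starting at corpus positions 1, 5, 25 and 30.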
Function s_attribute_delete is not yet implemented.
Function s_attribute_rename can be used to rename a structural attribute.
To decode a structural attribute, see s_attribute_decode.
library(RcppCWB)
library(cwbtools) # provides corpus_copy() and the s_attribute_*() functions

# Work on a temporary copy of the REUTERS corpus shipped with RcppCWB
registry_tmp <- file.path(
  normalizePath(tempdir(), winslash = "/"),
  "cwb", "registry", fsep = "/"
)
data_dir_tmp <- file.path(
  normalizePath(tempdir(), winslash = "/"),
  "cwb", "indexed_corpora", "reuters", fsep = "/"
)

corpus_copy(
  corpus = "REUTERS",
  registry_dir = system.file(package = "RcppCWB", "extdata", "cwb", "registry"),
  data_dir = system.file(
    package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters"
  ),
  registry_dir_new = registry_tmp,
  data_dir_new = data_dir_tmp
)

# Get the regions of the existing s-attribute "id"
no_strucs <- cl_attribute_size(
  corpus = "REUTERS",
  attribute = "id", attribute_type = "s",
  registry = registry_tmp
)
cpos_list <- lapply(
  0L:(no_strucs - 1L),
  function(i)
    cl_struc2cpos(
      corpus = "REUTERS", struc = i, s_attribute = "id",
      registry = registry_tmp
    )
)
cpos_matrix <- do.call(rbind, cpos_list)

# Encode a new s-attribute "foo" for the same regions
s_attribute_encode(
  values = as.character(1L:nrow(cpos_matrix)),
  data_dir = data_dir_tmp,
  s_attribute = "foo",
  corpus = "REUTERS",
  region_matrix = cpos_matrix,
  method = "R",
  registry_dir = registry_tmp,
  encoding = "latin1",
  verbose = TRUE,
  delete = TRUE
)

# Check that the values of the new s-attribute are available
cl_struc2str(
  "REUTERS", struc = 0L:(nrow(cpos_matrix) - 1L), s_attribute = "foo",
  registry = registry_tmp
)

# Clean up
unlink(registry_tmp, recursive = TRUE)
unlink(data_dir_tmp, recursive = TRUE)

# Read values and regions directly from the data files
avs <- s_attribute_get_values(
  s_attribute = "id",
  data_dir = system.file(
    package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters"
  )
)
rng <- s_attribute_get_regions(
  s_attribute = "id",
  data_dir = system.file(
    package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters"
  )
)

# Merge two tables with annotations (corpus positions are integers)
x <- data.frame(
  cpos_left = c(1L, 5L, 10L, 20L, 25L),
  cpos_right = c(2L, 5L, 12L, 21L, 27L),
  value = c("ORG", "LOC", "ORG", "PERS", "ORG"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  cpos_left = c(5L, 11L, 20L, 25L, 30L),
  cpos_right = c(5L, 12L, 22L, 27L, 33L),
  value = c("LOC", "ORG", "ORG", "ORG", "ORG"),
  stringsAsFactors = FALSE
)
s_attribute_merge(x, y)