s_attribute_encode | R Documentation |
Read, process and write data on structural attributes.
s_attribute_encode(
values,
data_dir,
s_attribute,
corpus,
region_matrix,
method = c("R", "CWB"),
registry_dir = Sys.getenv("CORPUS_REGISTRY"),
encoding,
delete = FALSE,
verbose = TRUE
)
s_attribute_recode(
data_dir,
s_attribute,
from = c("UTF-8", "latin1"),
to = c("UTF-8", "latin1")
)
s_attribute_files(s_attribute, data_dir)
s_attribute_get_values(s_attribute, data_dir)
s_attribute_get_regions(s_attribute, data_dir)
s_attribute_merge(x, y)
s_attribute_delete(corpus, s_attribute)
s_attribute_rename(corpus, old, new, registry_dir, verbose = TRUE)
values |
A vector with the values of the structural attribute, one value per region. |
data_dir |
The data directory where to write the files. |
s_attribute |
Name of the structural attribute, an atomic character vector. |
corpus |
A CWB corpus. |
region_matrix |
A two-column matrix with the left and right corpus positions of the regions. |
method |
Either 'R' or 'CWB'. |
registry_dir |
Path name of the registry directory. |
encoding |
Encoding of the data. |
delete |
Logical, whether to call RcppCWB::cl_delete_corpus(). |
verbose |
Logical. |
from |
Character string describing the current encoding of the attribute. |
to |
Character string describing the target encoding of the attribute. |
x |
Data defining a first s-attribute, a table (such as a data.frame) with the columns "cpos_left", "cpos_right" and "value". |
y |
Data defining a second s-attribute, a table (such as a data.frame) with the columns "cpos_left", "cpos_right" and "value". |
old |
A character string, the old name of the structural attribute. |
new |
A character string, the new name of the structural attribute. |
s_attribute_encode() provides a 'pure R' implementation for adding or
modifying structural attributes of an existing CWB corpus.
If the corpus has been loaded or used before, a new s-attribute may not be
available until RcppCWB::cl_delete_corpus() has been called. Set the argument
delete to TRUE to have s_attribute_encode() call this function.
s_attribute_recode() will recode the values in the avs file and update the
attribute value index in the avx file. The rng file remains unchanged. The
registry file also remains unchanged, so it is highly recommended to treat
s_attribute_recode() as a helper for corpus_recode(), which will recode all
s-attributes and all p-attributes and reset the encoding in the registry file.
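Why recoding the values also requires a new index can be illustrated with byte lengths. The following is a minimal base R sketch, assuming that the avx file stores byte offsets into the avs file; the sample value is made up, and iconv() merely stands in for whatever conversion the package performs internally.

```r
# "Müller" as latin1 bytes (made-up sample value, not taken from a corpus)
value_latin1 <- rawToChar(as.raw(c(0x4d, 0xfc, 0x6c, 0x6c, 0x65, 0x72)))
Encoding(value_latin1) <- "latin1"

# Convert the value to UTF-8
value_utf8 <- iconv(value_latin1, from = "latin1", to = "UTF-8")

# The umlaut needs one byte in latin1 but two bytes in UTF-8
bytes_latin1 <- nchar(value_latin1, type = "bytes")
bytes_utf8 <- nchar(value_utf8, type = "bytes")
print(c(latin1 = bytes_latin1, utf8 = bytes_utf8))
```

Because the recoded string is one byte longer, every value stored after it would start at a shifted byte position, which is consistent with the avx index being rewritten.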
s_attribute_files() will return a named character vector with the data files
(extensions: "avs", "avx", "rng") in the directory indicated by data_dir for
the structural attribute s_attribute.
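The shape of such a return value can be sketched in base R. This is an illustration only: it assumes the usual CWB naming scheme of attribute name plus file extension, and the directory used here is a hypothetical placeholder rather than a real corpus.

```r
# Hypothetical inputs for illustration
s_attribute <- "id"
data_dir <- file.path(tempdir(), "reuters")

# One file per extension, named by the extension
exts <- c("avs", "avx", "rng")
files <- setNames(
  file.path(data_dir, paste(s_attribute, exts, sep = ".")),
  exts
)
print(files)
```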
s_attribute_get_values() is equivalent to calling the CL function
cl_struc2str for all strucs of a structural attribute. It is a "pure R"
operation that is faster than using CL, as it processes the entire files for
the s-attribute directly. The return value is a character vector with all
string values for the s-attribute.
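What processing the files directly can look like is sketched below with toy files. The sketch assumes the CWB layout in which the avs file holds nul-terminated strings and the avx file holds pairs of big-endian 4-byte integers (struc number, byte offset into the avs file); it is not the package's actual code.

```r
avs_file <- tempfile(fileext = ".avs")
avx_file <- tempfile(fileext = ".avx")

# Toy data: three strucs, two distinct values ("ORG" is used twice).
# writeBin() writes character strings with a terminating nul byte.
writeBin(c("ORG", "LOC"), avs_file)
# Pairs of (struc number, byte offset into the avs file)
writeBin(c(0L, 0L, 1L, 4L, 2L, 0L), avx_file, size = 4L, endian = "big")

# Read both files in one go
avs <- readBin(avs_file, what = character(), n = file.info(avs_file)$size)
avx <- readBin(
  avx_file, what = integer(), size = 4L,
  n = file.info(avx_file)$size / 4L, endian = "big"
)

# Every second integer in the avx file is an offset
offsets <- avx[c(FALSE, TRUE)]
# Byte offset at which each distinct string starts (length plus nul byte)
string_offsets <- cumsum(c(0L, head(nchar(avs, type = "bytes") + 1L, -1L)))
struc_values <- avs[match(offsets, string_offsets)]
print(struc_values)
```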
s_attribute_get_regions() will return a two-column integer matrix with the
regions for the strucs of a given s-attribute. Left corpus positions are in
the first column, right corpus positions in the second column. The result is
equivalent to calling RcppCWB::get_region_matrix() for all strucs of an
s-attribute, but may be somewhat faster: it is a "pure R" function that reads
the entire file directly.
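Reading regions from a file can be sketched as follows. The example writes a toy rng file first and assumes that such a file is a flat sequence of big-endian 4-byte integers in which left and right corpus positions alternate; the values are made up.

```r
rng_file <- tempfile(fileext = ".rng")

# Toy regions: 0-4, 5-9 and 10-12
writeBin(c(0L, 4L, 5L, 9L, 10L, 12L), rng_file, size = 4L, endian = "big")

# Read all integers at once and fold them into a two-column matrix
rng <- readBin(
  rng_file, what = integer(), size = 4L,
  n = file.info(rng_file)$size / 4L, endian = "big"
)
regions <- matrix(rng, ncol = 2L, byrow = TRUE)
print(regions)
```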
s_attribute_merge() combines two tables with regions for s-attributes,
checking for intersections that may cause problems. The heuristic is to keep
all non-intersecting annotations as well as those annotations that define the
same region in object x and object y. Annotations of x and y that overlap
uncleanly, i.e. without identical left and right corpus positions
("cpos_left" / "cpos_right"), are dropped. The scenario for using the
function is to decode an s-attribute (using s_attribute_decode()), mix in an
additional annotation, and re-encode the enhanced s-attribute (using
s_attribute_encode()).
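One possible reading of this heuristic is sketched below in base R. The helper merge_regions_sketch() is hypothetical and not part of the package; it assumes data.frame input with the columns "cpos_left", "cpos_right" and "value", and it does not resolve the case of identical regions that carry different values.

```r
# Hypothetical helper, not part of cwbtools
merge_regions_sketch <- function(x, y) {
  # TRUE for rows of `a` that intersect a region of `b` without being identical
  unclean <- function(a, b) {
    vapply(seq_len(nrow(a)), function(i) {
      intersects <- a$cpos_left[i] <= b$cpos_right & a$cpos_right[i] >= b$cpos_left
      same_region <- a$cpos_left[i] == b$cpos_left & a$cpos_right[i] == b$cpos_right
      any(intersects & !same_region)
    }, logical(1))
  }
  # Drop unclean overlaps on both sides, then remove duplicate rows
  merged <- unique(rbind(
    x[!unclean(x, y), , drop = FALSE],
    y[!unclean(y, x), , drop = FALSE]
  ))
  merged <- merged[order(merged$cpos_left), , drop = FALSE]
  rownames(merged) <- NULL
  merged
}

x <- data.frame(
  cpos_left = c(1L, 5L, 10L, 20L, 25L),
  cpos_right = c(2L, 5L, 12L, 21L, 27L),
  value = c("ORG", "LOC", "ORG", "PERS", "ORG"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  cpos_left = c(5L, 11L, 20L, 25L, 30L),
  cpos_right = c(5L, 12L, 22L, 27L, 33L),
  value = c("LOC", "ORG", "ORG", "ORG", "ORG"),
  stringsAsFactors = FALSE
)
merged <- merge_regions_sketch(x, y)
print(merged)
```

In this sketch the regions 10-12 / 11-12 and 20-21 / 20-22 overlap uncleanly and are dropped from both tables, while the regions 5-5 and 25-27 occur in both tables and are kept once.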
Function s_attribute_delete() is not yet implemented.
Function s_attribute_rename() can be used to rename a structural attribute.
To decode a structural attribute, see s_attribute_decode.
require("RcppCWB")
registry_tmp <- fs::path(tempdir(), "cwb", "registry")
data_dir_tmp <- fs::path(tempdir(), "cwb", "indexed_corpora", "reuters")
cwb_dir_rcppcwb <- system.file(package = "RcppCWB", "extdata", "cwb")
registry_dir_rcppcwb <- fs::path(cwb_dir_rcppcwb, "registry")
data_dir_rcppcwb <- fs::path(cwb_dir_rcppcwb, "indexed_corpora", "reuters")

# Work on a temporary copy of the REUTERS corpus
corpus_copy(
  corpus = "REUTERS",
  registry_dir = registry_dir_rcppcwb,
  data_dir = data_dir_rcppcwb,
  registry_dir_new = registry_tmp,
  data_dir_new = data_dir_tmp
)

# Get the regions of the existing s-attribute "id"
no_strucs <- cl_attribute_size(
  corpus = "REUTERS",
  attribute = "id",
  attribute_type = "s",
  registry = registry_tmp
)
cpos_matrix <- get_region_matrix(
  corpus = "REUTERS",
  struc = 0L:(no_strucs - 1L),
  s_attribute = "id",
  registry = registry_tmp
)

# Encode a new s-attribute "article_id" for the same regions
s_attribute_encode(
  values = 1L:nrow(cpos_matrix),
  data_dir = data_dir_tmp,
  s_attribute = "article_id",
  corpus = "REUTERS",
  region_matrix = cpos_matrix,
  method = "R",
  registry_dir = registry_tmp,
  encoding = "latin1",
  verbose = TRUE,
  delete = TRUE
)

# Check that the new values are available, then clean up
cl_struc2str(
  "REUTERS",
  struc = 0L:(nrow(cpos_matrix) - 1L),
  s_attribute = "article_id",
  registry = registry_tmp
)
unlink(registry_tmp, recursive = TRUE)
unlink(data_dir_tmp, recursive = TRUE)

data_dir <- system.file(
  package = "RcppCWB",
  "extdata",
  "cwb",
  "indexed_corpora",
  "reuters"
)
avs <- s_attribute_get_values(s_attribute = "id", data_dir = data_dir)
rng <- s_attribute_get_regions(s_attribute = "id", data_dir = data_dir)

x <- data.frame(
  cpos_left = c(1L, 5L, 10L, 20L, 25L),
  cpos_right = c(2L, 5L, 12L, 21L, 27L),
  value = c("ORG", "LOC", "ORG", "PERS", "ORG"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  cpos_left = c(5L, 11L, 20L, 25L, 30L),
  cpos_right = c(5L, 12L, 22L, 27L, 33L),
  value = c("LOC", "ORG", "ORG", "ORG", "ORG"),
  stringsAsFactors = FALSE
)
s_attribute_merge(x, y)