p_attribute_encode: Encode Positional Attribute(s).

View source: R/p_attribute.R

p_attribute_encode R Documentation

Description

Pure R implementation to generate positional attribute from a character vector of tokens (the token stream).

Usage

p_attribute_encode(
  token_stream,
  p_attribute = "word",
  registry_dir,
  corpus,
  data_dir,
  method = c("R", "CWB"),
  verbose = TRUE,
  encoding = get_encoding(token_stream),
  compress = FALSE
)

p_attribute_recode(
  data_dir,
  p_attribute,
  from = c("UTF-8", "latin1"),
  to = c("UTF-8", "latin1")
)

p_attribute_rename(
  corpus,
  old,
  new,
  registry_dir,
  verbose = TRUE,
  dryrun = FALSE
)

Arguments

token_stream

A character vector with the tokens of the corpus. The maximum length is 2 147 483 647 (2^31 - 1); a warning is issued if this threshold is exceeded. See the CWB Encoding Tutorial for size limitations of corpora. May also be a file.

p_attribute

The positional attribute. May be more than one, if method is "CWB". If method is "R", only one positional attribute may be supplied.

registry_dir

Registry directory (needed by p_attribute_huffcode() and p_attribute_compress_rdx()).

corpus

The CWB corpus (needed by p_attribute_huffcode() and p_attribute_compress_rdx()).

data_dir

The data directory for the corpus with the binary files.

method

Either 'CWB' or 'R'.

verbose

A logical value.

encoding

Encoding as defined in the charset corpus property of the registry file for the corpus ('latin1' to 'latin9', and 'utf8').

compress

A logical value.

from

Character string describing the current encoding of the attribute.

to

Character string describing the target encoding of the attribute.

old

A character vector with p-attributes to be renamed.

new

A character vector with new names of p-attributes. The vector needs to have the same length as vector old.

dryrun

A logical value. If TRUE, the actual renaming operation is suppressed, so that the output messages can be inspected first.
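The size limit stated for token_stream can be checked before encoding. The following is a minimal sketch in base R, not code from the package: the helper name exceeds_cwb_limit() is hypothetical, and it relies only on the fact that 2^31 - 1 equals .Machine$integer.max in R.

```r
# Hypothetical helper (not part of cwbtools): TRUE if a token stream of
# length `n` is longer than the limit of 2^31 - 1 tokens stated above.
exceeds_cwb_limit <- function(n) {
  n > .Machine$integer.max
}

tokens <- c("Crude", "oil", "prices", "rose", ".")
exceeds_cwb_limit(length(tokens))   # FALSE for this tiny example
exceeds_cwb_limit(2^31)             # TRUE: one token beyond the limit
```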

Details

Four steps generate the binary CWB corpus data format for positional attributes: First, encode a character vector (the token stream) using p_attribute_encode(). Second, create the reverse index using p_attribute_makeall(). Third, compress the token stream using p_attribute_huffcode(). Fourth, compress the index files using p_attribute_compress_rdx().

The first two steps (p_attribute_encode() and p_attribute_makeall()) are implemented in pure R (so far). These two steps are sufficient for using the CQP functionality. To run p_attribute_huffcode() and p_attribute_compress_rdx(), an installation of the CWB may be necessary.

See the CWB Corpus Encoding Tutorial (https://cwb.sourceforge.io/files/CWB_Encoding_Tutorial.pdf) for an explanation of the procedure (section 3, “Indexing and compression without CWB/Perl”).
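What the first two steps compute can be illustrated conceptually in base R. This is only a sketch with made-up sample tokens: it mimics the logic of encoding and reverse indexing in memory, it does not write CWB binary files, and it is not the code used by the package.

```r
# Conceptual illustration only: no CWB binary files are written here.
tokens <- c("oil", "prices", "rose", "as", "oil", "exports", "fell")

# Step 1 (encode): build a lexicon of unique types and replace every token
# by the id of its type. Ids and corpus positions are zero-based in the CWB.
lexicon <- unique(tokens)
ids <- match(tokens, lexicon) - 1L

# Step 2 (reverse index): for every id, collect the corpus positions at
# which it occurs, and count its frequency.
cpos <- seq_along(tokens) - 1L
reverse_index <- split(cpos, ids)
frequencies <- lengths(reverse_index)

ids                      # 0 1 2 3 0 4 5
reverse_index[["0"]]     # 0 4: the positions of "oil"
lexicon[ids + 1L]        # decoding the ids restores the token stream
```

The reverse index is what makes lookups by token fast: all positions of a type are available without scanning the token stream.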

p_attribute_recode will recode the values in the avs-file and change the attribute value index in the avx file. The rng-file remains unchanged. The registry file remains unchanged as well. It is highly recommended to consider s_attribute_recode as a helper for corpus_recode, which will recode all s-attributes and all p-attributes and will reset the encoding in the registry file.
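The effect of recoding string values can be illustrated in base R. The sketch below assumes iconv() as the conversion routine and uses made-up sample values; whether p_attribute_recode relies on iconv() internally is not stated here. It shows that the byte length of strings changes when moving from latin1 to UTF-8, which is a plausible reason why an index over the values has to be updated along with the values.

```r
# Made-up sample values in latin1 ("Öl" and "für"), built from raw bytes so
# that the example does not depend on the encoding of this script.
values_latin1 <- c(
  rawToChar(as.raw(c(0xd6, 0x6c))),
  rawToChar(as.raw(c(0x66, 0xfc, 0x72)))
)
Encoding(values_latin1) <- "latin1"

# Recode from latin1 to UTF-8. The non-ASCII characters used here occupy
# one byte in latin1 but two bytes in UTF-8.
values_utf8 <- iconv(values_latin1, from = "latin1", to = "UTF-8")

nchar(values_latin1, type = "bytes")  # 2 3
nchar(values_utf8, type = "bytes")    # 3 4
```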

Function p_attribute_rename can be used to rename a positional attribute. Note that the corpus is not refreshed (unloaded, re-loaded), so it may be necessary to restart R for changes to become effective.

Author(s)

Christoph Leonhardt, Andreas Blaette

Examples

library(RcppCWB)

# In this example, we pursue a "pure R" approach. To rely on the "CWB"
# method, you can use the cwb_install() function, which will download and
# install the CWB command line tools within the package.

tokens <- readLines(system.file(package = "RcppCWB", "extdata", "examples", "reuters.txt"))

# Create new (and empty) directory structure

tmpdir <- normalizePath(tempdir(), winslash = "/")
registry_tmp <- fs::path(tmpdir, "registry")
data_dir_tmp <- fs::path(tmpdir, "data_dir", "reuters")
if (file.exists(fs::path(data_dir_tmp, "word.corpus"))){
  file.remove(fs::path(data_dir_tmp, "word.corpus"))
}
if (dir.exists(registry_tmp)) unlink(registry_tmp, recursive = TRUE)
if (dir.exists(data_dir_tmp)) unlink(data_dir_tmp, recursive = TRUE)
dir.create(registry_tmp)
dir.create(data_dir_tmp, recursive = TRUE)

# Now encode token stream

p_attribute_encode(
  corpus = "reuters",
  token_stream = tokens, p_attribute = "word",
  data_dir = data_dir_tmp, method = "R",
  registry_dir = registry_tmp,
  compress = FALSE,
  encoding = "utf8"
  )

# Create minimal registry file

regdata <- registry_data(
  id = "REUTERS", name = "Reuters Sample Corpus", home = data_dir_tmp,
  properties = c(encoding = "utf-8", language = "en"), p_attributes = "word"
)

regfile <- registry_file_write(
  data = regdata, corpus = "REUTERS",
  registry_dir = registry_tmp, data_dir = data_dir_tmp
)

# Reload corpus and run query as a test

if (cqp_is_initialized()) cqp_reset_registry(registry_tmp) else cqp_initialize(registry_tmp)

cqp_query(corpus = "REUTERS", query = '[]{3} "oil" []{3};')
regions <- cqp_dump_subcorpus(corpus = "REUTERS")
kwic <- apply(
  regions, 1,
  function(region){
    ids <- cl_cpos2id("REUTERS", "word", registry_tmp, cpos = region[1]:region[2])
    words <- cl_id2str(corpus = "REUTERS", p_attribute = "word", registry = registry_tmp, id = ids)
    paste0(words, collapse = " ")
  }
)
kwic[1:10]

cwbtools documentation built on Nov. 27, 2023, 5:11 p.m.