decode: Decode corpus or subcorpus.

decode-methodR Documentation

Decode corpus or subcorpus.

Description

Decode corpus or subcorpus and return class specified by argument to.

Usage

decode(.Object, ...)

## S4 method for signature 'corpus'
decode(
  .Object,
  to = c("data.table", "Annotation", "AnnotatedPlainTextDocument"),
  p_attributes = NULL,
  s_attributes = NULL,
  mw = NULL,
  stoplist = NULL,
  decode = TRUE,
  verbose = TRUE
)

## S4 method for signature 'character'
decode(
  .Object,
  to = c("data.table", "Annotation"),
  s_attributes = NULL,
  p_attributes = NULL,
  decode = TRUE,
  verbose = TRUE
)

## S4 method for signature 'slice'
decode(
  .Object,
  to = c("data.table", "Annotation", "AnnotatedPlainTextDocument"),
  s_attributes = NULL,
  p_attributes = NULL,
  mw = NULL,
  stoplist = NULL,
  decode = TRUE,
  verbose = TRUE
)

## S4 method for signature 'partition'
decode(
  .Object,
  to = "data.table",
  s_attributes = NULL,
  p_attributes = NULL,
  decode = TRUE,
  verbose = TRUE
)

## S4 method for signature 'subcorpus'
decode(
  .Object,
  to = c("data.table", "Annotation", "AnnotatedPlainTextDocument"),
  s_attributes = NULL,
  p_attributes = NULL,
  mw = NULL,
  stoplist = NULL,
  decode = TRUE,
  verbose = TRUE
)

## S4 method for signature 'integer'
decode(.Object, corpus, p_attributes, boost = NULL)

## S4 method for signature 'data.table'
decode(.Object, corpus, p_attributes)

Arguments

.Object

The corpus or subcorpus to decode.

...

Further arguments.

to

The class of the returned object, stated as a length-one character vector.

p_attributes

The positional attributes to decode. If NULL (default), all positional attributes will be decoded.

s_attributes

The structural attributes to decode. If NULL (default), all structural attributes will be decoded.

mw

A character vector with s-attributes with multiword expressions. Used only if to is 'AnnotatedPlainTextDocument'.

stoplist

A character vector with terms to be dropped.

decode

A logical value, whether to decode token ids and struc ids to character strings. If FALSE, the values of columns for p- and s-attributes will be integer vectors. If TRUE (default), the respective columns are character vectors.

verbose

A logical value, whether to output progess messages.

corpus

A CWB indexed corpus, either a length-one character vector, or a corpus object.

boost

A length-one logical value, whether to speed up decoding a long vector of token ids by directly by reading in the lexion file from the data directory of a corpus. If NULL (default), the internal decision rule is that boost will be TRUE if the corpus is larger than 10 000 000 million tokens and more than 5 percent of the corpus are to be decoded.

Details

The primary purpose of the method is type conversion. By obtaining the corpus or subcorpus in the format specified by the argument to, the data can be processed with tools that do not rely on the Corpus Workbench (CWB). Supported output formats are data.table (which can be converted to a data.frame or tibble easily) or an Annotation object as defined in the package 'NLP'. Another purpose of decoding the corpus can be to rework it, and to re-import it into the CWB (e.g. using the 'cwbtools'-package).

An earlier version of the method included an option to decode a single s-attribute, which is not supported any more. See the s_attribute_decode() function of the package RcppCWB.

If .Object is an integer vector, it is assumed to be a vector of integer ids of p-attributes. The decode-method will translate token ids to string values as efficiently as possible. The approach taken will depend on the corpus size and the share of the corpus that is to be decoded. To decode a large number of integer ids, it is more efficient to read the lexicon file from the data directory directly and to index the lexicon with the ids rather than relying on RcppCWB::cl_id2str. The internal decision rule is to use the lexicon file when the corpus is larger than 10 000 000 million tokens and more than 5 percent of the corpus are to be decoded. The encoding of the character vector that is returned will be the coding of the locale (usually ISO-8859-1 on Windows, and UTF-8 on macOS and Linux machines).

The decode-method for data.table objects will decode token ids (column 'p-attribute_id'), adding the corresponding string as a new column. If a column "cpos" with corpus positions is present, ids are derived for the corpus positions given first. If the data.table neither has a column "cpos" nor columns with token ids (i.e. colummn name ending with "_id"), the input data.table is returned unchanged. Note that columns are added to the data.table in an in-place operation to handle memory parsimoniously.

Value

The return value will correspond to the class specified by argument to.

See Also

To decode a structural attribute, you can use the s_attributes-method, setting argument unique as FALSE and s_attribute_decode. See as.VCorpus to decode a partition_bundle object, returning a VCorpus object.

Examples

use("polmineR")
use(pkg = "RcppCWB", corpus = "REUTERS")

# Decode corpus as data.table
dt <- decode("REUTERS", to = "data.table")

# Decode corpus selectively
dt <- decode(
  "REUTERS",
  to = "data.table",
  p_attributes = "word",
  s_attributes = "id"
)

# Decode a subcorpus
dt <- corpus("REUTERS") %>%
  subset(id %in% c("127", "144")) %>%
  decode(s_attributes = "id", to = "data.table")

# Decode partition
dt <- partition("REUTERS", places = "kuwait", regex = TRUE) %>%
  decode(to = "data.table")

# Previous versions of polmineR offered an option to decode a single
# s-attribute. This is how you could proceed to get a table with metadata.
dt <- partition("REUTERS", places = "kuwait", regex = TRUE) %>% 
  decode(s_attribute = "id", decode = FALSE, to = "data.table")
dt[, "word" := NULL]
dt[,{list(cpos_left = min(.SD[["cpos"]]), cpos_right = max(.SD[["cpos"]]))}, by = "id"]

# Decode subcorpus as Annotation object
## Not run: 
if (requireNamespace("NLP")){
  library(NLP)
  p <- corpus("GERMAPARLMINI") %>%
    subset(date == "2009-11-10" & speaker == "Angela Dorothea Merkel")
  s <- as(p, "String")
  a <- as(p, "Annotation")
  
  # The beauty of having this NLP Annotation object is that you can now use 
  # the different annotators of the openNLP package. Here, just a short scenario
  # how you can have a look at the tokenized words and the sentences.

  words <- s[a[a$type == "word"]]
  sentences <- s[a[a$type == "sentence"]] # does not yet work perfectly for plenary protocols 
  
  doc <- decode(p, to = "AnnotatedPlainTextDocument")
}

## End(Not run)
 
# decode vector of token ids
y <- decode(0:20, corpus = "GERMAPARLMINI", p_attributes = "word")
dt <- data.table::data.table(cpos = cpos("GERMAPARLMINI", query = "Liebe")[,1])
decode(dt, corpus = "GERMAPARLMINI", p_attributes = c("word", "pos"))
y <- dt[, .N, by = c("word", "pos")]

polmineR documentation built on Nov. 2, 2023, 5:52 p.m.