decode-method | R Documentation |
Decode corpus or subcorpus and return class specified by argument 'to'.
decode(.Object, ...)
## S4 method for signature 'corpus'
decode(
.Object,
to = c("data.table", "Annotation", "AnnotatedPlainTextDocument"),
p_attributes = NULL,
s_attributes = NULL,
mw = NULL,
stoplist = NULL,
decode = TRUE,
verbose = TRUE
)
## S4 method for signature 'character'
decode(
.Object,
to = c("data.table", "Annotation"),
s_attributes = NULL,
p_attributes = NULL,
decode = TRUE,
verbose = TRUE
)
## S4 method for signature 'slice'
decode(
.Object,
to = c("data.table", "Annotation", "AnnotatedPlainTextDocument"),
s_attributes = NULL,
p_attributes = NULL,
mw = NULL,
stoplist = NULL,
decode = TRUE,
verbose = TRUE
)
## S4 method for signature 'partition'
decode(
.Object,
to = "data.table",
s_attributes = NULL,
p_attributes = NULL,
decode = TRUE,
verbose = TRUE
)
## S4 method for signature 'subcorpus'
decode(
.Object,
to = c("data.table", "Annotation", "AnnotatedPlainTextDocument"),
s_attributes = NULL,
p_attributes = NULL,
mw = NULL,
stoplist = NULL,
decode = TRUE,
verbose = TRUE
)
## S4 method for signature 'integer'
decode(.Object, corpus, p_attributes, boost = NULL)
## S4 method for signature 'data.table'
decode(.Object, corpus, p_attributes)
.Object | The |
... | Further arguments. |
to | The class of the returned object, stated as a length-one |
p_attributes | The positional attributes to decode. If |
s_attributes | The structural attributes to decode. If |
mw | A |
stoplist | A |
decode | A |
verbose | A |
corpus | A CWB indexed corpus, either a length-one |
boost | A length-one |
The primary purpose of the method is type conversion. By obtaining the corpus
or subcorpus in the format specified by the argument 'to', the data can be
processed with tools that do not rely on the Corpus Workbench (CWB).
Supported output formats are data.table (which can be converted to a
data.frame or tibble easily) or an Annotation object as defined in the
package 'NLP'. Another purpose of decoding the corpus can be to rework it,
and to re-import it into the CWB (e.g. using the 'cwbtools'-package).
An earlier version of the method included an option to decode a single
s-attribute, which is no longer supported. See the s_attribute_decode()
function of the package RcppCWB.
If .Object is an integer vector, it is assumed to be a vector of integer
token ids of a p-attribute. The decode-method will translate token ids to
string values as efficiently as possible. The approach taken will depend on
the corpus size and the share of the corpus that is to be decoded. To decode
a large number of integer ids, it is more efficient to read the lexicon file
from the data directory directly and to index the lexicon with the ids rather
than relying on RcppCWB::cl_id2str. The internal decision rule is to use the
lexicon file when the corpus is larger than 10 million tokens and more than
5 percent of the corpus is to be decoded. The encoding of the character
vector that is returned will be the encoding of the locale
(usually ISO-8859-1 on Windows, and UTF-8 on macOS and Linux machines).
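The decision rule described above can be sketched in base R. This is only an illustration, not polmineR's implementation: the helpers use_lexicon_file() and decode_ids() and the toy lexicon are hypothetical, the thresholds are taken from the description (read here as 10 million tokens and 5 percent), and token ids are assumed to be zero-based.

```r
# Hypothetical helper: TRUE when reading the lexicon file is expected to pay off,
# i.e. the corpus is large AND a sizeable share of it is to be decoded.
use_lexicon_file <- function(n_ids, corpus_size,
                             min_size = 1e7, min_share = 0.05) {
  corpus_size > min_size && (n_ids / corpus_size) > min_share
}

# Hypothetical helper: index an in-memory lexicon with zero-based token ids.
# R vectors are one-based, hence the offset of 1.
decode_ids <- function(ids, lexicon) {
  lexicon[ids + 1L]
}

lexicon <- c("the", "oil", "price", "rose")  # toy lexicon, ids 0 to 3
tokens <- decode_ids(c(1L, 2L, 3L, 0L, 1L), lexicon)

# A small corpus never triggers the lexicon approach; a large one does
# once the share of ids to decode exceeds the threshold.
small <- use_lexicon_file(n_ids = 100, corpus_size = 5e6)
large <- use_lexicon_file(n_ids = 2e6, corpus_size = 2e7)
```

Indexing a character vector with all ids at once is a single vectorised operation, which is why it can be cheaper than many individual lookups when a large share of the corpus is decoded.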
The decode-method for data.table objects will decode token ids (columns
named '[p-attribute]_id', e.g. 'word_id'), adding the corresponding string
as a new column. If a column "cpos" with corpus positions is present, ids are
first derived for the corpus positions given. If the data.table has neither a
column "cpos" nor columns with token ids (i.e. column names ending with
"_id"), the input data.table is returned unchanged. Note that columns are
added to the data.table in an in-place operation to use memory
parsimoniously.
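The in-place behaviour can be illustrated with the data.table package alone. The sketch below does not call polmineR: the toy lexicon and the helper add_strings() are hypothetical stand-ins for the lookup the method performs, and token ids are assumed to be zero-based.

```r
library(data.table)

lexicon <- c("the", "oil", "price")  # toy lexicon, ids 0 to 2

# Hypothetical helper: add a string column for a column named '<p_attr>_id'.
# data.table::set() modifies the table by reference, so no copy is made.
add_strings <- function(dt, p_attr) {
  id_col <- paste0(p_attr, "_id")
  # Without a matching id column, the input is returned unchanged.
  if (!id_col %in% colnames(dt)) return(invisible(dt))
  set(dt, j = p_attr, value = lexicon[dt[[id_col]] + 1L])
  invisible(dt)
}

dt <- data.table(word_id = c(1L, 2L, 0L))
add_strings(dt, "word")  # return value not assigned: dt itself gains a column
```

Because the table is modified by reference, the caller does not need to reassign the result, which mirrors how the examples call decode() on a data.table without assigning its return value.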
The return value will correspond to the class specified by argument 'to'.
To decode a structural attribute, you can use the s_attributes-method,
setting argument unique as FALSE, and s_attribute_decode. See as.VCorpus to
decode a partition_bundle object, returning a VCorpus object.
use("polmineR")
use(pkg = "RcppCWB", corpus = "REUTERS")
# Decode corpus as data.table
dt <- decode("REUTERS", to = "data.table")
# Decode corpus selectively
dt <- decode(
  "REUTERS",
  to = "data.table",
  p_attributes = "word",
  s_attributes = "id"
)
# Decode a subcorpus
dt <- corpus("REUTERS") %>%
  subset(id %in% c("127", "144")) %>%
  decode(s_attributes = "id", to = "data.table")
# Decode partition
dt <- partition("REUTERS", places = "kuwait", regex = TRUE) %>%
  decode(to = "data.table")
# Previous versions of polmineR offered an option to decode a single
# s-attribute. This is how you could proceed to get a table with metadata.
dt <- partition("REUTERS", places = "kuwait", regex = TRUE) %>%
  decode(s_attributes = "id", decode = FALSE, to = "data.table")
# Drop the token column, then get the first and last corpus position per id
dt[, "word" := NULL]
dt[, list(cpos_left = min(cpos), cpos_right = max(cpos)), by = "id"]
# Decode subcorpus as Annotation object
## Not run:
if (requireNamespace("NLP")){
  library(NLP)
  p <- corpus("GERMAPARLMINI") %>%
    subset(date == "2009-11-10" & speaker == "Angela Dorothea Merkel")
  s <- as(p, "String")
  a <- as(p, "Annotation")
  # The beauty of having this NLP Annotation object is that you can now use
  # the different annotators of the openNLP package. Here, just a short scenario
  # how you can have a look at the tokenized words and the sentences.
  words <- s[a[a$type == "word"]]
  # does not yet work perfectly for plenary protocols
  sentences <- s[a[a$type == "sentence"]]
  doc <- decode(p, to = "AnnotatedPlainTextDocument")
}
## End(Not run)
# decode vector of token ids
y <- decode(0:20, corpus = "GERMAPARLMINI", p_attributes = "word")
dt <- data.table::data.table(cpos = cpos("GERMAPARLMINI", query = "Liebe")[,1])
decode(dt, corpus = "GERMAPARLMINI", p_attributes = c("word", "pos"))
y <- dt[, .N, by = c("word", "pos")]