encode: Encode CWB Corpus.

encode R Documentation

Encode CWB Corpus.

Description

[Experimental]

Usage

encode(x, ...)

## S4 method for signature 'data.frame'
encode(
  x,
  corpus,
  s_attributes = NULL,
  encoding = "utf8",
  registry_dir = fs::path(tempdir(), "cwb_registry"),
  data_dir = fs::path(tempdir(), "cwb_data_dir", tolower(corpus)),
  properties = c(),
  method = c("R", "CWB"),
  verbose = TRUE,
  compress = FALSE,
  reload = TRUE,
  quietly = TRUE
)

Arguments

x

A data.frame or an object inheriting from data.frame (such as tibble, data.table).

...

Further arguments (unused).

corpus

ID of the CWB corpus to create.

s_attributes

A list of data.frame objects, each with the columns 'cpos_left' and 'cpos_right' plus one or more columns holding s-attribute values; the names of these additional columns serve as the names of the s-attributes. If s_attributes is a data.frame, it will be coerced to a list.
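A minimal base R sketch of how such a region table might be derived from a token stream. The `tokens` data and the object names are hypothetical, and the sketch assumes that all tokens of one group are contiguous, so that the first and last corpus position delimit the region:

```r
# Hypothetical token stream: one row per token, plus a grouping variable.
tokens <- data.frame(
  party = c("A", "A", "A", "B", "B"),
  word = c("We", "will", "act", "Vote", "now"),
  stringsAsFactors = FALSE
)

# Corpus positions are zero-based, as in the example at the end of this page.
tokens$cpos <- seq_len(nrow(tokens)) - 1L

# One region per party: first and last corpus position of its tokens.
left <- tapply(tokens$cpos, tokens$party, min)
right <- tapply(tokens$cpos, tokens$party, max)

regions <- data.frame(
  cpos_left = as.integer(left),
  cpos_right = as.integer(right),
  party = names(left), # this column name would become the s-attribute name
  stringsAsFactors = FALSE
)
```

A table shaped like `regions` could then be passed as `s_attributes`, either directly or wrapped in a list.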

encoding

Encoding as defined in the charset corpus property of the registry file for the corpus ('latin1' to 'latin9', and 'utf8').
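Because the declared encoding should match the actual encoding of the token data, it can be worth checking the input before encoding. The following is a sketch using base R only; the `words` vector is hypothetical, and whether encode() performs a comparable check itself is not stated here:

```r
# Hypothetical tokens containing non-ASCII characters.
words <- c("immigration", "caf\u00e9", "na\u00efve")

# Make sure the strings are UTF-8 before using encoding = "utf8".
words_utf8 <- enc2utf8(words)
stopifnot(all(validUTF8(words_utf8)))

# For a latin1 corpus, the tokens could be converted first.
words_latin1 <- iconv(words_utf8, from = "UTF-8", to = "latin1")
```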

registry_dir

Registry directory.

data_dir

The data directory for the binary files of the corpus.
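As the Usage section shows, both directories default to locations below tempdir(). The sketch below reproduces these defaults with base R's file.path() as a stand-in for fs::path(); the corpus ID is taken from the example on this page:

```r
corpus <- "UKIMMIG2010"

# Default registry directory: one shared directory in the session's tempdir.
registry_dir <- file.path(tempdir(), "cwb_registry")

# Default data directory: one subdirectory per corpus, named in lower case.
data_dir <- file.path(tempdir(), "cwb_data_dir", tolower(corpus))
```

Since tempdir() is specific to the R session, a corpus created with the defaults does not persist after the session ends. Passing permanent directories as `registry_dir` and `data_dir` is the way to keep it.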

properties

A named character vector with corpus properties that will be added to the registry file describing the corpus. The names of the vector are the property names (such as "version"), and the values of the vector are the corresponding property values.
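A short sketch of what such a vector looks like. The "lang" property is the one used in the example on this page; the version string is a made-up value for illustration:

```r
# Names are property names, values are property values.
properties <- c(lang = "en", version = "0.1.0")

# Sanity check: every element needs a non-empty name.
stopifnot(
  is.character(properties),
  !is.null(names(properties)),
  all(nzchar(names(properties)))
)
```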

method

Either 'CWB' or 'R', defaults to 'R'. See section 'Details'.

verbose

A logical value, whether to output progress messages.

compress

A logical value, whether to compress the corpus by running RcppCWB::cwb_huffcode() and RcppCWB::cwb_compress_rdx() (method 'R'), or the command line tools cwb-huffcode and cwb-compress-rdx (method 'CWB'). Defaults to FALSE because compression is not stable on Windows.

reload

A logical value, whether to reload the corpus to make it immediately available.

quietly

A logical value passed into RcppCWB::cwb_makeall(), RcppCWB::cwb_huffcode() and RcppCWB::cwb_compress_rdx() to control the verbosity of these functions.

Examples

# This example runs conditionally because the packages it needs are only suggested.

dplyr_available <- requireNamespace("dplyr")
tidytext_available <- requireNamespace("tidytext")
quanteda_available <- requireNamespace("quanteda")

if (dplyr_available && tidytext_available && quanteda_available){

library(dplyr) # pipe would not be available otherwise
library(tidytext)

registry_tmp <- fs::path(tempdir(), "cwb_registry")
if (!dir.exists(registry_tmp)) dir.create(registry_tmp) # avoid a warning if it already exists

tidydata <- quanteda::data_char_ukimmig2010 %>%
   as.data.frame() %>%
   as_tibble(rownames = "party") %>%
   rename(`text` = ".")
   
tokenstream <- tidydata %>%
   unnest_tokens(word, text, to_lower = FALSE, strip_punct = FALSE) %>%
   mutate(cpos = 0L:(nrow(.) - 1L))
   
metadata <- tokenstream %>% 
  group_by(party) %>% 
  summarise(cpos_left = min(cpos), cpos_right = max(cpos))

tokenstream %>%
  select(-cpos, -party) %>%
  encode(
    corpus = "UKIMMIG2010",
    s_attributes = metadata,
    properties = c(lang = "en")
  )
  
}

PolMine/cwbtools documentation built on May 1, 2024, 12:01 a.m.