encode-method: Encode CWB Corpus.


Description

Encode a CWB corpus from a data.frame or data.table.

Usage

encode(.Object, ...)

## S4 method for signature 'data.frame'
encode(.Object, name, pAttributes = "word",
  sAttributes = NULL, registry = Sys.getenv("CORPUS_REGISTRY"),
  indexedCorpusDir = NULL, verbose = TRUE)

## S4 method for signature 'data.table'
encode(.Object, corpus, sAttribute)

Arguments

.Object

a data.frame (or a data.table, depending on the method) to encode

...

further parameters

name

name of the (new) CWB corpus

pAttributes

columns of .Object holding token-level annotation (such as word, pos or lemma)

sAttributes

columns of .Object that will be encoded as structural attributes

registry

path to the corpus registry

indexedCorpusDir

parent directory in which the directory for the indexed corpus files is created

verbose

logical, whether to print progress messages

corpus

the name of the CWB corpus

sAttribute

a single s-attribute
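
As a rough illustration of the arguments above, the sketch below builds a small table with one row per token, the shape that unnest_tokens produces in the example on this page. The data, the corpus name "demo", the helper variable names and the check on CORPUS_REGISTRY are all assumptions made for illustration. The encode() call is left commented out because it needs an installed CWB and a writable registry.

```r
# Hypothetical token table: one row per token. Document-level values
# (language, places) are repeated on every row of the same document.
tokens <- data.frame(
  word = c("Oil", "prices", "rose", "OPEC", "met"),
  language = rep("en", 5),
  places = c(rep("usa", 3), rep("kuwait|uae", 2)),
  stringsAsFactors = FALSE
)

# Default value of the registry argument, read from the environment.
registry <- Sys.getenv("CORPUS_REGISTRY")
registry_is_set <- nzchar(registry)

# Columns that would be passed as pAttributes and sAttributes.
p_attrs <- "word"
s_attrs <- c("language", "places")
columns_present <- all(c(p_attrs, s_attrs) %in% colnames(tokens))

# Not run: requires CWB and a writable registry directory.
# encode(tokens, name = "demo", pAttributes = p_attrs, sAttributes = s_attrs)
```

Checking columns_present and registry_is_set before calling encode() is only a convenience for catching typos and a missing environment variable early. The page does not say that encode() needs or performs these checks.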

Examples

## Not run: 
library(tm)
library(tibble)
library(tidytext)
library(plyr)

# Read the Reuters sample documents shipped with tm and convert them to a tibble
reut21578 <- system.file("texts", "crude", package = "tm")
reuters.tm <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
reuters.tibble <- tidy(reuters.tm)

# Collapse list columns into a single string per document
reuters.tibble[["topics_cat"]] <- sapply(
  reuters.tibble[["topics_cat"]],
  function(x) paste(x, collapse = "|")
)
reuters.tibble[["places"]] <- sapply(
  reuters.tibble[["places"]],
  function(x) paste(x, collapse = "|")
)

# Tokenize: one row per token, keeping the original case
reuters.tidy <- unnest_tokens(
  reuters.tibble, output = "word", input = "text", to_lower = FALSE
)
encode(reuters.tidy, name = "reuters", sAttributes = c("language", "places"))

## End(Not run)
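
The sapply() calls in the example flatten list columns (a document can have several places) into single strings, presumably so that each row carries exactly one value per attribute. Below is a minimal sketch of that step on made-up data. It uses vapply() instead of sapply(), a substitution made here so that the result is guaranteed to be a character vector.

```r
# Made-up list column: each element holds zero or more place names.
places <- list(c("kuwait", "uae"), "usa", character(0))

# Join each element into one string, separated by "|".
flat <- vapply(places, function(x) paste(x, collapse = "|"), character(1))
print(flat)
# [1] "kuwait|uae" "usa"        ""
```

An element with no values becomes an empty string, so documents without a place still get a value in the column.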

nrauscher/corpus documentation built on May 23, 2019, 9:34 p.m.