partition: Initialize a partition.
In nrauscher/corpus: Toolkit for Corpus Analysis


Set up an object of the partition class. Frequency lists are computed and kept in the stat-slot if pAttribute is not NULL.

partition(.Object, ...)

## S4 method for signature 'character'
partition(.Object, def = NULL, name = "",
  encoding = NULL, pAttribute = NULL, meta = NULL, regex = FALSE,
  xml = "flat", id2str = TRUE, type = NULL, mc = FALSE,
  verbose = TRUE, ...)

## S4 method for signature 'list'
partition(.Object, ...)

## S4 method for signature 'environment'
partition(.Object, slots = c("name", "corpus", "size",
  "pAttribute"))

## S4 method for signature 'partition'
partition(.Object, def = NULL, name = "",
  regex = FALSE, pAttribute = NULL, id2str = TRUE, type = NULL,
  verbose = TRUE, mc = FALSE, ...)

`.Object`	character-vector - the CWB-corpus to be used
`...`	parameters passed into the partition-method
`def`	list consisting of a set of character vectors (see details and examples)
`name`	name of the new partition, defaults to "" (an empty string)
`encoding`	encoding of the corpus (typically "LATIN1" or "UTF-8"); if NULL, the encoding provided in the registry file of the corpus (charset="...") will be used
`pAttribute`	the pAttribute(s) for which term frequencies shall be retrieved
`meta`	a character vector
`regex`	logical (defaults to FALSE); if TRUE, the s-attributes provided will be handled as regular expressions; the length of the character vectors with s-attributes then needs to be 1
`xml`	either 'flat' (default) or 'nested'
`id2str`	whether to turn token ids to strings (set FALSE to minimize object.size / memory consumption)
`type`	character vector (length 1) specifying the type of corpus / partition (e.g. "plpr")
`mc`	whether to use multicore (for counting terms)
`verbose`	logical, defaults to TRUE
`slots`	character vector

The function sets up a partition based on a list of s-attributes with their respective values. The s-attributes defining the partition are given as a list, e.g. list(text_type="speech", text_year="2013"). The values of the list may contain regular expressions. To use regular expression syntax, set the parameter regex to TRUE. Regular expressions are passed into grep, i.e. the regex syntax used in R needs to be used (double backslashes etc.).
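Because the patterns are passed into grep, they can be tried out on a plain character vector before a partition is built. The following is a minimal sketch in base R only; the vectors text_names and text_dates are hypothetical stand-ins for s-attribute values and are not read from a corpus.

```r
# Hypothetical s-attribute values, for illustration only
text_names <- c("Angela Dorothea Merkel", "Volker Kauder", "Peter Struck")
text_dates <- c("2009-11-10", "2013-01-17", "unknown")

# A pattern of the kind that would be supplied together with regex = TRUE
merkel_hits <- grep(".*Merkel", text_names, value = TRUE)

# Backslashes are doubled in R string literals: "\\d" is the regex \d
date_hits <- grep("^\\d{4}-\\d{2}-\\d{2}$", text_dates, value = TRUE)

print(merkel_hits)
print(date_hits)
```

Checking a pattern this way makes it easier to spot a missing backslash before the pattern is used to define a partition.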

The XML imported into the CWB may be "flat" or "nested". This needs to be indicated with the parameter xml (default is "flat"). If you generate a partition based on a flat XML structure, some performance gain may be achieved by ordering the s-attributes so that the conditions become decreasingly restrictive. If you have a nested XML, it is mandatory that the order of the s-attributes provided reflects the hierarchy of the XML: the top-level elements need to be positioned at the beginning of the list of s-attributes, the most restrictive elements at the end.

If pAttribute is not NULL, a count of tokens in the corpus will be performed and kept in the stat-slot of the partition-object. The length of the pAttribute character vector may be 1 or more. If two or more p-attributes are provided, the occurrences of their combinations will be counted. A typical scenario is to combine the p-attributes "word" or "lemma" with "pos".
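What counting combinations of p-attributes means can be illustrated on a toy token stream. This is a sketch in base R under stated assumptions: the data frame tokens and its values are hypothetical, and aggregate() is used here for illustration only; it is not the counting routine the package uses internally.

```r
# Toy token stream with two hypothetical p-attributes
tokens <- data.frame(
  word = c("the", "house", "the", "house", "houses"),
  pos  = c("DT", "NN", "DT", "VB", "NNS"),
  stringsAsFactors = FALSE
)

# Count how often each word/pos combination occurs
counts <- aggregate(
  list(count = rep(1L, nrow(tokens))),
  by = list(word = tokens$word, pos = tokens$pos),
  FUN = sum
)

print(counts)
```

Counted on "word" alone, "house" would appear twice; counted on the combination of "word" and "pos", the noun and the verb reading are kept apart.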

An object of the S4 class 'partition'

Andreas Blaette

## Not run: 
   use(polmineR.sampleCorpus)
   spd <- partition(
     "PLPRBTTXT", text_party="SPD", text_type="speech"
     )
   kauder <- partition(
     "PLPRBTTXT", text_name="Volker Kauder", pAttribute="word"
     )
   merkel <- partition(
     "PLPRBTTXT", text_name=".*Merkel",
     pAttribute="word", regex=TRUE
     )
   sAttributes(merkel, "text_date")
   sAttributes(merkel, "text_name")
   merkel <- partition(
     "PLPRBTTXT", text_name="Angela Dorothea Merkel",
     text_date="2009-11-10", text_type="speech", pAttribute="word"
     )
   merkel <- subset(merkel, !word %in% punctuation)
   merkel <- subset(merkel, !word %in% tm::stopwords("de"))
   
   # a certain defined time segment
   if (require("chron")){
     firstDay <- "2009-10-28"
     lastDay <- "2009-11-11"
     days <- strftime(
       chron::seq.dates(
         from = strftime(firstDay, format="%m/%d/%Y"),
         to = strftime(lastDay, format="%m/%d/%Y"),
         by="days"),
       format="%Y-%m-%d"
       )
     period <- partition("PLPRBTTXT", text_date=days)
   }

## End(Not run)
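The vector of days for a time segment can also be built without the chron package. The following is a hedged alternative in base R, not part of the original example; the variable names first_day and last_day are chosen here for illustration. The resulting character vector has the same "%Y-%m-%d" format as the text_date values used above.

```r
# Start and end of the period as Date objects
first_day <- as.Date("2009-10-28")
last_day <- as.Date("2009-11-11")

# One entry per calendar day, formatted like the text_date s-attribute
days <- format(seq(first_day, last_day, by = "day"), "%Y-%m-%d")

print(days)
```

The vector days could then be supplied as text_date=days, as in the example above.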