corpus-class: Corpus class initialization

corpus-class    R Documentation


Description

Corpora indexed with the 'Corpus Workbench' ('CWB') offer an efficient data structure for large, linguistically annotated corpora. The corpus class keeps basic information on a CWB corpus. In line with the name of the class, the corpus() method is the initializer for objects of the corpus class. Calling corpus() without an argument returns a data.frame with basic information on the corpora that are available.

A CWB corpus can also be hosted remotely on an OpenCPU server. The remote_corpus class, which inherits from the corpus class, keeps the corresponding information. A limited set of polmineR functions and methods can be executed on the corpus on the remote machine from the local R session by calling them on the remote_corpus object.


Usage

## S4 method for signature 'character'
corpus(.Object, registry_dir, server = NULL, restricted)

## S4 method for signature 'missing'
corpus()


Arguments

.Object
The upper-case ID of a CWB corpus, stated as a length-one character vector.

registry_dir
The registry directory with the registry file describing the corpus (length-one character vector). If missing, the C representations of loaded corpora will be evaluated to get the registry directory with the registry file for the corpus.

server
If NULL (default), the corpus is expected to be present locally. If provided, the name of an OpenCPU server (can be an IP address) that hosts a corpus, or several corpora. The corpus() method will then instantiate a remote_corpus object.

restricted
A logical value, whether access to a remote corpus is restricted (TRUE) or not (FALSE).


Details

Calling corpus() without arguments returns a data.frame that lists the corpora available locally and described in the active registry directory, together with some basic information on each corpus.

A corpus object is instantiated by passing a corpus ID as argument .Object. Following the conventions of the Corpus Workbench (CWB), corpus IDs are written in upper case. If .Object includes lower-case letters, the corpus object is instantiated nevertheless, but a warning is issued to discourage bad practice. If .Object is not a known corpus, the error message will include a suggestion if a potential candidate can be identified by agrep.
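The two checks described above can be illustrated in base R. This is a minimal sketch, not polmineR's actual implementation: the vector known_ids and the sample inputs are hypothetical, and the package may call agrep with different settings.

```r
# Hypothetical vector of known corpus IDs (a stand-in for the IDs found in
# the registry directory).
known_ids <- c("REUTERS", "GERMAPARLMINI")

# A corpus ID that includes lower-case letters differs from its upper-case form.
id <- "reuters"
has_lower_case <- id != toupper(id)

# A misspelt corpus ID: agrep() returns the known IDs that approximately
# match the pattern.
misspelt <- "REUTER"
candidates <- agrep(misspelt, known_ids, value = TRUE)
print(candidates)
```

Here has_lower_case flags the case in which a warning would be appropriate, and candidates holds the IDs that could be offered as a suggestion.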

A limited set of methods of the polmineR package is exposed to be executed on a remote OpenCPU server. As a matter of convenience, the whereabouts of an OpenCPU server hosting a CWB corpus can be stated in an environment variable "OPENCPU_SERVER". Environment variables for R sessions can be set easily in the .Renviron file. A convenient way to do this is to call usethis::edit_r_environ().
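A minimal sketch of how the environment variable might be set and read, using base R only. The server address is a hypothetical placeholder and needs to be replaced with the address of an actual OpenCPU server.

```r
# In the .Renviron file, a line such as the following would set the variable
# for every new R session (hypothetical address):
# OPENCPU_SERVER=https://opencpu.example.org

# For the current session only, the variable can be set directly.
Sys.setenv(OPENCPU_SERVER = "https://opencpu.example.org")

# Read the value back; unset = NA makes a missing variable easy to detect.
server <- Sys.getenv("OPENCPU_SERVER", unset = NA)
server_is_set <- !is.na(server) && nzchar(server)
```

The value can then be passed on as the server argument, for instance corpus("REUTERS", server = server).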



Slots

A length-one character vector, the upper-case ID of a CWB corpus.


Registry directory with registry file describing the corpus.


The directory where binary files of the indexed corpus reside.


If available, the info file indicated in the registry file (typically a file named .info in the data directory), or NA if not.


Full path to the template containing formatting instructions when showing full text output (fs_path object or NA).


If available, the type of the corpus (e.g. "plpr" for a corpus of plenary protocols), or NA.


Full name of the corpus that may be more expressive than the corpus ID.


Object of class character, whether the xml is "flat" or "nested".


The encoding of the corpus, given as a length-one character vector (usually 'utf8' or 'latin1').


Number of tokens (size) of the corpus, a length-one integer vector.


The URL (can be IP address) of the OpenCPU server. The slot is available only with the remote_corpus class inheriting from the corpus class.


If the corpus on the server requires authentication, the username.


If the corpus on the server requires authentication, the password.
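The slots of an instantiated corpus object can be inspected with the generic S4 tools of the methods package. The following is a sketch that assumes the polmineR and RcppCWB packages are installed, so that the REUTERS sample corpus is available. The variable slot_names is introduced for illustration only.

```r
# Fallback value, so that the script also runs if the packages are missing.
slot_names <- character()

if (requireNamespace("polmineR", quietly = TRUE) &&
    requireNamespace("RcppCWB", quietly = TRUE)) {
  library(polmineR)
  use(pkg = "RcppCWB", corpus = "REUTERS")
  r <- corpus("REUTERS")

  # List the names of all slots of the object.
  slot_names <- methods::slotNames(r)
  print(slot_names)

  # Access a single slot by its name, here the first one in the listing.
  print(methods::slot(r, slot_names[1]))
}
```

Direct slot access is mainly useful for inspection. For regular work, the accessor methods referred to in the See Also section are the more robust choice.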

See Also

Methods to extract basic information from a corpus object are covered by the corpus-methods documentation object. Use the s_attributes method to get information on structural attributes. Analytical methods available for corpus objects are size, count, dispersion, kwic, cooccurrences and as.TermDocumentMatrix.

Other classes to manage corpora: phrases-class, ranges-class, regions, subcorpus


Examples

library(polmineR)
use(pkg = "RcppCWB", corpus = "REUTERS")

# get corpora present locally
y <- corpus()

# initialize corpus object
r <- corpus("REUTERS")
r <- corpus("reuters") # will work, but will result in a warning

# apply core polmineR methods
a <- size(r)
b <- s_attributes(r)
c <- count(r, query = "oil")
d <- dispersion(r, query = "oil", s_attribute = "id")
e <- kwic(r, query = "oil")
f <- cooccurrences(r, query = "oil")

# use corpus initialization in a pipe
y <- corpus("REUTERS") %>% s_attributes()
y <- corpus("REUTERS") %>% count(query = "oil")

# working with a remote corpus
## Not run: 
REUTERS <- corpus("REUTERS", server = Sys.getenv("OPENCPU_SERVER"))
count(REUTERS, query = "oil")
kwic(REUTERS, query = "oil")

GERMAPARL <- corpus("GERMAPARL", server = Sys.getenv("OPENCPU_SERVER"))
size(x = GERMAPARL)
count(GERMAPARL, query = "Integration")
kwic(GERMAPARL, query = "Islam")

p <- partition(GERMAPARL, year = 2000)
s_attributes(p, s_attribute = "year")
kwic(p, query = "Islam", meta = "date")

GERMAPARL <- corpus("GERMAPARLMINI", server = Sys.getenv("OPENCPU_SERVER"))
s_attrs <- s_attributes(GERMAPARL, s_attribute = "date")
sc <- subset(GERMAPARL, date == "2009-11-10")

## End(Not run)

polmineR documentation built on Nov. 2, 2023, 5:52 p.m.