corpus-class | R Documentation |
Corpora indexed using the 'Corpus Workbench' ('CWB') offer an efficient data
structure for large, linguistically annotated corpora. The corpus-class
keeps basic information on a CWB corpus. Corresponding to the name of the
class, the corpus-method is the initializer for objects of the corpus class.
A CWB corpus can also be hosted remotely on an OpenCPU server. The
remote_corpus class (which inherits from the corpus class) will handle
respective information. A (limited) set of polmineR functions and methods
can be executed on the corpus on the remote machine from the local R session
by calling them on the remote_corpus object. Calling the corpus-method
without an argument will return a data.frame with basic information on the
corpora that are available.
## S4 method for signature 'character'
corpus(.Object, registry_dir, server = NULL, restricted)
## S4 method for signature 'missing'
corpus()
.Object |
The upper-case ID of a CWB corpus stated by a
length-one |
registry_dir |
The registry directory with the registry file describing
the corpus (length-one |
server |
If |
restricted |
A |
Calling corpus() will return a data.frame listing the corpora available
locally and described in the active registry directory, and some basic
information on the corpora.
A corpus object is instantiated by passing a corpus ID as argument .Object.
Following the conventions of the Corpus Workbench (CWB), corpus IDs are
written in upper case. If .Object includes lower case letters, the corpus
object is instantiated nevertheless, but a warning is issued to discourage
bad practice. If .Object is not a known corpus, the error message will
include a suggestion if there is a potential candidate that can be
identified by agrep.
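The behaviour described above can be illustrated with base R alone. This is a minimal sketch, not polmineR's actual implementation: the helper `suggest_corpus()`, the vector `known_ids` and the `max_distance` default are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: mimics the documented behaviour (warning on lower case
# IDs, agrep-based suggestion for unknown IDs). polmineR's internals may differ.
suggest_corpus <- function(id, known_ids, max_distance = 0.2) {
  # Lower case letters are tolerated, but trigger a warning.
  if (id != toupper(id)) {
    warning("corpus ID '", id, "' includes lower case letters")
  }
  id_upper <- toupper(id)
  if (id_upper %in% known_ids) return(id_upper)

  # Approximate matching against the known corpus IDs.
  candidates <- agrep(id_upper, known_ids, max.distance = max_distance, value = TRUE)
  msg <- sprintf("corpus '%s' is not available", id)
  if (length(candidates) > 0L) {
    msg <- sprintf("%s; did you mean '%s'?", msg, candidates[1])
  }
  stop(msg, call. = FALSE)
}

# Assumed list of available corpora, for illustration only.
known_ids <- c("REUTERS", "GERMAPARLMINI")

# A known ID is returned as is.
suggest_corpus("REUTERS", known_ids)

# A misspelt ID raises an error; the message is captured here for inspection.
msg <- tryCatch(
  suggest_corpus("REUTER", known_ids),
  error = function(e) conditionMessage(e)
)
print(msg)
```

The threshold passed to `max.distance` controls how tolerant the suggestion is; a stricter value yields fewer, closer candidates.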
A limited set of methods of the polmineR package is exposed to be executed
on a remote OpenCPU server. As a matter of convenience, the whereabouts of
an OpenCPU server hosting a CWB corpus can be stated in an environment
variable "OPENCPU_SERVER". Environment variables for R sessions can be set
easily in the .Renviron file. A convenient way to do this is to call
usethis::edit_r_environ().
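As a small sketch of how the environment variable can be handled, the following uses base R only. The URL is a placeholder, not a real server; in the .Renviron file, the equivalent entry would be a single line of the form `OPENCPU_SERVER=https://opencpu.example.org`.

```r
# Placeholder URL (assumption): replace with the address of your own server.
Sys.setenv(OPENCPU_SERVER = "https://opencpu.example.org")

# Retrieve the value, as done in the examples when a remote corpus is created.
server <- Sys.getenv("OPENCPU_SERVER")

# Sys.getenv() returns an empty string for unset variables, so check first.
if (!nzchar(server)) {
  stop("environment variable OPENCPU_SERVER is not set")
}
```

Setting the variable with `Sys.setenv()` only affects the current session, whereas an entry in .Renviron is picked up whenever R starts.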
corpus
A length-one character vector, the upper-case ID of a CWB corpus.
registry_dir
Registry directory with registry file describing the corpus.
data_dir
The directory where binary files of the indexed corpus reside.
info_file
If available, the info file indicated in the registry file (typically a file named .info or info.md in the data directory), or NA if not.
template
Full path to the template containing formatting instructions when showing full text output (fs_path object or NA).
type
If available, the type of the corpus (e.g. "plpr" for a corpus of plenary protocols), or NA.
name
Full name of the corpus that may be more expressive than the corpus ID.
xml
Object of class character, whether the xml is "flat" or "nested".
encoding
The encoding of the corpus, given as a length-one character vector (usually 'utf8' or 'latin1').
size
Number of tokens (size) of the corpus, a length-one integer vector.
server
The URL (can be IP address) of the OpenCPU server. The slot is available only with the remote_corpus class inheriting from the corpus class.
user
If the corpus on the server requires authentication, the username.
password
If the corpus on the server requires authentication, the password.
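The relationship between the two classes, with the server slot existing only on the inheriting class, can be illustrated with a toy S4 example. The classes `toy_corpus` and `toy_remote_corpus` are hypothetical and carry only a few of the slots listed above; they are not polmineR's class definitions, and the slot values are arbitrary placeholders.

```r
library(methods)

# Toy base class with two of the documented slots (illustration only).
setClass("toy_corpus", slots = c(corpus = "character", size = "integer"))

# Toy subclass: inherits all slots of toy_corpus and adds a server slot.
setClass("toy_remote_corpus", contains = "toy_corpus", slots = c(server = "character"))

# Arbitrary placeholder values, not properties of a real corpus.
x <- new(
  "toy_remote_corpus",
  corpus = "REUTERS",
  size = 1000L,
  server = "https://opencpu.example.org"
)

x@corpus                 # slot inherited from the base class
is(x, "toy_corpus")      # the subclass object is also a toy_corpus
slotNames("toy_corpus")  # the base class has no server slot
```

Because of this inheritance, a method defined for the base class can in principle be dispatched on an object of the inheriting class as well.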
Methods to extract basic information from a corpus object are covered by the
corpus-methods documentation object. Use the s_attributes method to get
information on structural attributes. Analytical methods available for
corpus objects are size, count, dispersion, kwic, cooccurrences,
as.TermDocumentMatrix.
Other classes to manage corpora: phrases-class, ranges-class, regions, subcorpus
use(pkg = "RcppCWB", corpus = "REUTERS")
# get corpora present locally
y <- corpus()
# initialize corpus object
r <- corpus("REUTERS")
r <- corpus("reuters") # will work, but will result in a warning
# apply core polmineR methods
a <- size(r)
b <- s_attributes(r)
c <- count(r, query = "oil")
d <- dispersion(r, query = "oil", s_attribute = "id")
e <- kwic(r, query = "oil")
f <- cooccurrences(r, query = "oil")
# use corpus initialization in a pipe
y <- corpus("REUTERS") %>% s_attributes()
y <- corpus("REUTERS") %>% count(query = "oil")
# working with a remote corpus
## Not run:
REUTERS <- corpus("REUTERS", server = Sys.getenv("OPENCPU_SERVER"))
count(REUTERS, query = "oil")
size(REUTERS)
kwic(REUTERS, query = "oil")
GERMAPARL <- corpus("GERMAPARL", server = Sys.getenv("OPENCPU_SERVER"))
s_attributes(GERMAPARL)
size(x = GERMAPARL)
count(GERMAPARL, query = "Integration")
kwic(GERMAPARL, query = "Islam")
p <- partition(GERMAPARL, year = 2000)
s_attributes(p, s_attribute = "year")
size(p)
kwic(p, query = "Islam", meta = "date")
GERMAPARL <- corpus("GERMAPARLMINI", server = Sys.getenv("OPENCPU_SERVER"))
s_attrs <- s_attributes(GERMAPARL, s_attribute = "date")
sc <- subset(GERMAPARL, date == "2009-11-10")
## End(Not run)