corpus_install | R Documentation |
Utility functions to assist the installation and management of indexed CWB corpora.
corpus_install(
pkg = NULL,
repo = "https://PolMine.github.io/drat/",
tarball = NULL,
doi = NULL,
checksum = NULL,
lib = .libPaths()[1],
registry_dir,
corpus_dir,
ask = interactive(),
load = TRUE,
verbose = TRUE,
user = NULL,
password = NULL,
...
)
corpus_packages()
corpus_rename(
old,
new,
registry_dir = Sys.getenv("CORPUS_REGISTRY"),
verbose = TRUE
)
corpus_remove(corpus, registry_dir, ask = interactive(), verbose = TRUE)
corpus_as_tarball(
corpus,
registry_dir,
data_dir = registry_file_parse(corpus, registry_dir)[["home"]],
tarfile,
verbose = TRUE
)
corpus_copy(
corpus,
registry_dir,
data_dir = registry_file_parse(corpus, registry_dir)[["home"]],
registry_dir_new = fs::path(tempdir(), "cwb", "registry"),
data_dir_new = fs::path(tempdir(), "cwb", "indexed_corpora", tolower(corpus)),
remove = FALSE,
verbose = interactive(),
progress = TRUE
)
corpus_recode(
corpus,
registry_dir = Sys.getenv("CORPUS_REGISTRY"),
data_dir = registry_file_parse(corpus, registry_dir)[["home"]],
skip = character(),
to = c("latin1", "UTF-8"),
verbose = TRUE
)
corpus_testload(
corpus,
registry_dir = Sys.getenv("CORPUS_REGISTRY"),
verbose = TRUE
)
corpus_get_version(corpus, registry_dir = Sys.getenv("CORPUS_REGISTRY"))
corpus_reload(corpus, registry_dir, verbose = TRUE)
pkg |
Name of a package (length-one |
repo |
URL of the repository. |
tarball |
URL, S3-URI or local filename of a tarball with a CWB indexed
corpus. If |
doi |
The DOI (Digital Object Identifier) of a corpus deposited at Zenodo (e.g. "10.5281/zenodo.3748858"). |
checksum |
A length-one |
lib |
Directory for R packages, defaults to |
registry_dir |
The corpus registry directory. If missing, the result of
|
corpus_dir |
The directory that contains the data directories of indexed
corpora. If missing, the value of |
ask |
A |
load |
A |
verbose |
Logical, whether to be verbose. |
user |
A user name that can be specified to download a corpus from a password protected site. |
password |
A password that can be specified to download a corpus from a password protected site. |
... |
Further parameters that will be passed into |
old |
Name of the (old) corpus. |
new |
Name of the (new) corpus. |
corpus |
The ID of a CWB indexed corpus (in upper case). |
data_dir |
The data directory where the files of the CWB corpus live. |
tarfile |
Filename of tarball. |
registry_dir_new |
Target directory for (new) registry files. |
data_dir_new |
Target directory for corpus files. |
remove |
A |
progress |
Logical, whether to show a progress bar. |
skip |
A character vector with s_attributes to skip. |
to |
Character string describing the target encoding of the corpus. |
A CWB corpus consists of a set of binary files with corpus data kept together in a data directory, and a registry file, which is a plain text file that details the corpus id, corpus properties, structural and positional attributes. The registry file also specifies the path to the corpus data directory. Typically, the registry directory and a corpus directory with the data directories for individual corpora are within one parent folder (which might be called "cwb" by default). See the following stylized directory structure.
.
|- registry/
|  |- corpus1
|  +- corpus2
|
+ indexed_corpora/
  |- corpus1/
  |  |- file1
  |  |- file2
  |  +- file3
  |
  +- corpus2/
     |- file1
     |- file2
     +- file3
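As a rough illustration of this layout, the following base R sketch builds the stylized structure in a temporary directory and reads the data directory back from a mock registry file. The corpus name "mycorpus", the helper registry_home() and the minimal registry content are assumptions made for this example only; real registry files are written by the CWB tools and carry more fields.

```r
# Build a mock CWB directory layout in a temporary folder (illustration only).
cwb_dir <- file.path(tempdir(), "cwb_layout_demo")
registry_dir <- file.path(cwb_dir, "registry")
data_dir <- file.path(cwb_dir, "indexed_corpora", "mycorpus")
dir.create(registry_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# A minimal mock registry file: plain text with the corpus id and the
# HOME line that points to the data directory.
registry_file <- file.path(registry_dir, "mycorpus")
writeLines(
  c('NAME "My corpus"', "ID mycorpus", paste("HOME", data_dir)),
  con = registry_file
)

# Hypothetical helper (not part of the package): extract the data
# directory from the HOME line of a registry file.
registry_home <- function(file) {
  lines <- readLines(file)
  home_line <- grep("^HOME\\s+", lines, value = TRUE)[1L]
  sub("^HOME\\s+", "", home_line)
}

home <- registry_home(registry_file)
```

In the package itself, the data directory is obtained with registry_file_parse(corpus, registry_dir)[["home"]], as the default values of the data_dir arguments show.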
The corpus_install() function will assist the installation of a corpus. The following scenarios are offered:
1. If argument tarball is a local tarball, the tarball will be extracted and files will be moved.
2. If tarball is a URL, the tarball will be downloaded from the online location. It is possible to state user credentials using the arguments user and password. Then the aforementioned installation (scenario 1) is executed. If argument pkg is the name of an installed package, corpus files will be moved into this package.
3. If argument doi is a Digital Object Identifier (DOI), the URL from which a corpus tarball can be downloaded is derived from the information available at that location. The tarball is downloaded and the corpus installed. If argument pkg is defined, files will be moved into an R package; otherwise the system registry and corpus directories are used. Note that at this stage, it is assumed that the DOI has been awarded by Zenodo.
4. If argument pkg is provided and tarball is NULL, corpora included in the package will be installed as system corpora, using the storage location specified by registry_dir.
If the corpus to be installed is already available and argument ask is TRUE, a dialogue will ask the user whether the existing corpus shall be deleted and installed anew.
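The four scenarios can be read as a decision on which arguments are supplied. The base R sketch below is a hypothetical helper, not part of the package, that only reports which scenario a set of arguments would fall into. The order of precedence when several arguments are given, and the treatment of S3 URIs as remote locations, are assumptions made for the example.

```r
# Hypothetical helper (not part of the package): report which of the
# four installation scenarios a set of arguments would fall into.
# Assumption: 'tarball' takes precedence over 'doi', and 'doi' over 'pkg'.
install_scenario <- function(pkg = NULL, tarball = NULL, doi = NULL) {
  if (!is.null(tarball)) {
    # Assumption: http(s) URLs and S3 URIs count as remote locations.
    is_remote <- grepl("^(https?|s3)://", tarball, ignore.case = TRUE)
    if (is_remote) return("2: download tarball, then install")
    return("1: install from local tarball")
  }
  if (!is.null(doi)) return("3: resolve DOI, download tarball, install")
  if (!is.null(pkg)) return("4: install corpora included in package")
  stop("one of 'tarball', 'doi' or 'pkg' is required")
}

s1 <- install_scenario(tarball = "~/Downloads/corpus.tar.gz")
s2 <- install_scenario(tarball = "https://example.org/corpus.tar.gz")
s3 <- install_scenario(doi = "10.5281/zenodo.3748858")
s4 <- install_scenario(pkg = "somepackage")
```

The file name, the example.org URL and the package name are placeholders. An actual installation for scenario 3 would be requested with a call such as corpus_install(doi = "10.5281/zenodo.3748858"), which downloads data and is therefore not run here.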
corpus_packages() will detect the packages that include CWB corpora. Note that the directory structure of all installed packages is evaluated, which may be slow on network-mounted file systems.
corpus_rename() will rename a corpus, affecting the name of the registry file, the corpus id, and the name of the directory where data files reside.
corpus_remove() can be used to delete a corpus.
corpus_as_tarball() will create a tarball (.tar.gz-file) with two subdirectories. The 'registry' subdirectory will host the registry file for the tarred corpus. The data files will be put in a subdirectory with the corpus name in the 'indexed_corpora' subdirectory.
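To make the tarball structure concrete, here is a minimal base R sketch that packs a mock corpus into a .tar.gz file with the two subdirectories described above. It uses utils::tar() rather than corpus_as_tarball() itself, and the corpus name "mycorpus" and the file contents are placeholders, so it shows the resulting layout and not the package's implementation.

```r
# Mock corpus layout in a scratch directory (illustration only).
src <- file.path(tempdir(), "tarball_demo")
dir.create(file.path(src, "registry"), recursive = TRUE, showWarnings = FALSE)
dir.create(
  file.path(src, "indexed_corpora", "mycorpus"),
  recursive = TRUE, showWarnings = FALSE
)
writeLines("ID mycorpus", file.path(src, "registry", "mycorpus"))
writeLines("mock data", file.path(src, "indexed_corpora", "mycorpus", "word.corpus"))

# Create the .tar.gz with paths relative to the scratch directory, so that
# 'registry' and 'indexed_corpora' are the two top-level subdirectories.
tarfile <- file.path(tempdir(), "mycorpus.tar.gz")
old_wd <- setwd(src)
utils::tar(
  tarfile,
  files = c("registry", "indexed_corpora"),
  compression = "gzip",
  tar = "internal"
)
setwd(old_wd)

# List the archive content to check the structure.
entries <- utils::untar(tarfile, list = TRUE, tar = "internal")
```

A tarball with this layout is the kind of input that the tarball argument of corpus_install() is described as accepting.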
corpus_copy() will create a copy of a corpus (useful for experimental modifications, for instance).
corpus_get_version() parses the registry file and derives the corpus version number from the corpus properties. The return value is a numeric_version class object. The corpus version is expected to follow semantic versioning (three digits, e.g. '0.8.1'). If the corpus version has another format or if it is not available, the return value is NA.
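The version rule can be sketched in base R as follows. The helper parse_corpus_version() is hypothetical and not the package's implementation. It assumes that the version is stored as a corpus property line of the form ##:: version = "0.8.1" in the registry file, and it returns NA for anything that is not a three-part version.

```r
# Hypothetical helper (not part of the package): derive a corpus version
# from the lines of a registry file.
# Assumption: the version is a corpus property line '##:: version = "x.y.z"'.
parse_corpus_version <- function(registry_lines) {
  hit <- grep("^##::\\s*version\\s*=", registry_lines, value = TRUE)
  if (length(hit) == 0L) return(NA)
  value <- sub('^##::\\s*version\\s*=\\s*"?([^"]*)"?\\s*$', "\\1", hit[1L])
  # Accept only three dot-separated numbers, e.g. 0.8.1.
  if (!grepl("^[0-9]+\\.[0-9]+\\.[0-9]+$", value)) return(NA)
  numeric_version(value)
}

v_ok <- parse_corpus_version(c("ID mycorpus", '##:: version = "0.8.1"'))
v_bad <- parse_corpus_version(c("ID mycorpus", '##:: version = "v1"'))
v_missing <- parse_corpus_version("ID mycorpus")
```

Returning a numeric_version object makes comparisons such as v_ok >= "0.8.0" possible, which is one reason to prefer it over a plain character string.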
corpus_reload() will unload a corpus if necessary and reload it. Useful to make new features of a corpus available after modification. Returns logical value TRUE if successful, FALSE if not.
Logical value TRUE if installation has been successful, or FALSE if not.
For managing registry files, see registry_file_parse for switching to a packaged corpus.
library(cwbtools)

# Copy the REUTERS corpus shipped with RcppCWB to the default temporary
# target (registry and data directories below tempdir()/cwb).
registry_file_new <- fs::path(tempdir(), "cwb", "registry", "reuters")
if (file.exists(registry_file_new)) file.remove(registry_file_new)
corpus_copy(
  corpus = "REUTERS",
  registry_dir = system.file(package = "RcppCWB", "extdata", "cwb", "registry"),
  data_dir = system.file(
    package = "RcppCWB",
    "extdata", "cwb", "indexed_corpora", "reuters"
  )
)

# Clean up
unlink(fs::path(tempdir(), "cwb"), recursive = TRUE)
corpus <- "REUTERS"
pkg <- "RcppCWB"
s_attr <- "places"
Q <- '"oil"'

# Source locations (corpus shipped with RcppCWB) and temporary targets
registry_dir_src <- system.file(package = pkg, "extdata", "cwb", "registry")
data_dir_src <- system.file(package = pkg, "extdata", "cwb", "indexed_corpora", tolower(corpus))
registry_dir_tmp <- fs::path(tempdir(), "cwb", "registry")
registry_file_tmp <- fs::path(registry_dir_tmp, tolower(corpus))
data_dir_tmp <- fs::path(tempdir(), "cwb", "indexed_corpora", tolower(corpus))

# Start from a clean target: remove an old registry file and old data files
if (file.exists(registry_file_tmp)) file.remove(registry_file_tmp)
if (!dir.exists(data_dir_tmp)) {
  dir.create(data_dir_tmp, recursive = TRUE)
} else {
  if (length(list.files(data_dir_tmp)) > 0L)
    file.remove(list.files(data_dir_tmp, full.names = TRUE))
}

# Copy the corpus to the temporary location
corpus_copy(
  corpus = corpus,
  registry_dir = registry_dir_src,
  data_dir = data_dir_src,
  registry_dir_new = registry_dir_tmp,
  data_dir_new = data_dir_tmp
)

# Check the encoding, recode the copy to UTF-8, then reload and check again
RcppCWB::cl_charset_name(corpus = corpus, registry = registry_dir_tmp)
corpus_recode(
  corpus = corpus,
  registry_dir = registry_dir_tmp,
  data_dir = data_dir_tmp,
  to = "UTF-8"
)
RcppCWB::cl_delete_corpus(corpus = corpus, registry = registry_dir_tmp)
RcppCWB::cqp_initialize(registry_dir_tmp)
RcppCWB::cl_charset_name(corpus = corpus, registry = registry_dir_tmp)

# Decode the values of the structural attribute of the recoded corpus
n_strucs <- RcppCWB::cl_attribute_size(
  corpus = corpus, attribute = s_attr, attribute_type = "s", registry = registry_dir_tmp
)
strucs <- 0L:(n_strucs - 1L)
struc_values <- RcppCWB::cl_struc2str(
  corpus = corpus, s_attribute = s_attr, struc = strucs, registry = registry_dir_tmp
)
places <- unique(struc_values)

# Run a CQP query against the recoded corpus and decode the matches
Sys.setenv("CORPUS_REGISTRY" = registry_dir_tmp)
if (RcppCWB::cqp_is_initialized()) RcppCWB::cqp_reset_registry() else RcppCWB::cqp_initialize()
RcppCWB::cqp_query(corpus = corpus, query = Q)
cpos <- RcppCWB::cqp_dump_subcorpus(corpus = corpus)
ids <- RcppCWB::cl_cpos2id(
  corpus = corpus, p_attribute = "word", registry = registry_dir_tmp, cpos = cpos
)
str <- RcppCWB::cl_id2str(
  corpus = corpus, p_attribute = "word", registry = registry_dir_tmp, id = ids
)
unique(str)

# Clean up
unlink(fs::path(tempdir(), "cwb"), recursive = TRUE)