pkg_utils: Create and manage packages with corpus data.

pkg_utils    R Documentation

Description

Putting CWB indexed corpora into R data packages is a convenient way to ship and share corpora, and to keep documentation and supplementary functionality with the data.

[Deprecated]

Usage

pkg_create_cwb_dirs(pkg = ".", verbose = TRUE)

pkg_add_corpus(
  pkg = ".",
  corpus,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  verbose = TRUE
)

pkg_add_configure_scripts(pkg = ".")

pkg_add_description(
  pkg = ".",
  package = NULL,
  version = "0.0.1",
  date = Sys.Date(),
  author,
  maintainer = NULL,
  description = "",
  license = "",
  verbose = TRUE
)

pkg_add_creativecommons_license(
  pkg = ".",
  license = "CC-BY-NC-SA",
  file = system.file(package = "cwbtools", "txt", "licenses", "CC_BY-NC-SA_3.0.txt")
)

pkg_add_gitattributes_file(pkg = ".")

Arguments

pkg

Path to directory of data package or package name.

verbose

A logical value, whether to be verbose.

corpus

Name of the CWB corpus to insert into the package.

registry

Registry directory.

package

The package name (character). It may not include special characters or underscores ('_').

version

The version number of the corpus (defaults to "0.0.1").

date

The date of creation, defaults to Sys.Date().

author

The author of the package, either character vector or object of class person.

maintainer

Maintainer, R package style, either character vector or person.

description

Description of the data package.

license

The license.

file

Path to a file with the full text of the Creative Commons license.

Details

pkg_create_cwb_dirs will create the standard directory structure for storing registry files and indexed corpora within a package (./inst/extdata/cwb/registry and ./inst/extdata/cwb/indexed_corpora, respectively).
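
The layout described here can be reproduced with base R alone. The following is a minimal sketch under stated assumptions, not the cwbtools implementation; the directory name demo_pkg is made up for illustration.

```r
# Hypothetical package root in a temporary location (an assumption for this sketch).
pkg <- file.path(tempdir(), "demo_pkg")

# The two directories named in the documentation: one for registry files,
# one for the indexed corpus data.
cwb_dirs <- file.path(
  pkg, "inst", "extdata", "cwb",
  c("registry", "indexed_corpora")
)

# Create them, including all missing parent directories.
for (d in cwb_dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Report whether both directories exist.
print(dir.exists(cwb_dirs))
```

Files placed below inst/ are copied to the top level of the installed package, which is why the corpus data is kept under inst/extdata.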

pkg_add_corpus will add the corpus described in the registry directory to the package defined by pkg.

pkg_add_configure_scripts will add standardized and tested configure scripts to the top level directory of the data package (configure for Linux and macOS, configure.win for Windows), and the file setpaths.R to the tools subdirectory. The configuration mechanism ensures that the data directory is specified correctly in the registry files during the installation of the data package.
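
To illustrate what correcting the data directory involves: a CWB registry file records the location of the corpus data in a line starting with HOME, and this path has to match the place where the package ends up after installation. The sketch below rewrites that line with base R. The helper set_home_dir and the file contents are hypothetical and shown only to convey the idea; the setpaths.R script shipped by cwbtools may work differently.

```r
# Hypothetical helper (not part of cwbtools): point the HOME entry of a
# CWB registry file at the directory where the corpus data was installed.
set_home_dir <- function(registry_file, data_dir) {
  lines <- readLines(registry_file)
  is_home <- grepl("^HOME\\s", lines)
  if (!any(is_home)) stop("no HOME line found in ", registry_file)
  lines[is_home] <- sprintf("HOME %s", data_dir)
  writeLines(lines, registry_file)
  invisible(lines)
}

# Demo with a stripped-down, made-up registry file.
registry_file <- tempfile()
writeLines(
  c("NAME demo", "ID demo", "HOME /path/at/build/time", "ATTRIBUTE word"),
  registry_file
)
set_home_dir(registry_file, "/path/after/install")
print(readLines(registry_file))
```

Only the HOME line is touched, so all other declarations in the registry file are preserved as they were.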

pkg_add_description will add a DESCRIPTION file to the package.
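
As a rough illustration of what such a file contains, the sketch below validates a package name and writes a few fields with base R's write.dcf. It is an assumption-laden stand-in, not the code used by pkg_add_description: the helper is_valid_pkg_name and the selection of fields are hypothetical.

```r
# Check a candidate name against the rules for R package names: letters,
# digits and dots only, starting with a letter, not ending with a dot.
is_valid_pkg_name <- function(x) {
  grepl("^[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]$", x)
}

# Hypothetical field values, mirroring the arguments listed under Usage.
fields <- data.frame(
  Package = "reuters",
  Version = "0.0.1",
  Date = as.character(Sys.Date()),
  Author = "cwbtools",
  Description = "Reuters data package",
  stringsAsFactors = FALSE
)
stopifnot(is_valid_pkg_name(fields$Package))

# Write the fields in Debian control file format, as used by DESCRIPTION.
description_file <- file.path(tempdir(), "DESCRIPTION")
write.dcf(fields, file = description_file)
print(readLines(description_file))
```

The name check reflects the restriction on the package argument: a name containing an underscore, such as "my_corpus", would be rejected.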

pkg_add_creativecommons_license will add license information to the DESCRIPTION file, and move file LICENSE to the top level directory of the package.

pkg_add_gitattributes_file will add a file '.gitattributes' to the package. The file defines types of files that will be tracked by Git LFS, i.e. they will not be under conventional version control. This is suitable for large binary files, which is the scenario applicable for indexed corpus data.
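
For orientation, a '.gitattributes' file that hands files over to Git LFS consists of one line per file pattern. The sketch below generates such lines with base R. The list of file extensions is an assumption chosen for illustration and is not taken from cwbtools; the file written by pkg_add_gitattributes_file may cover different patterns.

```r
# File extensions of binary corpus files to route through Git LFS.
# This selection is an assumption for illustration only.
extensions <- c("corpus", "lexicon", "rng", "avs", "avx")

# One attribute line per pattern, in the form written by `git lfs track`.
lfs_lines <- sprintf("*.%s filter=lfs diff=lfs merge=lfs -text", extensions)

# Write the lines to a '.gitattributes' file in a temporary directory.
gitattributes_file <- file.path(tempdir(), ".gitattributes")
writeLines(lfs_lines, gitattributes_file)
print(readLines(gitattributes_file))
```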

References

Blätte, Andreas (2018). "Using Data Packages to Ship Annotated Corpora of Parliamentary Protocols: The GermaParl R Package", ParlaCLARIN 2018 Workshop Proceedings.

Examples

library(cwbtools)

# Use a temporary directory as the root of the data package
pkgdir <- fs::path_temp()
pkg_create_cwb_dirs(pkg = pkgdir)
pkg_add_description(
  pkg = pkgdir,
  package = "reuters",
  author = "cwbtools",
  description = "Reuters data package"
)
pkg_add_corpus(
  pkg = pkgdir, corpus = "REUTERS",
  registry = system.file(package = "RcppCWB", "extdata", "cwb", "registry")
)
pkg_add_gitattributes_file(pkg = pkgdir)
pkg_add_configure_scripts(pkg = pkgdir)
pkg_add_creativecommons_license(pkg = pkgdir)

PolMine/cwbtools documentation built on March 5, 2024, 10:21 a.m.