zenodo_get_tarball() fails gracefully if Zenodo is not available #72.
encode() to prospectively supersede CorpusData class. Includes argument properties #13.
corpus_reload() for convenient unloading/reloading corpora #68.
registry_set_name() #13.
cwb_get_url() will get CWB v3.5 installation files #63.
corpus_remove() returns FALSE (rather than failing with ERROR) when corpus does not exist. More telling messages.
p_attribute_encode()
has new argument quietly passed into RcppCWB functions cwb_compress(), cwb_huffcode() and cwb_compress_rdx() to control verbosity.
$encode() of CorpusData class has new argument quietly passed into p_attribute_encode().
$encode() has new argument reload to trigger unloading and reloading corpus, to make s-attributes available #57.
CorpusData$encode() method uses messages from the cli package #59.
p_attribute_encode() rewritten, including explanation of argument compress and simplification of sample code #61.
s_attribute_encode()
coerces input values to character (rather than failing) #62.
s_attribute_encode(), p_attribute_encode() and CorpusData$encode() using a new (internal) function, a telling message is issued if non-ASCII or uppercase characters are used. The documentation has been augmented accordingly #48.
p_attribute_encode() checks whether files for encoded p-attribute exist and fails gracefully with telling error message if so #4.
compress defaults to FALSE as corpus compression is not stable on Windows #3.
corpus_as_tarball() and corpus_copy() now have registry_file_parse(corpus, registry_dir)[["home"]] as default value, so that values are more consistent across corpus_* functions #18.
cwb_get_bindir()
tries to find cwb-config system utility, if it is on the path.
s_attribute_encode() issues warning on Windows when using s-attribute 'id' #69.
normalizePath() by fs::path() in p_attribute_encode() #65.
p_attribute_encode() accepts multiple p-attributes if method is "CWB".
registry_set_property() for setting corpus properties in a pipe.
read_registry_file() will keep 'registry_dir' and 'corpus'.
registry_set_info as new auxiliary function to set path to info file in registry_data object.
corpus_install() reverts to package zen4R to get links to files at Zenodo #42.
curl::curl_download() replaces download.file() in corpus_install() if argument user is NULL (to avoid corrupted download from Zenodo) #53.
install.packages()
has been removed from the package. Using argument pkg of corpus_install() will install corpora found in a package as system corpora defined in the default registry directory #46.
https://github.com/PolMine/cookbook. Packages 'NLP' and 'openNLP' are no longer suggested and the install.packages() call (though not evaluated) is omitted. Part of the fix for #46.
fs::path() function replaces base R file.path() throughout to solidify the generation of paths and to improve the readability of the code.
p_attribute_encode() checks that the character vector token_stream does not exceed the CWB corpus size limit (2^31 - 1) #40.
zenodo_get_tarballurl()
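The token_stream size check noted above for p_attribute_encode() can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper check_token_stream_size() and its limit argument are hypothetical, and the limit is made a parameter only so that the failure case can be tried with a small vector.

```r
# Hypothetical helper (not part of cwbtools): verify that a token stream
# stays within the CWB corpus size limit of 2^31 - 1 tokens.
cwb_max_tokens <- 2^31 - 1

check_token_stream_size <- function(token_stream, limit = cwb_max_tokens) {
  stopifnot(is.character(token_stream))
  n <- length(token_stream)
  if (n > limit) {
    stop(sprintf(
      "token stream has %s tokens, which exceeds the limit of %s",
      format(n, scientific = FALSE),
      format(limit, scientific = FALSE)
    ))
  }
  invisible(TRUE)
}

# A short token stream passes the check.
check_token_stream_size(c("Hello", "world", "!"))
```

The limit coincides with the largest value of R's integer type (.Machine$integer.max), which is why it is a natural ceiling for integer corpus positions.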
is removed from package again (temporarily used when zen4R package did not work).
gparlsample_url_restricted has been updated to replace a URL that has become defunct.
zenodo_get_tarballurl() steps in for functionality of the zen4R package temporarily not working #42. It is used internally by the corpus_install() function.
zenodo_get_tarball() fails gracefully if Zenodo is temporarily not available.
p_attribute_rename(), corresponding to s_attribute_rename().
p_attribute_encode() will remove the [p_attr].corpus file as suggested by cwb-makeall (if compress is TRUE).
fs package for a consistent handling of paths (such as fs::path()) is used more widely (#36).
zenodo_get_tarball() for downloading corpus tarballs from Zenodo. Restricted access can be handled too (personalized URL with token).
corpus_install()
has new argument load to control whether corpus is loaded after installation.
pkg_add_description() is declared deprecated. To alert users, functionality of the lifecycle package is used (#1).
as.vrt() will generate valid *.vrt files from xml_document input.
cwb_corpus_dir(), the function would falsely yield NA results if the CWB directory contained more than two directories.
cwb_corpus_dir() and cwb_registry_dir(). Argument verbose can be used to suppress this output.
corpus_install() function will abort with a FALSE return value if the requested tarball is not available (#34).
s_attribute_rename() can be used to rename s-attributes.
corpus_get_version() will derive the corpus version number from the registry file and return a numeric version object (#16).
writeBin() to write long integer vectors has been overcome with R v4.0.0. A warning and a preliminary workaround to address this limitation when using p_attribute_encode() for corpora with more than 536870911 tokens can therefore be dropped. For large corpora, the function will check the R version and issue the recommendation to install R v4.0.0 or higher, if the size limitation (536870911) is relevant (#28).
cwb_get_url()
will return the MD5 checksum of the compressed file as attribute 'md5'.
cwb_install() function will fail gracefully if downloading the CWB fails (returning NULL). A new argument md5 will trigger checking the MD5 sum of the downloaded file (if provided). The default value of cwb_dir is now a temporary directory.
cwb_install() is skipped on Solaris to ensure that Solaris CRAN tests will not fail: A CWB binary is not available for Solaris.
tarball of corpus_install().
checksum for the corpus_install() function introduces functionality to check the integrity of a downloaded corpus tarball. If the tarball is downloaded from Zenodo (by stating a DOI using argument doi), the md5 checksum included in the record's metadata is extracted internally and used for checking.
corpus_copy()
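The integrity check described above for downloaded tarballs amounts to comparing an expected MD5 sum with the one computed from the local file. A minimal sketch using tools::md5sum() from base R follows; the helper verify_md5() is hypothetical and not part of cwbtools, and a small temporary text file stands in for a downloaded tarball.

```r
# Hypothetical helper (not part of cwbtools): compare the MD5 sum of a
# local file with an expected value, e.g. one taken from record metadata.
verify_md5 <- function(file, expected) {
  actual <- unname(tools::md5sum(file))
  identical(tolower(actual), tolower(expected))
}

# Stand-in for a downloaded tarball: a small temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines("dummy corpus content", tmp)

# In practice the expected checksum is supplied by the provider;
# here it is computed locally so that the example is self-contained.
expected <- unname(tools::md5sum(tmp))

verify_md5(tmp, expected)                 # matching checksum
verify_md5(tmp, strrep("0", 32))          # deliberately wrong checksum
```

Returning a logical value rather than raising an error leaves it to the caller to decide whether a mismatch should abort the installation or only trigger a warning.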
will accept a new argument remove. If TRUE (the default value is FALSE), files that have been copied will be removed. Removing files is reasonable to handle disk space parsimoniously if the source corpus is at a temporary location where nobody will miss it.
corpus_install() function will abort with a warning and return value FALSE rather than an error if the DOI is not offered by Zenodo.
corpus_install() is used to install a corpus from a tarball present locally, a somewhat confusing message suggested that the tarball was downloaded. This message is not shown any more.
cwb_install() now replaces an internally hardcoded argument cwb_dir with an argument cwb_dir; the function returns the directory where the CWB is installed rather than a NULL value.
cwb_get_bindir() now introduces an argument bindir.
compress of p_attribute_encode() now has default value FALSE (#29).
p_attribute_encode()
have been adapted so that GitHub Action unit test passes on Windows.
FALSE (#25).
RCurl::url.exists(), this function has been replaced by httr::http_error() (#31).
corpus_install() function still showed some progress messages even when verbose was set to FALSE (argument not passed to corpus_copy()). Fixed.
get_encoding() method would return NA if localeToCharset() failed to infer charset from locale. In this case, UTF-8 is assumed.
corpus_install()
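The UTF-8 fallback described above for get_encoding() can be sketched in a few lines of base R. The helper guess_encoding() is hypothetical and only illustrates the idea; it is not the package's code. It relies on utils::localeToCharset(), which yields NA when the charset cannot be inferred from the locale.

```r
# Hypothetical helper (not part of cwbtools): infer the session charset
# and assume UTF-8 when it cannot be determined from the locale.
guess_encoding <- function(charset = utils::localeToCharset()[1]) {
  if (length(charset) == 0L || is.na(charset[1])) {
    "UTF-8"
  } else {
    charset[1]
  }
}

guess_encoding()               # charset of the current session, or the fallback
guess_encoding(NA_character_)  # simulates a locale without an inferable charset
```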
function tried to ask for user feedback when not in an interactive session. The function now checks whether it is possible to ask for user feedback.
cwbtools::create_cwb_directories() function did show if verbose was FALSE. Fixed.
corpus_install() gives much better and nicer reports on steps performed during corpus downloads. User dialogues have been reworked thoroughly to provide better user guidance.
use_corpus_registry_envvar() function is called by corpus_install() and will amend the .Renviron file as appropriate if the user so desires.
corpus_testload() has been implemented to check whether a (newly installed) corpus is accessible.
jsonlite::fromJSON(). The auxiliary function to get and process information from Zenodo now ensures that newline characters are escaped such that they can be processed.
corpus_copy() function did not set the path to the info file to the new data directory - corrected.
corpus_install() function failed when the registry_dir got a NULL value from the default call to cwbtools::cwb_registry_dir(). But if the directories are created, the registry directory is there. Fixed.
registry_file_compose()
when the path includes any whitespace characters.
curl dependency of cwbtools that may arise when devtools::install_github() is used is addressed in an extended explanation in the README.md file on how to install the development version of cwbtools using remotes::install_github() (#21).
install_corpus() function has been reworked thoroughly. Using system directories for the registry and the corpus directory is now supported. This is a prerequisite for installing corpora outside of R packages. Installing corpora within corpora is not allowed by CRAN.
cwb_directories(), cwb_registry_dir(), cwb_corpus_dir() will get the whereabouts of the registry directory and the corpus directory. In particular, they consider that the polmineR package may have generated a temporary corpus registry, resetting the CORPUS_REGISTRY environment variable.
install_corpus()
function accepts an argument doi to provide a Document Object Identifier (DOI). At this stage, the DOI is assumed to be awarded by Zenodo. Information available at the Zenodo site will be resolved to get the URL of a corpus tarball that can be downloaded. Upon installing a corpus from Zenodo, the DOI and the version number will be written as corpus properties into the registry file.
corpus_install() function will ask the user for feedback if a corpus would be installed that is already present and that would be deleted or overwritten.
create_cwb_directories() and use_corpus_registry_envvar() will assist users to create the required directory structure for CWB indexed corpora.
CorpusData class.
pkg_add_corpus() function will now create the cwb directories (registry and data directory) if necessary. Previously, these directories were required to exist before moving a corpus into a package, making it necessary to put dummy files into packages to keep R CMD build from issuing warnings and git from dropping these directories. Creating the directories on demand is a precondition for a CRAN release of data packages (#11).
matrix class will inherit from class array. The new package version now takes into account that length(class(matrix(1:4,2,2))) will return the value 2.
pkgdown::build_site()
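The change to the class of matrix objects noted above matters for code that compares class(x) against a single string. The following sketch shows a check that does not depend on the length of the class vector; the variable names are illustrative, and the two-element class vector is assumed for R 4.0.0 or later only.

```r
m <- matrix(1:4, 2, 2)

# With R >= 4.0.0, class(m) has two elements ("matrix" and "array"),
# so a condition such as class(m) == "matrix" is no longer length one.
class(m)

# Checks that work regardless of the R version:
is.matrix(m)
inherits(m, "matrix")

# Version-aware guard, only needed if the class vector itself is inspected.
if (getRversion() >= "4.0.0") {
  n_classes <- length(class(m))
} else {
  n_classes <- 1L
}
n_classes
```

Using is.matrix() or inherits() keeps the test valid across R versions, whereas an equality comparison on class() silently changes meaning once the class vector grows.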
will generate a proper changelog page.
s_attribute_get_regions() and s_attribute_get_values().
corpus_install(), using download.file() replaces curl::curl_download() for Windows because curl apparently is not able to process target filenames that include special characters.
shortPathName() is used.
decode()-method will turn a partition into an Annotation object from the NLP package.
conll_get_regions()-function will turn a CoNLL-style annotated token stream into a table with regions that can be encoded using s_attribute_encode().
s_attribute_merge() will merge two data.table objects defining s-attributes, checking for overlaps.
p_attribute_recode(), s_attribute_recode(), and supplementary s_attributed_files(), and corpus_recode().
tempdir() is now wrapped as normalizePath(tempdir(), winslash = "/") to avoid problems on Windows, when different file separators may be used.
file.path(), the argument fsep is "/" to prevent confusion of file separators.
corpus_copy()
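The two path conventions noted above, wrapping tempdir() in normalizePath() with winslash = "/" and passing fsep = "/" to file.path(), can be combined as in the following base R sketch. The directory names "registry" and "mycorpus" are placeholders chosen for illustration.

```r
# Normalize the session's temporary directory so that forward slashes
# are used as separators on every platform, including Windows.
tmp_dir <- normalizePath(tempdir(), winslash = "/")

# Build a path below it with an explicit forward-slash separator.
# "registry" and "mycorpus" are placeholder names.
target <- file.path(tmp_dir, "registry", "mycorpus", fsep = "/")

tmp_dir
target
```

Keeping one separator throughout avoids mixed paths such as C:\Users\...\Temp/registry, which can trip up string comparisons and external tools.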
is available to create a copy of a corpus.
s_attribute_encode().
cl_delete_corpus() from RcppCWB is added to s_attribute_encode(), so that newly added s-attributes can be used without restarting the R session.
corpus_copy() was defined (and documented) twice in a confusing manner. This is cleaned up.
installed.packages() were replaced to follow advice from the CRAN team in the submission process.
CorpusData$import_xml()-method
CorpusData$add_corpus_positions() (helper function .fn)
download.file() by install_corpus(), if argument tarball is specified. This is a precondition for passing arguments to download password-protected corpora.