Nothing
- split() sets type of subcorpus to NA, causing an error if another split is performed. Fixed.
- split() threw a misleading error message if the s_attribute does not exist. The error message is now telling #242.
- split() was not implemented if the s_attribute was a child. Done #243.
- size() for corpus objects for scenario of nested s-attributes addressed #231.
- enrich() for subcorpus_bundle objects (returning partition_bundle now).
- subset() implemented for subcorpus_bundle objects #234.
- setAs()-method from slice to "AnnotatedPlainTextDocument" that would prevent using GERMAPARLMINI as sample data.
- decode() can return 'AnnotatedPlainTextDocument' from NLP package.
- as(x, "AnnotatedPlainTextDocument") not available any more.
- decode() has new argument "stoplist" to drop terms from 'AnnotatedPlainTextDocument'. Unused for other return values.
- get_template(), examples added.
- show()-method for corpus objects indicates whether a template is available.
- as.markdown() that would prevent fulltext display for non-parliamentary-protocol documents.
- tooltips() has new argument fmt to provide flexibility to assign tooltips based on corpus positions.
- href() to add hypertext references to fulltext output.
- read() has new argument annotation to get values for arguments highlight, tooltips and href from a subcorpus object.
- format() method used internally to produce output does not drop s-attributes ending on "_id" any more #253.
- progress is FALSE for the hits() method for character class objects, as a matter of consistency #252.
- split()
for corpus objects.
- decode() method for subcorpus objects is now able to process nested corpora. Performance gain for all scenarios.
- as.TermDocumentMatrix() for bundle objects speeds up instantiation of simple_triplet_matrix.
- s_attributes() for bundle objects is implemented much more efficiently.
- get_token_stream() and ngrams() have new argument vocab to pass in an alternative dictionary. Envisaged usage is to efficiently use a pruned vocabulary for decoding the token stream.
- ngrams() for list objects. Serves as worker for ngrams()-method for partition_bundle objects.
- corpus().
- get_token_stream() for numeric input has new argument registry to optionally specify registry directory.
- count() for subcorpus objects did not pass value of argument verbose to cpos(), resulting in potentially unwanted verbosity. Fixed.
- subcorpus using subset()-method kept strucs for nested attributes but assigned ancestor s-attribute to slot "s_attribute_strucs", resulting in false counts, for example. Fixed.
- split()-method for subcorpus objects was not implemented correctly for descendant attributes without values, so that getting subcorpora with sentences in a subcorpus would give a wrong result. Fixed.
- values of method split() for corpus objects did not process value FALSE to split corpus by s-attribute without values #263. Fixed.
- s_attributes()
-method for context objects. Returns s-attribute values for the matches for query in context object.
- hits() has new argument decode. If FALSE, the strucs are not decoded.
- s_attributes()-method for expression assigns types of vectors matched against as names if possible.
- subset for corpus and subcorpus objects will use integer struc values for subsetting, if integer values are passed in logical expression.
- enrich()-method for partition_bundle objects #225.
- as.TermDocumentMatrix() for partition_bundle and bundle, to improve performance.
- partition_bundle()-method for partition objects (more efficient instantiation of S4 objects).
- split()-method for subcorpus objects.
- store() and mail() have finally been removed from the package.
- sample() method for bundle objects (and objects inheriting from the bundle class) did not yet use the new convention to use single square brackets (not double brackets) for extracting a subset from the bundle. Fixed #236.
- ngrams() method for partition_bundle objects, introducing more efficient data handling, vectorization and parallelization.
- get_token_stream() for partition_bundle failed if all docs have equal length (mapply() issue). Fixed.
- as.DocumentTermMatrix() significantly improved for handling large corpora.
- $-method for corpus is now used for accessing corpus properties, replacing previous usage to inspect s-attributes.
- partition_bundle()
-method for context class objects now has improved verbosity and telling progress messages.
- capitalize() for uppercasing first letter of elements in a character vector.
- trim()-method for classes DocumentTermMatrix and TermDocumentMatrix has been updated. Arguments termsToKeep and docsToDrop have been deprecated, argument termsToDrop is deprecated and replaced by terms_to_drop, and docsToKeep is deprecated and replaced by docs_to_keep. New arguments min_count and min_doc_length are introduced to drop rare terms and short documents, respectively. The purpose of redesigning the trim()-method is to make it more useful for preparing matrices for topic modelling.
- subset() for corpus and subcorpus objects will now process indication of s-attribute without value, so that subsetting corpora for s-attributes without values is now possible.
- split() for subcorpus objects will now also work if s_attribute for splitting is not a sibling of the s-attribute the subcorpus is based on.
- as.speeches() for subcorpus objects refactored to work with nested scenario.
- s_attributes() will return NA if s-attribute does not have values #234.
- hits()-method for partition_bundle objects passes argument p_attribute to cpos() #239.
- use() returns TRUE if loading corpus in package was successful, or FALSE if not. Previously, the function aborted with an error, or returned NULL.
- subset() would lose specific subcorpus class (such as "plpr_subcorpus"). Fixed.
- html()
for subcorpus reconstructs meta equivalent to read() for subcorpus objects.
- corpus class throughout is an opportunity to keep the corpus ID together with the registry directory of a corpus. And as we are able now to handle corpora defined in different registry files, the temporary registry directory is not necessary any more. It still exists, yet only for temporary corpora and corpora that are described by registry files that cannot be modified, i.e. corpora shipped in packages. The test corpus of the polmineR package is an important example of this scenario.
- get_token_stream() now has an argument min_length.
- registry_*() functions are superseded by RcppCWB::corpus_* functions and throw a warning that they are deprecated.
- use(pkg = "RcppCWB", corpus = "REUTERS") to make the REUTERS corpus available.
- size() works for partition/subcorpus with s-attribute that is a child of the s-attribute the object is based on #216.
- trim()-method for context objects has a new argument fn for supplying a (trimming) function to be applied to all match contexts.
- s_attribute_date is stated explicitly in all examples.
- size() has been refactored to work with nested corpora.
- encoding() and replace method encoding<- are defined for call and quosure objects to get and adjust the encoding, replacing a previously unexported function .recode_call().
- subset() methods for corpus and subcorpus objects now handle expressions for subsetting as quosures, laying the ground to program against subset(), see respective update of the examples, #212.
- bundle objects with single square brackets is developed now. Indexing with double brackets, supplying multiple values for i, is deprecated. The aim is a consistent behavior that a bundle indexed by [ will always return a bundle, and indexing with [[ always gets a single object from the list of objects. #214
- use()
function now has an additional argument corpus to specify which corpus from a package shall be loaded (#138).
- get_token_stream()-method for partition_bundle objects is more memory efficient (no exhaustion for big corpora) and faster.
- split()-method for corpus objects.
- split()-method for corpus objects offers progress bar.
- as.speeches() for corpus objects has new argument subset, offering a significantly faster approach than the method for subcorpus objects in many cases.
- size() method will return NA and issue a telling warning if the slots corpus and registry_dir of the corpus object are not filled #222.
- get_token_stream() will return list of integer values if decode is TRUE (#213).
- trim() on a context object using arguments positivelist or negativelist, the count slot as reported by length was not updated. Fixed. (#220)
- enrich() method for context objects has a new argument stat for creating / updating the data.table in the slot stat.
- subset() for subcorpus objects has been debugged to work with nested corpora.
- polmineR.mdsub configures substitutions that are applied on markdown documents to prevent presence of characters that would be misinterpreted as formatting instructions. Fixes #166.
- check_cqp_query() now includes a hint that argument check can be used to omit checking the CQP syntax to prevent false positives. Addresses #171.
- cooccurrences()
(and context()) to process more than one p-attribute has been lost temporarily. Fixed. #208.
- hits() method for partition objects #215.
- trim() on a context object using arguments positivelist or negativelist, the count statistics reported in the stat slot were not updated. Fixed. (#220)
- kwic object #218.
- subset() would not work reliably with argument regex if more than one expression is passed #212. Fixed.
- terms() did not work for subcorpus objects. Fixed. #209
- as.speeches() on a subcorpus, the date may have been missing from the object names. Fixed. #219
- minNchar in the noise() method would work exactly opposite to the way intended #211.
- registry_dir of a cooccurrences_bundle derived from a partition_bundle was not filled, resulting in an error of the show()-method for the cooccurrences_bundle. Fixed #222.
- cooccurrences() method now includes example code for creating a table using DT::datatable() with buttons for exporting tables (to Excel, for instance).
- dispersion() method now accepts an argument fill, a logical value to explicitly control whether zero matches for a value of a structural attribute should be reported (#160). The performance of adding columns (required only if two structural attributes are provided) is improved substantially by using the reference semantics of the data.table package. If many columns are added at once, a warning issued by the data.table package is supplemented by a further explanatory warning of the polmineR package. Filling up the data.table was limited previously to freq = FALSE, this limitation is lifted.
- html()
method is implemented for remote_subcorpus objects.
- hits() method is implemented for remote_corpus and remote_subcorpus class (#160).
- ranges is introduced to manage ranges of corpus positions for query matches. This is a preparatory step to remove an inconsistency from the hits class that mixed two very different usages (getting ranges of corpus positions for matches and getting counts).
- ranges serves as the constructor to prepare a ranges class object. In combination with as.data.table(), it replaces former functionality of hits() without argument s_attribute.
- hits() method is altered, making it much more consistent than previously: The method will consistently return a hits object.
- hits() has a new argument fill that will report zeros for combinations of s-attributes with no matches for a query.
- subset for the subset method for remote_corpus objects can now be a call (#162), this is a basis for passing vectors to OpenCPU server.
- p_attributes()
implemented for remote_corpus and remote_partition.
- regions() method (for corpus class objects to start with) returns a regions class object with a regions matrix (slot cpos) with regions for an s-attribute (#176).
- get_token_stream()-method for regions and matrix objects will now accept a logical argument split. If TRUE, a list of character vectors is returned. The envisaged use case is a fast decoding of sentences (#176).
- encoding() method has been defined if argument object is missing. Calling encoding() will return the session character set. If it cannot be determined using localeToCharset(), a UTF-8 session charset will be assumed. Internally, encoding() replaces a direct call of localeToCharset() to avoid errors that have occurred on GitHub Actions with Ubuntu 20.04 (#188).
- localeToCharset() (NA return value), a startup message will issue a warning that 'UTF-8' is assumed (#188).
- size() method is now able to handle nested s-attributes.
- trim() method for context objects will now accept a matrix with ranges as positivelist argument.
- highlight() method now accepts matrix objects as elements of the list of items to be highlighted. A matrix is treated as a set of regions, such as resulting from cpos(). Thus it is possible to highlight matches for CQP queries.
- context() method.
- count()-method for partition_bundle objects failed with an opaque error message if there were no query matches at all. There is now a check for this scenario and the expected table is returned (zero values throughout).
- corpus
class is now a superclass for the textstat class, starting to create a more coherent class structure in general. This is an important preparatory step to be able to keep all registry files in the temporary registry directory. To avoid confusion in the class system resulting from the coerce method from partition to corpus objects, this coerce method (defined by setAs()) has been removed. The get_template()-method for partition objects using this coerce method has been removed: as it inherits the method anyway, it is not needed any more. See #201.
- region) and to consider the changing value of an s-attribute as a boundary of a context (argument boundary). New menu "boundary" and radio buttons, conditional on presence of s-attributes "s" and/or "p".
- sAttribute or pAttribute (instead of s_attribute and p_attribute) are still used with dispersion() method, a warning is issued declaring that the argument is deprecated.
- .onDetach() to .onUnload() (#164).
- as.phrases() method (#172).
- as.corpusEnc() auxiliary function will now check whether non-convertible characters lead to an NA result and issue a warning explaining how this can be avoided (#151).
- context() method for matrix objects if arguments left and right are named integer vectors. All context() methods benefit from the improved performance of this worker for creating contexts for query matches.
- context
object.
- enrich() method for context objects will now perform an in-place operation when adding new s-attributes.
- as.cqp() function includes arguments check and warn for running check_cqp_query() on queries.
- context() method for matrix objects includes a new argument boundary and relies on a new function RcppCWB::region_matrix_context().
- verbose of context()-methods is now FALSE.
- as.corpusEnc() auxiliary function now includes a test whether input character vector includes unexpected encodings and issues a warning if this is the case.
- cpos() method will now check for accidental leading and/or trailing whitespace and remove it for token lookup. Note that hits(), count() and dispersion() will report queries without removing whitespace.
- count()-method for partition_bundle objects will be much more efficient when many columns with zero matches need to be added. The implementation avoids a data.table warning when the bulk action of adding new columns exceeds the number of columns reserved by data.table objects.
- trim() is removed (#197).
- encoding() relies on l10n_info() before using localeToCharset() as a matter of performance and robustness (#196).
- corpus has a new slot registry_dir. This is a preparatory step that will facilitate managing corpora described by registry files in different registry directories.
- corpus() for corpus-class objects has an argument registry_dir that will be required to distinguish corpora described by registry files in different registry directories.
- fs_path
classes.
- registry_get_home() and registry_get_encoding() have been replaced by RcppCWB functions cl_charset_name() and corpus_data_dir() with equivalent result, but faster due to immediate access to C representation of the corpus.
- corpus() method will deduce the registry directory from the C representation of the corpus if possible.
- as.markdown() has been removed, making fulltext display (using read() or html()) much faster.
- corpus() without any arguments now returns an expanded data.frame reporting all slots of the corpus class objects, skipping only the data directory of the corpus.
- cpos() method for matrix objects that turns a matrix with corpus positions into a vector of integer values now relies on a C-level implementation newly included in the RcppCWB package, which is significantly faster than the best possible implementation in R.
- kwic() shows row numbers, which is convenient when referring to specific rows (#184).
- as.cqp() now checks whether argument query meets the expectation that it is a query (#191).
- make_region_matrix(), which has been used internally only, has been removed. RcppCWB::s_attr_regions() replaces the functionality.
- as.speeches() method had not yet been implemented for nested corpora. A limited rewrite makes this work now (#198).
- get_token_stream() method for partition_bundle objects have been addressed: Multiple p-attributes can be used without providing phrases at the same time (#142) and using the subset argument does not depend on using phrases either (#141).
- as.sparseMatrix() method is now also defined for DocumentTermMatrix objects (was available previously only for TermDocumentMatrix objects).
- hits() method (#195).
- get_type() for subcorpus_bundle returns NULL if no type is defined as a matter of consistency (#169).
- corpus/subcorpus includes invalid s-attributes, the warning is telling and NULL is returned (#179).
- cooccurrences()
method - left/right rather than window (#134).
- kwic and context now have argument region as an intuitive alternative to named character vectors left and right when expanding match to left and right limitation of an s-attribute.
- deparse() within is resolved (#161).
- hits() method for the slice virtual class has been removed and the implementation of hits for the subcorpus class is now the real worker, also invoked for hits() for partition. This removes a bug that occurred when applying hits on subcorpus objects, which resulted in a count for the whole corpus.
- show()-method for partition objects resolved when more than one s-attribute has been used to define partition (#170).
- left and right of the context()-method for matrix objects, the worker behind the context(), kwic() and cooccurrences() methods did not work as intended for character values specifying an s-attribute. Fixed: it is now possible to use these arguments (#173).
- as.TermDocumentMatrix() or as.DocumentTermMatrix() when an s-attribute would not cover the entire corpus has been removed (#177). In this vein, an inefficiency (decoding token stream twice) has been removed, so performance will also be better.
- subset() for remote_corpus objects (#181) has been fixed.
- context()
method, and kwic() for partition or subcorpus objects did not process left and right contexts correctly, if it was a named character vector. Fixed.
- hits() method failed for partition_bundle objects when there were no matches for the query. Fixed. (#199 and #163)
- p_attributes() method for slice objects had an error when decoding the token stream. Fixed.
- format() on a features_ngrams object resulting in an error when using knit_print() on this object has been fixed (#200).
- edit() method can now be invoked on a features object (#165).
- context()-method for partition_bundle objects always required an explicit statement of the argument positivelist, which is not necessary. Fixed. (#178)
- kwic() method is gone as a result of refactoring how the s-attribute is matched (#149). The argument progress has been removed from the method.
- as.DocumentTermMatrix() method mistakenly returned a TermDocumentMatrix object. Fixed (#146).
- noise() method misleadingly handled the number of characters provided by minNchar as a maximum threshold, not as a minimum requirement (#135). Fixed.
- hits class now describes the data.table in the stat slot of the class in detail.
- decode()
method for data.table objects shall serve as a more user-friendly access to the efficiency of the RcppCWB::cl_cpos2str() function.
- data.frame returned when calling corpus() will now include a column with the encoding of the corpus.
- warn argument of the get_template()-method remained unused, resulting in a warning message even if warn was FALSE, and thus in a set of warning messages when calling corpus(). The argument is used as intended now and defaults to FALSE.
- as.markdown()-method for subcorpus objects now uses an (internal) default template accessible via polmineR:::default_template, if no template is defined for a corpus.
- registry_get_encoding() function returned a length-one character vector if the regular expression to extract the charset corpus property did not yield a match. To prevent errors, it now returns "latin1" as the CWB standard encoding (#159).
- knit_print()-method for textstat objects does not accept the three dots argument any more. As an installation of pandoc is necessary to include resulting htmlwidget in an html document, the method will check now whether pandoc is available. If not, a formatted data.table is returned.
- knit_print()-method for kwic objects does not have the pagelength argument any more as it has been unused. The pagelength is controlled by the option polmineR.pagelength. Internally, the method will call the method for the textstat superclass of the kwic class, which is newly robust against a missing installation of pandoc.
- chisquare() method needs to increase the number of digits temporarily, but failed to revert to the original value as expected. One implication was that rounding the values in data.table objects would fail, and rounding in general yielded very strange results (#155). Fixed.
- as.data.table()-method defined in the data.table package is now reexported, and defined and documented for the textstat, regions and bundle classes, so that it can be used cleanly.
- .importPolMineCorpus()-function has been superseded by cwbtools::corpus_install() and has been removed from the package.
- cat()
has been replaced by message() within functions throughout to meet CRAN requirements.
- type has been dropped from the html()-method for partition_bundle objects.
- html()-method for character class objects now serves as a worker to generate html from markdown. The html()-method for partition_bundle objects did not return a html class object as stated in the documentation object. Fixed.
- store()-method has been declared defunct as it is unnecessary functionality that bloats the package. Using format() in combination with openxlsx::write.xlsx() is the recommended alternative workflow.
- mail()-method has been declared defunct and has been removed from the package. A more user-friendly workflow is to use export buttons of the DataTable widgets.
- Corpus class has been removed from the package as it has been defunct for a while.
- set_template() method on options that may be unnoticed for the user and that potentially violate CRAN policies, the method has been dropped.
- s_attributes()-method returned a data.table mixing up rows / columns for subcorpora/partitions with a region matrix that would only include a single set of corpus
- decode()-method now entails the possibility to decode structural and positional attributes selectively, via new arguments p_attributes and s_attributes (#116). Internally, the reliance on coerce()-methods has been replaced by a simpler if-else-syntax. The as(from, "Annotation") option persists, however.
- phrases was added to the count()-method for partition_bundle objects.
- remote_corpus
and the remote_subcorpus class are replaced by a single slot restricted (values TRUE/FALSE) to indicate if a user name and a password are necessary to access a corpus. A file following the conventions of CWB files is assumed to include the credentials for corpus access. This approach avoids exposing the password.
- https://hub.docker.com/r/polmine/debian_polminer_min).
- corpus()-method that serves as a constructor either for the corpus or the remote_corpus class does not flag default values for the arguments user and password any more. If the argument server is stated explicitly (not NULL, default), these variables will get the value character(). This way, a set of if/else statements can be omitted and it is much easier to implement methods for the remote_corpus class for corpora that are password-protected, or not.
- as.list.bundle()-method (previously, there has only been the S4 method). The nice consequence is that lapply() and sapply() can be used on bundle objects now (a subcorpus_bundle, for instance).
- count()-method for partition_bundle objects has been improved, it is twice as fast now (#137).
- p_attributes method now accepts an argument decode.
- p_attributes-method has been implemented for partition_bundle objects.
- polmineR(), the mail-button has been dropped in the kwic, and code can be displayed (using code highlighting).
- phrases
argument is used are now also available when a phrases object is not passed in.
- get_token_stream()-method for partition_bundle objects will now accept an argument phrases (#128).
- merge()-method for partition_bundle-objects has been reworked: Substantial performance improvement by relying on RcppCWB::get_region_matrix. Internally, the method performs a check whether the partition/subcorpus objects to be merged are non-overlapping. The default value for the argument verbose is now FALSE, as waiting time is much shorter.
- polmineR.warn.size can be used to control the issuing of warnings for large kwic objects.
- Cooccurrences objects had not been possible, now at least using integer indices is possible (#114).
- count()-method for slice class objects.
- corpus() method for a character vector will now abort gracefully with a message if more than one corpus is offered as .Object.
- Cooccurrences()-method will now accept zero values (0) for the arguments left and right. Relevant for detecting bigrams / phrases.
- data.table of a Cooccurrences object, the NA values are pushed to the end of the table now.
- concatenate() method is a worker to collapse tokens into phrases.
- Cooccurrences class objects, see pmi()-method.
- ngrams()-method for class data.table - useful if you need to work with decoded corpora.
- pmi()-method for the ngrams()-method, to provide a workflow for phrase detection.
- enrich() for object of class Cooccurrences will add columns with counts for the co-occurring tokens to the data.table in the slot 'stat'.
- data.table in the stat slot of an ngrams object: Column names will now be "word_1", "word_2" etc.
- count()
for subcorpus_bundle objects (just calling callNextMethod() internally) - useful to see the availability of the method in the documentation object.
- as.speeches()-method for corpus objects now supports parallelization.
- DocumentTermMatrix against each other, as a safeguard that different approaches might lead to different results (#139).
- phrases and as.phrases()-method for ngrams and matrix objects. The count()-method now accepts an argument phrases. See the documentation (?phrases).
- s_attributes()-method is now consistent with the usage of the unique argument (#133).
- hits()-method for partition_bundle objects now accepts an argument s_attribute to include metadata in results (#74).
- check_cqp_query() function now has a further argument warn. If TRUE (default), a warning is issued if the query is buggy. The as.phrases()-method will use the function to avoid generating buggy CQP queries.
- Corpus class has been re-introduced (temporarily), to avoid an issue with the GermaParl package if the class is not available (#127).
- get_template()-method is now defined for the corpus class.
- count()-method with arguments breakdown is TRUE and cqp is TRUE has been awfully slow. Fast now.
- boost allows user to opt for the improvement, which will involve decoding the lexicon directly.
- merge()-method is implemented for subcorpus_bundle objects now, and has been implemented for subcorpus objects (#76).
- kwic view from a cooccurrences object based on more than one p-attribute will work now (#119).
- decode()-method has been defined for integer vectors. Internally it will decide whether decoding token ids is sped up by reading in the lexicon file directly. The behavior can be triggered explicitly by setting the argument boost as TRUE.
- get_token_stream()-method will use the new decode()-method for integer values internally. The argument boost is used by get_token_stream() to control the approach.
- get_token_stream
for partition_bundle.
- partition_bundle()-methods defined for character, corpus and partition objects now call the split()-methods for corpus and subcorpus objects, resulting in a huge performance gain (#112).
- Cooccurrences()-method (#117).
- corpus class includes a (new) slot size, just as the regions and the subcorpus classes.
- split()-method for corpus objects now accepts the argument xml, to indicate whether the annotation structure of the corpus is flat or nested.
- partition now includes a prototype defining default values for the slots 'stat' (a data.table) and the slot 'size' (NA_integer_). This avoids that an incomplete initialization of a partition object will result in an error.
- kwic()-method is now available for partition_bundle/subcorpus_bundle-objects (#73).
- kwic()-method work correctly for partition objects that result from a merge() operation, the cpos()-method for slice objects will extract strucs based on the s-attribute defined in the slot s_attr_strucs rather than the last s-attribute in the list of the slot s-attributes.
- subcorpus is exported for usage in other packages.
- progress of the count()-method for partition_bundle objects is now FALSE.
- get_type()-method is now defined for the corpus class.
- corpus object into a subcorpus object, to recover functionality used (internally) that relied on the former Corpus reference class.
- Cooccurrences()-method is now defined for the corpus-class, too. The Cooccurrences()-method for the character class now relies on this method.
- Corpus reference class has been dropped from the code altogether: As roxygen2::roxygenize() started to check the documentation of R6 classes and reference classes, the poor documentation of this class started to provoke many errors. Rather than starting to write documentation for a deprecated class, getting rid of an outdated and poorly documented class appeared to be the better solution.
- kwic
object from a cooccurrences object. Introduced to serve as a basis for quantitative/qualitative workflows, e.g. integrated in a flexdashboard.
- s_attributes() method for corpus objects when values are requested for an s-attribute that does not exist (#122).
- decode()-method for subcorpus objects, s-attributes were not decoded appropriately (#120). Fixed. When decoding a corpus/subcorpus, the struc column is kept (again).
- .onLoad() whether polmineR is loaded from the repository directory will ensure that temporary registry files will not be gone when calling devtools::document() (#68).
- as.speeches()-method for corpus objects, setting progress as FALSE did not suppress the display of a progress bar. Solved.
- subcorpus_bundle that resulted from CQP queries being turned into invalid column names.
- partition_bundle was an empty string and calling count() on this object has been removed (#121).
- ?polmineR)
- corpus class has been put in a shape to become the default point of departure of most workflows. All core methods are now available for the corpus class, and have been implemented newly if necessary, e.g. show() and size()-method. The constructor method for a corpus object, the corpus() method, will now check whether the character vector with the corpus ID refers to an available corpus, whether all letters are upper case and issue informative warnings and error messages.
- s_attributes()-method for corpus objects has been reworked: It will decode binary files directly, without reliance on the corpus library functions, which is significantly faster.
- Corpus reference class is now obsolete after the introduction of the S4 corpus class. To maintain the functionality not covered otherwise, new generics get_info and show_info have been introduced and defined for the corpus class.
- subcorpus
class have been expanded so that this
class can supersede the partition
class: Methods newly available are
cpos()
, count()
, p_attributes()
, s_attributes()
get_token_stream()
,
and size()
. Technically, there is a virtual slice
-class, from which
subcorpus
inherits (methods called via callNextMethod()
). subset()
-method for the corpus
and subcorpus
classes to generate subcorpora
(i.e. subcorpus
objects) has been introduced. It outperforms the
partition()
method. The subset()
-method for corpus
and subcorpus
objects
will be the default way to work with non-standard evaluation in a manner that
feels "R-ish" (#40).zoom()
-method that has been introduced experimentally has
been dropped again in favor of the subset()
-method to get subcorpus
objects
from corpus
and subcorpus
objects. A set of experimental methods for an
initial check of the feasibility of a non-standard evaluation approach to
the generation of subcorpora has been dropped (methods $
, ==
, !=
,
zoom
for corpus
-class). partition
class (inheriting from
the textstat
class) to the subcorpus
class (inheriting from the textstat
class), there is a new coerce()
-method to turn a partition
object into
a subcorpus
object.remote_corpus
-class is the basis for accessing remote
corpora. A remote_subcorpus
can be derived from a remote_corpus
. Methods
available for remote corpora and subcorpora remain limited at this stage.
subcorpus_bundle
class now inherits from partition_bundle
. This is not
intended to be a long-term solution, but facilitates the implementation of new
workflows based on the subcorpus
class rather than the partition
class.polmineR
did not have safeguards if
the suggested packages shiny and shinythemes were not installed. Now
there will be a conditional installation of the packages required for running
the shiny app.CorpusOrSubcorpus
has been removed. The ngrams
-method
now applies for corpus
and subcorpus
objects.label()
-method, present for a while, is superseded by an edit()
-method now.
It will call a shiny gadget either using DataTables or Handsontable. The former
Labels
reference class has been turned into an S4 class, because the
desired reference logic can also be achieved with a data.table
in a slot of
the labels class.table
-slot of the kwic
class has been renamed as stat
slot (a data.table
),
so that the kwic
class can now inherit from the textstat
class. The
enrich()
-method for objects of class kwic
now includes a new argument
extra
that will add extra tokens to the left of the windows for concordances so
that qualitative inspections for query hits can work with more context.as.TermDocumentMatrix()
and the as.DocumentTermMatrix()
-methods are now
also defined for kwic
objects. They work exactly the same as for the context
class. To avoid having to write new methods, a new neighborhood
virtual class has
been introduced. The aforementioned methods are defined for the virtual class and
are available for context and kwic class objects.get_token_stream()
for a partition_bundle
object.Cooccurrences()
-method is now available for subcorpus
-objects (#88).kwic
-object into a context
-object.
The neighborhood
virtual class could be discarded again, and a bug could be removed
that left an enrich()
-operation for kwic
objects (argument p_attribute
)
ineffectual (#103).cpos
to FALSE
in the kwic()
-method
has been solved (#106), and the documentation of the argument has been rewritten so that it
includes a warning against using the argument incorrectly.
use()
(#72).regex
to the cpos()
-method (for corpus
objects), which
will interpret argument query
as a regular expression. This may be faster than
taking query
as an outright CQP query.dispersion
-method (#92).p_attribute
and positivelist
by default.format()
-method is used to create proper output in the cooccurrences of the
shiny app.registry()
-function.ll
-method had been somewhat mixed up, which is repaired
now. Tokens with NA values for the ll-test will show up at the end of the table.registry_move()
-function, used only internally at this stage, is exported now
so that it can be used by other packages.the get_token_stream()
-method for regions
objects was a
data.table
. The behavior is now in line with the other get_token_stream()
methods.
tempcorpus()
-method and the tempcorpus
class have been removed from the package,
having become utterly deprecated.summary()
-method for partition
-class objects has been turned into a method
for the count
-class, to eliminate an inconsistency. The example of a workflow has been
moved to the documentation object for the count
-class.browse()
-method has not proven to be useful and has been removed from the package.
A new browse()
-function is introduced to throw a warning, if browse should be
called nevertheless.split()
-method for partition
-objects improved the readability
of the code, but the performance gain is minimal.kwic_bundle
-class has been introduced, a list of kwic
objects can be turned
into this new class using as.bundle
.context()
-method will now take again as input character vectors for the arguments
left
and right
to expand to the left and right boundaries of the designated
region (#87).kwic()
-method. This ensures that subsequent highlighting operations can assign
new colors (#38).dispersion()
that results are reported for all
values of structural attributes, including those with zero matches (#104).
cpos
-method for matrix
which unfolds a matrix with regions
of corpus positions, useful for operations that require many calls.count
-method for partition_bundle
has been reworked and is much faster and more
memory efficient. as.TermDocumentMatrix()
for partition_bundle
optimized to work efficiently
with large corpora.as.corpusEnc()
-function uses the localeToCharset()
-function from the utils
package to determine the charset of input strings. On RStudio Server, we have seen
cases when the return value is NA. Then it will be assumed that the locale is UTF-8.context()
/kwic()
method that led to superfluous words in the
right context.as.data.frame()
-method for kwic
-objects
when no metadata were added.count()
-method for partition_bundle
-objects did not perform iconv()
if
necessary - this has been corrected.kwic
object did not reduce the cpos
table
accordingly. This has been corrected.
as.speeches()
-method failed to handle situations in which a speaker
occurring in the corpus contributed only a single region to the entire corpus (#86).
This behavior has been fixed.
partition_bundle
started to throw a warning that an argument arrives
at the cpos()
-method that is not used. The cause for the warning message is removed,
an additional unit test has been introduced to recognize issues with the
count
-method (#90).kwic()
-method threw an error when trimming the matches by using a positivelist
or a stoplist resulted in no remaining matches. The method will now return a NULL
object and keep issuing a warning if no matches remain after filtering (#91).subcorpus
object, resulting in false results when counting over
subcorpora. Fixed.dispersion()
(#62).as.speeches()
-method, the argument verbose
was not used (#64) - this had
been addressed when solving issue #86.subcorpus
into a String
was removed:
A semicolon was not recognized as a punctuation mark. This makes decoding subcorpora
as Annotation
more robust. The respective unit test has been updated.read()
on a kwic
object works again (#84).as.VCorpus()
method that failed are now ok (#77). The reason was
that get_token_stream()
assumed implicitly that a p-attribute "pos" is present,
which is not the case for the REUTERS test corpus.s_attributes
-method was removed that would make retrieving the
metadata for the first struc (index 0) of an s-attribute impossible.
as.DocumentTermMatrix
that started to occur with the introduction
of the subcorpus_bundle
class (#100).kwic
-method for character
that prevented using different values for
right and left context (#101).as.DocumentTermMatrix()
on a corpus stated
by corpus ID / length-one character vector (#105).markdown::markdownToHTML
by a direct
call to markdown::renderMarkdown
. On this occasion, some overhead preparing
fulltext output has been removed.kwic
objects has
been removed (#102).as.TermDocumentMatrix()
-method for neighborhood
-objects returned a
DocumentTermMatrix (unintentionally); this bug has been removed.
pmi()
-method and t_test()
-method.s_attributes()
-method for corpus
-class.corpus
-class has been rewritten entirely, and the
documentation for the remote_corpus
-class has been integrated, whereas methods
applicable to the remote_corpus
-class were integrated into the documentation
objects for the respective methods.get_token_stream()
-method has been reworked and expanded
thoroughly (#65). On this occasion, test coverage for the method has been improved
significantly. (Everything is tested now apart from parallelization.)
Cooccurrences()
-method and a Cooccurrences
-class have been migrated from the (experimental) polmineR.graph package to polmineR to generate and manage all cooccurrences in a corpus/partition
. A cooccurrences()
-method produces a subset of Cooccurrences
-class object and is the basis for ensuring that results are identical.data_dir()
will return this temporary data directory. The use()
-function will now check for non-ASCII characters in the path to binary corpus data and move the corpus data to the temporary data directory (a subdirectory of the directory returned by data_dir()
), if necessary. An argument tmp
added to use()
will force using a temporary directory. The temporary files are removed when the package is detached. zoom()
-method. See documentation for (new) corpus
-class (?"corpus-class"
) and extended documentation for partition
-class (?"partition-class"
). A new corpus()
-method for character vectors serves as a constructor. This is the beginning of a rearrangement of the class structure: The regions
-class now inherits from the new corpus
-class, and a new subcorpus
-class inherits from the regions
-class.check_cqp_query()
offers a preliminary check whether a CQP query may be faulty. It is used by the cpos()
-method, if the new argument check
is TRUE. All higher-level functions calling cpos()
also include this new argument. Faulty queries may still cause a crash of the R session, but the most common source is hopefully prevented now.
format()
-method is defined for textstat
, cooccurrences
, and features
, moving the formatting of tables out of the view()
, and print()
-methods. This will be useful when including tables in R Markdown documents.highlight()
-method for character
and html
objects now has the arguments regex
and perl
, so that regular expressions can be used for highlighting (#99).as.data.frame()
-method for kwic
-objects has seen a small performance improvement, and is more robust now if the order of columns changes unexpectedly.registry()
and data_dir()
now accept an argument pkg
. The functions will return the path to the registry directory / the data directory within a package, if the argument is used.data.table
-package used to be imported entirely; now the package is imported selectively. To avoid namespace conflicts, the former S4 method as.data.table()
is now an S3 method. Warnings that appeared if the data.table
package is loaded after polmineR are now omitted.
coerce()
-methods to turn textstat
, cooccurrences
, features
and kwic
objects into htmlwidgets now set a pageLength
.partition_bundle
objects: [[<-
, $
, $<-
textstat
objects.p_attribute
has been added to the kwic
-class; kwic()
-methods and methods to process kwic
-objects are now able to use the attribute thus indicated, and not just the p-attribute "word".size()
-method for context
-objects will return the size of the corpus of interest (coi) and the reference corpus (ref).encoding()
-method for character vector.name()
-method for character vector.count()
-method for context
-objects will return the data.table
in the stat
-slot with the counts for the tokens in the window.decode()
-function replaces a decode()
-method and can be applied to partitions. The return value is a data.table
which can be coerced to a tibble
, serving as an interface to tidytext (#37).ngrams()
-method will work for corpora, and a new show()
-method for textstat
-object generates a proper output (#27).tempdir()
is wrapped into normalizePath(..., winslash = "/"), to avoid mixture of file separators in a path, which may cause problems on Windows systems.kwic()
-method for corpora returned one surplus token to the left and to the right of the query. The excess tokens are now removed.
kwic()
-method for character
-objects did not include the correct position of matches in the cpos
slot. Corrected.partition_bundle
using the as.speeches()
-method, an error could occur when an empty partition had been generated accidentally. This has been fixed (#50).
as.VCorpus()
-method is not available if the tm
-package has been loaded previously. A coerce method (as(OBJECT, "VCorpus")) solves the issue. The
as.VCorpus()-method is still around, but serves as a wrapper for the formal coerce-method (#55).
verbose
as used by the use()
-method did not have any effect. Now, as would be expected, messages are not reported if verbose
is FALSE
. On this occasion, we took care that activated corpora are now reported in capital letters, which is consistent with the uppercase logic you need to follow when using corpora (#47).
context()
-method would occur at the very beginning or very end of a corpus and the window would transgress the beginning / end of the corpus without being checked (#44).
as.speeches()
-function caused an error when the type of the partition was not defined. Solved (#57).TermDocumentMatrix
from a partition_bundle
if the partitions in the partition_bundle
were not named. The fix is to assign integer numbers as names to the partitions (#58).ll()
, and chisquare()
-methods to make the statistical procedure used transparent.cooccurrences()
-method to explain subsetting results vs applying positivelist/negativelist (#28).round()
-method for textstat
-objects that will show up in documentation of textstat
class.mail()
-method (#31).decode()
-function, using the REUTERS corpus replaces the usage
of the GERMAPARLMINI corpus, to reduce time consumed when checking the package.weigh()
-method has been implemented for the classes count
and count_bundle
. Via inheritance, it will also be available for the partition
- and partition_bundle
-classes. Then, a new summary()
-method for partition
-class objects is introduced. If the object has been weighed, the list that is returned will include a report on weights. There is an example that explains the workflow.partition_bundle
-method for context
-objects has been reworked entirely (and is working again);
a new partition
-method for context
-objects has been introduced. Both steps are intended for workflows for dictionary-based sentiment analysis.
highlight()
-method is now implemented for class kwic
. You can highlight words in the neighborhood of a node that are part of a dictionary.
knit_print()
-method for textstat
- and kwic
-objects offers a seamless inclusion of analyses in Rmarkdown documents.coerce()
-method to turn a kwic
-object into an htmlwidget has been singled out from the show()
-method for kwic
-objects. Now it is possible to generate an htmlwidget from a kwic object, and to include the widget in an Rmarkdown document.
coerce()
-method to turn textstat
-objects into an htmlwidget (DataTable), very useful for Rmarkdown documents such as slides.html()
-method will allow defining a scroll box. Useful for embedding fulltext output in an Rmarkdown document.
partition_bundle
-class, rather than inheriting from bundle
-class directly, will now inherit from the count_bundle
-class.
use()
-function is limited now to activating the corpus in data packages. Having introduced the session registry, switching registry directories is not needed any more.as.regions()
-function has been turned into an as.regions()
-method to have a more generic tool.context
-method, so that full use of data.table
speeds up things.highlight()
-method allows definitions of terms to be highlighted to be passed in via three dots (...);
no explicit list necessary.as.character()
-method for kwic-class objects is introduced.size_coi
-slot (coi for corpus of interest) of the context
-object included the node; the node (i.e. matches for queries) is excluded now from the count of size_coi.use()
, the registry directory is reset for CQP, so that the corpora in the package that have been activated can be used with CQP syntax.s_attributes()
-method for partition
-objects: "fast track" was activated without preconditions.kwic
-output after highlighting.meta
has been renamed to s_attributes
for the kwic()
-method for context
-objects, and for the enrich()
-method for kwic
-objects.s_attribute
to check for integrity within
a struc has been renamed into boundary
.kwic
-objects has been reworked thoroughly.