NEWS.md
In trinker/termco: Counts of Terms and Substrings

NEWS

Releases will be numbered with the following semantic versioning format:

<major>.<minor>.<patch>

And constructed with the following guidelines:

Breaking backward compatibility bumps the major (and resets the minor and patch)
New additions without breaking backward compatibility bump the minor (and reset the patch)
Bug fixes and misc changes bump the patch
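
The bump rules above can be sketched in a few lines. This is only an illustration in Python; the function name `bump` and the change labels "breaking", "feature", and "fix" are assumptions made for the example and are not part of termco.

```python
def bump(version: str, change: str) -> str:
    """Return the next version string for a given kind of change.

    `change` is one of the hypothetical labels "breaking", "feature", "fix".
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "breaking":
        # Major bump resets minor and patch.
        return f"{major + 1}.0.0"
    if change == "feature":
        # Minor bump resets patch.
        return f"{major}.{minor + 1}.0"
    if change == "fix":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


print(bump("0.5.2", "feature"))
```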

BUG FIXES

ngram_collocations did not properly merge the quanteda outputs resulting in the length column being replicated multiple times. Additionally, length was integer whereas the other ngram measures are numeric resulting in a data.table warning in melt. Both of these issues have been addressed.
colo did not copy a single term to the clipboard with quotes. See issue #50.

NEW FEATURES

plot_upset added to enable exploration of overlapping intersections between term_count categories: http://caleydo.org/tools/upset.
get_text added to extract the original text associated with particular tags.
frequent_terms_co_occurrence added to view the co-occurrence between frequent terms. A combination of frequent_terms and tag_co_occurrence.
term_before, term_after, & term_first added to get frequencies of terms relative to other terms or specific locations.
token_count added to count the occurrence of tokens within a vector of strings. This function differs from term_count in that term_count is regex based, allowing for fuzzy matching. token_count only searches for lower cased tokens (words, number sequences, or punctuation), providing a well defined counting function that is faster than term_count but less flexible.
as_term_list added. This is a convenience function to convert a vector of terms or a quanteda dictionary into a named list.
combine_counts added to enable combining term_count and token_count objects.
match_word added to match words to regular expressions. Roughly equivalent to qdap's term_match.
read_term_list/write_term_list added to aid in the reading in/writing out and formatting of term list files.
classification_template added to manually add a classification script template. This template has a suggested termco based workflow that may be useful for classification projects.
test_regex added to test an atomic vector, list, or term list of regexes for validity.
mutate_counts added to apply a normalizing function to all the term columns of a term_count/token_count object without stripping the attributes and class.
drop_terms added to allow the user to explore/iterate on a term list and drop terms prior to term_count use without manually editing an external term list file.
tidy_counts added to convert a wide matrix of counts to tidy form (tags are stretched long-wise with corresponding counts of tags).
set_meta_tags added for setting the metatags attribute on a term_count/token_count object. This can also be controlled by separators in the term/token list passed to term_count/token_count.
select_counts added for safely selecting term_count/token_count object columns without stripping attributes. Works like dplyr::select.
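
The difference between regex-based counting (term_count) and exact token counting (token_count) can be illustrated with a small sketch. It is written in Python and is not termco's implementation; the helper names `regex_count` and `exact_token_count`, the tokenizer pattern, and the sample texts are all assumptions made for the example.

```python
import re

texts = ["The cat sat.", "Cats chase the CAT!", "No dogs here"]


def regex_count(texts, pattern):
    """Count regex matches per text; flexible, so 'cats?' also hits 'Cats'."""
    rx = re.compile(pattern, flags=re.IGNORECASE)
    return [len(rx.findall(text)) for text in texts]


def exact_token_count(texts, tokens):
    """Count exact lower-cased tokens (words, digit runs, punctuation) per text."""
    wanted = set(tokens)
    counts = []
    for text in texts:
        toks = re.findall(r"[a-z]+|[0-9]+|[^\sa-z0-9]", text.lower())
        counts.append(sum(tok in wanted for tok in toks))
    return counts


print(regex_count(texts, r"cats?"))        # fuzzy: plural forms count too
print(exact_token_count(texts, ["cat"]))   # exact: only the token "cat"
```

The exact version needs a single tokenization pass and set lookups, which is why a token-based counter can be faster, at the cost of missing variants such as plurals.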

MINOR FEATURES

important_terms picks up a plot method corresponding to the frequent_terms plot method.
term_count checks for duplicate categories within tiers for hierarchical term lists.
read_term_list checks for valid regex.

IMPROVEMENTS

validate_model now uses classify before validating to assign tags.
tag_co_occurrence used a grid + base plotting approach that required restarting the graphics device between plots. This dependency has been replaced with a dependency on ggraph for plotting networks as grid objects.
plot.validate_model now shows tag counts in the sample to provide a relative importance of the accuracy in making decisions.
Open, unescaped "or" regexes, i.e., an unescaped pipe followed by a closing group character, as in |), are now caught and warned about by read_term_list and thus by term_count.
metatags is an official attribute that can be used to group common tags together. This is common in qualitative coding where one tags text and then groups these subtags together into coherent metatags. This is used by tidy_counts and can be used by other future features.
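
To see why an open "or" is worth a warning, note that an empty alternative matches the empty string, so the pattern matches every text. The following Python sketch shows one simplified way to detect it; it is not the check termco performs, the name `has_open_or` is hypothetical, and the lookbehind does not handle an escaped backslash immediately before the pipe.

```python
import re

# An unescaped pipe directly before a closing parenthesis, e.g. "(good|)".
OPEN_OR = re.compile(r"(?<!\\)\|\)")


def has_open_or(pattern: str) -> bool:
    """Return True if the pattern contains an unescaped '|' followed by ')'."""
    return OPEN_OR.search(pattern) is not None


# The empty alternative lets the pattern match text it was never meant to tag.
matches_unrelated_text = re.search(r"(good|)", "bad") is not None

print(has_open_or("(good|great|)"), has_open_or("(good|great)"))
```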

CHANGES

The stopwords package replaces the tm package for providing default stopword lists. The stopwords package is more comprehensive and lighter weight. This change allows the removal of the tm package as a dependency. Suggested by Ken Benoit; see issue #69.
important_terms now uses quanteda::dfm_tfidf rather than tm::weightTfIdf. This means the tf-idf weighting is done in base 10 log rather than base 2 as done with the tm package. Suggested by Ken Benoit; see issue #69.
as_dtm & as_tdm moved to the gofastr package where they can be used by other packages and their classed objects. termco re-exports the two functions.
summary.validate_model used to return n which was the number of tags from the termco object. It now gives n.tags and n.classified to be more explicit about counts of potential tags and tags actually assigned by classify.
colo no longer uses non-standard evaluation; terms must be quoted.
ngram_collocations has been renamed to frequent_ngrams for better clarity in what the function does and as a counterpart to frequent_terms.
update_names renamed to rename_tags to be consistent with naming conventions.
term_cols renamed to tag_cols to be consistent with naming conventions.
token_count no longer has a print method of its own. The print method for term_count was made more generic and works for both since token_count inherits from term_count. This is easier to maintain.
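
The practical effect of the log base change in important_terms can be sketched as follows. The example assumes the simplified inverse document frequency log(N / df); it does not reproduce the full weighting schemes of quanteda::dfm_tfidf or tm::weightTfIdf, and the document counts are made up.

```python
import math


def idf(n_docs: int, doc_freq: int, base: float) -> float:
    """Simplified inverse document frequency: log_base(N / df)."""
    return math.log(n_docs / doc_freq, base)


n_docs, doc_freq = 100, 5          # made-up corpus size and document frequency
idf_base10 = idf(n_docs, doc_freq, 10)
idf_base2 = idf(n_docs, doc_freq, 2)

# Changing the base rescales every weight by the same constant, log2(10).
ratio = idf_base2 / idf_base10
print(round(idf_base10, 3), round(idf_base2, 3), round(ratio, 3))
```

Under this simplified formula the change alters the scale of the scores but not the ordering of terms.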

NEW FEATURES

term_cols & group_cols added to quickly grab just term or grouping variable columns.
as_dtm & as_tdm added to convert a term_count object into a tm::DocumentTermMatrix or tm::TermDocumentMatrix object.
update_names added to allow for safe renaming of a term_count object's columns while also updating its attributes as well.
term_list_template added for generating and writing term list templates.

IMPROVEMENTS

classify picks up a new default ties.method type of "probabilities". This uses the probability distribution from all tags assigned to randomly break ties based on that distribution.
term_count gets an auto-collapse feature for hierarchical term.lists with duplicate names. A message is printed telling the user this is happening. To get the hierarchical coverage use attributes(x2)[['pre_collapse_coverage']].
accuracy now uses standard model evaluation measures of macro/micro averaged accuracy, precision, and recall as outlined by Dan Jurafsky & Chris Manning. See https://www.youtube.com/watch?v=OwwdYHWRB5E&index=31&list=PL6397E4B26D00A269 for details on the methods.
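
The distinction between macro and micro averaging can be made concrete with a short Python sketch. The per-tag counts are made up for illustration, and the code shows the standard definitions only, not termco's internals.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up true positive / false positive / false negative counts per tag.
counts = {
    "tag_a": {"tp": 90, "fp": 10, "fn": 10},
    "tag_b": {"tp": 1, "fp": 1, "fn": 9},
}

# Macro: compute the measure per tag, then average (every tag weighs the same).
macro_precision = sum(precision(c["tp"], c["fp"]) for c in counts.values()) / len(counts)
macro_recall = sum(recall(c["tp"], c["fn"]) for c in counts.values()) / len(counts)

# Micro: pool the counts across tags, then compute once (frequent tags dominate).
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
fn = sum(c["fn"] for c in counts.values())
micro_precision = precision(tp, fp)
micro_recall = recall(tp, fn)

print(macro_precision, micro_precision, macro_recall, micro_recall)
```

With these numbers the rare, poorly predicted tag pulls the macro averages down much more than the micro averages.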

CHANGES

plot.tag_co_occurrence uses a bubble-dotplot for the right hand graph rather than the older bar plot. This allows for tag size to be displayed in addition to average number of other tags to determine if the tag co-occurrence is a meaningful number of tags to give additional attention to. Use tag = TRUE for the old behavior.
accuracy was renamed to evaluate to be more informative as well as a verb.

BUG FIXES

colo returned list rather than string if a single term was passed. Spotted by Steve Simpson. See issue #12.
term_count did not handle hierarchical term.list correctly due to a reordering done by data.table (when group.vars not = TRUE). This has been corrected.
Column ordering was not respected by print.term_count.
colo did not copy to the clipboard when copy2clip was TRUE and a single expression was passed to ....

NEW FEATURES

important_terms added to complement frequent_terms allowing tf-idf weighted terms to rise to the top.
collapse_tags added to combine tags/columns from term_count object without stripping the term_count class and attributes.

MINOR FEATURES

plot_counts picks up a drop argument to enable terms not found (if x is an as_terms object created from a term_count object) to be retained in the bar plot. Suggested by Steve Simpson. See issue #18.

IMPROVEMENTS

colo automatically adds a group parenthesis around ... regexes to protect the grouping explicitly. This is useful when a regex uses "or" pipes (|); without the grouping, the combined expression would be unintended and overly aggressive (see #20).
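
The precedence problem behind this change can be shown with a generic Python sketch. It does not reproduce the regex that colo actually builds; the term, the second regex piece, and the sample text are assumptions chosen to show how an unprotected alternation behaves once it is combined with more pattern.

```python
import re

term = "overall|general"
suffix = r"\s+feedback"   # hypothetical second piece joined to the term

ungrouped = term + suffix                # overall|general\s+feedback
grouped = "(" + term + ")" + suffix      # (overall|general)\s+feedback

text = "overall the course was fine"

# Without the group, "overall" alone is a complete alternative and matches.
ungrouped_hit = re.search(ungrouped, text) is not None
# With the group, the suffix applies to both alternatives, so nothing matches.
grouped_hit = re.search(grouped, text) is not None

print(ungrouped_hit, grouped_hit)
```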

NEW FEATURES

validate_model and assign_validation_task added to allow for human assessment of how accurately a model is functioning.

CHANGES

probe_colo_list, probe_colo_plot_list, & probe_colo_plot all use search_term_collocations under the hood rather than search_term + frequent_terms.

BUG FIXES

plot.term_count did not properly handle weighting. This has been fixed and allows for "count" as a choice.
search_term_which (also search_term) did not treat the and argument correctly; and was treated identically to the not argument.

NEW FEATURES

split_data added for easy creation of training and testing data.
classification_project added to make a classification modeling project template.
plot_cum_percent added for cumulative percent plot of frequent terms.
probe_ family of functions added to easily make lists of function calls for exploration of the frequent terms in the context of the data. Functions include: probe_list, probe_colo_list, probe_colo_plot_list, & probe_colo_plot.
hierarchical_coverage added to allow exploration of the unique coverage of a text vector by a term after partitioning out the elements matched by previous terms.
tag_co_occurrence added to explore tag co-occurrences.
search_term_collocations added as a convenience wrapper for search_term + frequent_terms. (Thanks to Steve Simpson)
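
The idea behind hierarchical_coverage, measuring what each term adds after earlier terms have claimed their matches, can be sketched in Python. This is an illustration of the concept under simple assumptions, not the package's code; the function name `coverage_by_term` and the sample texts are invented for the example.

```python
import re


def coverage_by_term(texts, terms):
    """Return (term, unique coverage, cumulative coverage) for each term in order.

    Each term is only credited with texts not matched by any earlier term.
    """
    remaining = list(range(len(texts)))
    total = len(texts)
    cumulative = 0.0
    rows = []
    for term in terms:
        rx = re.compile(term, flags=re.IGNORECASE)
        hits = [i for i in remaining if rx.search(texts[i])]
        remaining = [i for i in remaining if i not in hits]
        unique = len(hits) / total
        cumulative += unique
        rows.append((term, unique, cumulative))
    return rows


texts = ["good teacher", "good class", "bad class", "no comment"]
for row in coverage_by_term(texts, ["good", "class"]):
    print(row)
```

Here "class" appears in two texts, but one of them was already covered by "good", so its unique contribution is a quarter of the texts rather than half.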

MINOR FEATURES

plot_freq picks up a size argument.

IMPROVEMENTS

term_count can now be used in a hierarchical fashion. A list of regexes can be passed and counted, and then a second (or later) pass can be taken with a new set of regexes on only those rows/text elements that were left untagged (the rowSums of the counts is zero). This is accomplished by passing a list of lists of regexes. Thanks to Steve Simpson for suggesting this feature.
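
The tiered counting described above can be sketched as follows. The sketch is in Python and only mirrors the idea; the function name `tiered_count`, the use of one dict per tier in place of a list of lists, and the sample data are assumptions and do not reflect term_count's actual interface.

```python
import re


def tiered_count(texts, tiers):
    """Count tag regexes tier by tier.

    `tiers` is a list of dicts mapping tag name -> regex. A later tier is only
    applied to texts whose counts from all earlier tiers sum to zero.
    """
    tags = [tag for tier in tiers for tag in tier]
    counts = [{tag: 0 for tag in tags} for _ in texts]
    for tier in tiers:
        # Decide which rows are still untagged before this tier starts.
        untagged = [i for i, row in enumerate(counts) if sum(row.values()) == 0]
        for i in untagged:
            for tag, pattern in tier.items():
                counts[i][tag] = len(re.findall(pattern, texts[i], flags=re.IGNORECASE))
    return counts


texts = ["I like the teacher", "the class was fun", "the teacher made class fun"]
tiers = [{"teacher": r"teacher"}, {"class": r"class"}]
for row in tiered_count(texts, tiers):
    print(row)
```

The third text mentions "class", but it was already tagged by the first tier, so the second tier never looks at it.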

This package is a small suite of functions used to count terms and substrings in strings.

trinker/termco documentation built on Jan. 7, 2022, 3:32 a.m.
