desc <- suppressWarnings(readLines("DESCRIPTION"))
regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
loc <- grep(regex, desc)
ver <- gsub(regex, "\\2", desc[loc])
verbadge <- sprintf('<a href="https://img.shields.io/badge/Version-%s-orange.svg"><img src="https://img.shields.io/badge/Version-%s-orange.svg" alt="Version"/></a></p>', ver, ver)
verbadge <- ''
library(dplyr); library(termco); library(knitr)
knit_hooks$set(htmlcap = function(before, options, envir) {
  if (!before) {
    paste('<p class="caption"><b><em>', options$htmlcap, "</em></b></p>", sep = "")
  }
})
knitr::opts_knit$set(self.contained = TRUE, cache = FALSE)
knitr::opts_chunk$set(fig.path = "tools/figure/")

Project Status: Active - The project has reached a stable, usable state and is being actively developed.

termco is a suite of functions used to count and find terms and substrings in strings. The tools can be used to build expert rules, regular expression-based text classification models. The package wraps the data.table and stringi packages to create fast data frame counts of regular expression terms and substrings.
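
Below is a minimal sketch of the core counting workflow; the text, tag names, and regexes are illustrative rather than package data.

library(termco)

tags <- list(
    greeting = c("\\bhi\\b", "hello"),
    thanks = "thank"
)

term_count(c("hello there!", "hi, thank you.", "nothing to tag."), grouping.var = TRUE, tags)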

Functions

The main function of termco is term_count. It is used to extract regex term counts by grouping variable(s) as well as to generate classification models.

Most of the functions count, search, and plot terms or convert between output types, while a few remaining functions are used to train, test, and interpret models. Additionally, the probe_ family of functions generates lists of function calls or plots for given search terms. The table below describes the functions, their category of use, and a description:

| Function | Use Category | Description |
|-------------------------------|--------------|----------------------------------------------------|
| term_count | count | Count regex term occurrence; modeling |
| token_count | count | Count fixed token occurrence; modeling |
| frequent_terms/all_words | count | Frequent terms |
| important_terms | count | Important terms |
| hierarchical_coverage_term | count | Unique coverage of a text vector by terms |
| hierarchical_coverage_regex | count | Unique coverage of a text vector by regex |
| frequent_ngrams | count | Weighted frequent ngram (2 & 3) collocations |
| word_count | count | Count words |
| term_before/term_after | count | Frequency of words before/after a regex term |
| term_first | count | Frequency of words at the beginning of strings |
| colo | search | Regex output to find term collocations |
| search_term | search | Search for regex terms |
| match_word | search | Extract words from a text matching a regular expression |
| search_term_collocations | search | Wrapper for search_term + frequent_terms |
| classification_project | modeling | Make a classification modeling project template |
| classification_template | modeling | Make a classification analysis script template |
| as_dtm/as_tdm | modeling | Coerce term_count object into tm::DocumentTermMatrix/tm::TermDocumentMatrix |
| split_data | modeling | Split data into train & test sets |
| evaluate | modeling | Check accuracy of model against human coder |
| classify | modeling | Assign n tags to text from a model |
| get_text | modeling | Get the original text for model tags |
| coverage | modeling | Coverage for term_count or search_term object |
| uncovered/get_uncovered | modeling | Get the uncovered text from a model |
| mutate_counts | modeling | Apply normalizing function to term count columns |
| select_counts | modeling | Select columns without stripping count classes |
| tag_co_occurrence | modeling | Explore co-occurrence of tags from a model |
| validate_model/assign_validation_task | modeling | Human validation of a term_count model |
| read_term_list | read/write | Read a term list from an external file |
| write_term_list | read/write | Write a term list to an external file |
| term_list_template | read/write | Write a term list template to an external file |
| as_count | convert | Strip pretty printing from term_count object |
| as_terms | convert | Convert a count matrix to list of term vectors |
| as_term_list | convert | Convert a vector of terms into a named term list |
| weight | convert | Weight a term_count object proportion/percent |
| plot_ca | plot | Plot term_count object as 3-D correspondence analysis map |
| plot_counts | plot | Horizontal bar plot of group counts |
| plot_freq | plot | Vertical bar plot of frequencies of counts |
| plot_cum_percent | plot | Plot frequent_terms object as cumulative percent |
| probe_list | probe | Generate list of search_term function calls |
| probe_colo_list | probe | Generate list of search_term_collocations function calls |
| probe_colo_plot_list | probe | Generate list of search_term_collocations + plot function calls |
| probe_colo_plot | probe | Plot probe_colo_plot_list directly |

Installation

To download the development version of termco:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(
    "trinker/gofastr",
    "trinker/termco"
)

Contact

You are welcome to:

* submit suggestions and bug-reports at: https://github.com/trinker/termco/issues
* send a pull request on: https://github.com/trinker/termco/
* compose a friendly e-mail to: tyler.rinker@gmail.com

Examples

The following examples demonstrate some of the functionality of termco.

Load the Tools/Data

if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2, termco)

data(presidential_debates_2012)

Build Counts Dataframe

discourse_markers <- list(
    response_cries = c("\\boh", "\\bah", "aha", "ouch", "yuk"),
    back_channels = c("uh[- ]huh", "uhuh", "yeah"),
    summons = "hey",
    justification = "because"
)

counts <- presidential_debates_2012 %>%
    with(term_count(dialogue, grouping.var = list(person, time), discourse_markers))

counts

Printing

print(counts, pretty = FALSE)
print(counts, zero.replace = "_")

Plotting

plot(counts)
plot(counts, labels=TRUE)
plot_ca(counts, FALSE)

Ngram Collocations

termco wraps the quanteda package to examine important ngram collocations. quanteda's collocation function provides the "lambda", "z", and "frequency" measures for examining the strength of relationship between ngrams. termco adds stopword removal, min/max character filtering, and stemming to quanteda's collocation tools, as well as a generic plot method.

x <- presidential_debates_2012[["dialogue"]]

frequent_ngrams(x)
frequent_ngrams(x, gram.length = 3)
frequent_ngrams(x, order.by = "lambda")

Collocation Plotting

plot(frequent_ngrams(x))
plot(frequent_ngrams(x), drop.redundant.yaxis.text = FALSE)
plot(frequent_ngrams(x, gram.length = 3))
plot(frequent_ngrams(x, order.by = "lambda"))

Converting to Document Term Matrix

Regular expression counts can be useful features in machine learning models. The tm package's DocumentTermMatrix is a popular data structure for machine learning in R. The as_dtm and as_tdm functions are useful for coercing the count data.table structure of a term_count object into a DocumentTermMatrix/TermDocumentMatrix. The result can be combined with token/word-only DocumentTermMatrix structures using cbind & rbind (a sketch of this appears after the clustering example below).

## `markers` is not defined above; create an observation-level term_count object
## (same discourse marker regexes, but counted per dialogue element)
markers <- presidential_debates_2012 %>%
    with(term_count(dialogue, grouping.var = TRUE, discourse_markers))

as_dtm(markers)

## cosine distance between the rows (documents) of a sparse slam matrix
cosine_distance <- function (x, ...) {
    x <- t(slam::as.simple_triplet_matrix(x))
    stats::as.dist(1 - slam::crossprod_simple_triplet_matrix(x)/(sqrt(slam::col_sums(x^2) %*% 
        t(slam::col_sums(x^2)))))
}


mod <- hclust(cosine_distance(as_dtm(markers)))
plot(mod)
rect.hclust(mod, k = 5, border = "red")

(clusters <- cutree(mod, 5))
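
Below is a minimal, hedged sketch of the cbind combination mentioned above: a word-level DocumentTermMatrix is built from the same dialogue vector with tm and bound column-wise to the regex-count matrix. The object names are illustrative, and the dispatch to slam's sparse cbind method (which returns a simple_triplet_matrix) is an assumption rather than documented termco behavior.

library(tm)

## word-level DTM over the same dialogue elements used to build `markers`
word_dtm <- DocumentTermMatrix(VCorpus(VectorSource(presidential_debates_2012$dialogue)))

## column-bind regex features and word features (one row per dialogue element)
combined <- cbind(as_dtm(markers), word_dtm)
dim(combined)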

Building an Expert Rules, Regex Classifier Model

Machine learning classification models are great when you have known tags to train with because the model scales. Qualitative, expert-based human coding is terrific when you have no tagged data. However, when you have a larger, untagged data set, machine learning approaches have no outcome to learn from and the data are too large to classify by hand. One solution is an expert rules, regular expression approach that sits somewhere between machine learning and hand coding, making it well suited to tagging larger, untagged data sets. Additionally, when each text element contains larger chunks of text, unsupervised clustering-type algorithms such as k-means, non-negative matrix factorization, hierarchical clustering, or topic modeling may be of use for creating clusters that can be interpreted and treated as categories.

This example section highlights the types of function combinations and their order for a typical expert rules classification. This task typically involves the combined use of available literature, close examination of term usage within the text, and researcher experience. Building a classifier model requires the researcher to build a list of regular expressions that map to a category or tag. Below I outline a minimal workflow for classification.

Note that the user may want to begin with a classification model template that contains subdirectories and files for a classification project. The classification_project function generates this template with a pre-populated 'classification.R' script that can guide the user through the modeling process. The directory tree looks like the following:

template
    |
    |   .Rproj
    |   
    +---models
    |       categories.R
    |       
    +---data
    +---output
    +---plots
    +---reports
    \---scripts
            01_data_cleaning.R
            02_classification.R
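
As a hedged illustration (the argument shown is an assumption; consult ?classification_project for the actual signature), the template above could be generated with something like:

## hypothetical project name; exact arguments may differ
classification_project("my_classification_project")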

Load the Tools/Data

if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2, termco)

data(presidential_debates_2012)

Splitting Data

Many classification techniques require the data to be split into training and test sets to allow the researcher to observe how a model will perform on a new data set. This also prevents over-fitting the data. The split_data function allows easy splitting of data.frame or vector data by integer or proportion. The function returns a named list containing the train and test sets. The printed view is a truncated version of the returned list, with |... indicating there are additional observations.

set.seed(111)
(pres_deb_split <- split_data(presidential_debates_2012, .75))

The training set can be accessed via pres_deb_split$train; likewise, the test set can be accessed by way of pres_deb_split$test.
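
For example, the two pieces can be pulled out and stored directly:

train <- pres_deb_split$train
test <- pres_deb_split$test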

Here I show splitting by integer.

split_data(presidential_debates_2012, 100)
set.seed(112)
(pres_deb_split <- split_data(presidential_debates_2012, 100))

In the modeling examples that follow I could have trained on the training set and tested on the test set, but for simplicity I have chosen not to.

Understanding Term Use

In order to build the named list of regular expressions that map to a category/tag, the researcher must understand the terms (particularly information-salient terms) in context. Understanding term use helps the researcher begin to build a mental model of the topics in use, in a fashion similar to qualitative coding techniques. Broad categories will begin to coalesce as word use is elucidated, forming the initial names of the "named list of regular expressions". Of course, building the regular expressions in the regex model building step will allow the researcher to see new ways in which terms are used, as well as new important terms. This in turn will reshape, remove, and add names to the "named list of regular expressions". This recursive process is captured in the diagram below.

[Figure: diagram of the recursive model-building process]

View Most Used Words

A common task in building a model is to understand the most frequent words while excluding less information-rich function words. The frequent_terms function produces an ordered data frame of counts. The researcher can exclude stop words and limit the terms to those whose character lengths fall between set thresholds. The output is ordered from most to least frequent of the top n terms, but can be rearranged alphabetically.

presidential_debates_2012 %>%
    with(frequent_terms(dialogue))
presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 40)) %>%
    plot()
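
The stopword and character-length filtering described above looks something like the sketch below. The stopwords argument also appears later in this README, while min.char and alphabetical are assumed argument names rather than documented ones.

presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 20, stopwords = c("going", "think"),
        min.char = 5, alphabetical = TRUE))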

A cumulative percent can give a different view of the term usage. The plot_cum_percent function converts a frequent_terms output into a cumulative percent plot. Additionally, frequent_ngrams + plot can give insight into the frequently occurring ngrams.

presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 40)) %>%
    plot_cum_percent()

It may also be helpful to view the unique contribution of terms to coverage, excluding from the match vector all elements that were previously matched by another term. The hierarchical_coverage_term function and its accompanying plot method allow for hierarchical exploration of the unique coverage of terms.

terms <- presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 30)) %>%
    `[[`("term")

presidential_debates_2012 %>%
    with(hierarchical_coverage_term(dialogue, terms))

presidential_debates_2012 %>%
    with(hierarchical_coverage_term(dialogue, terms)) %>%
    plot(use.terms = TRUE)

View Most Used Words in Context

Much of the exploration of terms in context, in an effort to build the named list of regular expressions that map to a category/tag, involves recursive views of frequent terms in context. The probe family of functions can generate lists of function calls (and copy them to the clipboard for easy transfer), allowing the user to cycle through term lists generated by other termco tools such as frequent_terms. This is meant to standardize and speed up the process.

The first probe_ tool makes a list of function calls for search_term using a term list. Here I show just 10 terms from frequent_terms. This can be pasted into a script and then run line by line to explore the frequent terms in context.

presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 10)) %>%
    select(term) %>%
    unlist() %>%
    probe_list("presidential_debates_2012$dialogue") 

The next probe_ function generates a list of search_term_collocations function calls (search_term_collocations wraps search_term with frequent_terms and eliminates the search term from the output). This allows the user to systematically explore the words that frequently collocate with the original terms.

presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 5)) %>%
    select(term) %>%
    unlist() %>%
    probe_colo_list("presidential_debates_2012$dialogue") 

As search_term_collocations has a plot method, the user may wish to generate function calls similar to probe_colo_list but wrapped with plot for a visual exploration of the data. The probe_colo_plot_list function makes a list of such function calls, whereas probe_colo_plot plots the output directly to a single external .pdf file.

presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 5)) %>%
    select(term) %>%
    unlist() %>%
    probe_colo_plot_list("presidential_debates_2012$dialogue") 

The plots can be generated externally with the probe_colo_plot function, which makes a multi-page .pdf of frequent terms bar plots; one plot for each term.

presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 5)) %>%
    select(term) %>%
    unlist() %>%
    probe_colo_plot("presidential_debates_2012$dialogue") 

View Important Words

It may also be useful to view the top min-max scaled, tf-idf weighted terms, allowing the more information-rich terms to bubble to the top. The important_terms function allows the user to do exactly this. The function works similarly to term_count but with an information weight.

presidential_debates_2012 %>%
    with(important_terms(dialogue, 10))

Building the Model

To build a model the researcher creates a named list of regular expressions that map to a category/tag. This is fed to the term_count function. term_count allows for aggregation by grouping variables, but for building the model we usually want observation-level counts. Set grouping.var = TRUE to generate an id column running from 1 through the number of observations, giving the researcher observation-level counts.

discourse_markers <- list(
    response_cries = c("\\boh", "\\bah", "aha", "ouch", "yuk"),
    back_channels = c("uh[- ]huh", "uhuh", "yeah"),
    summons = "hey",
    justification = "because"
)

model <- presidential_debates_2012 %>%
    with(term_count(dialogue, grouping.var = TRUE, discourse_markers))

model

Testing the Model

In building a classifier the researcher is typically concerned with coverage, discrimination, and accuracy. The first two are easier to obtain while accuracy is not possible to compute without a comparison sample of expertly tagged data.

We want our model to be assigning tags to as many of the text elements as possible. The coverage function can provide an understanding of what percent of the data is tagged. Our model has relatively low coverage, indicating the regular expression model needs to be improved.

model %>%
    coverage()

Understanding how well our model discriminates is important as well. We want the model to cover as close to 100% of the data as possible, but we likely want fewer tags assigned to each element. If the model is assigning many tags to each element it is not able to discriminate well. The as_terms + plot_freq combination provides a visual representation of the model's ability to discriminate. The output is a bar plot showing the distribution of the number of tags at the element level. The goal is to have a larger density at $1$ tag. Note that the plot also gives a view of coverage, as the zero bar shows the frequency of elements that could not be tagged. Our model has a larger distribution of $1$ tag compared to the $> 1$ tag distributions, though the coverage is very poor. As the number of tags increases, the ability of the model to discriminate typically lessens. There is often a trade-off between model coverage and discrimination.

model %>%
    as_terms() %>%
    plot_freq(size=3) + xlab("Number of Tags")

We may also want to see the distribution of the tags as well. The combination of as_terms + plot_counts gives the distribution of the tags. In our model the majority of tags are applied to the summons category.

model %>%
    as_terms() %>%
    plot_counts() + xlab("Tags")

Improving the Model

Improving Coverage

The model does not have very good coverage. To improve this the researcher will want to look at the data with no coverage to try to build additional regular expressions and categories. This requires understanding language, noticing additional features of the data with no coverage that may map to categories, and building regular expressions to model these features. This section will outline some of the tools that can be used to detect features and build regular expressions to model these language features.

We first want to view the untagged data. The uncovered function provides a logical vector that can be used to extract the text with no tags, while get_uncovered returns the untagged text directly.

untagged <- get_uncovered(model)

head(untagged)
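
The logical-vector route described above looks like the following small sketch; it should flag the same untagged elements.

no_tag <- uncovered(model)
sum(no_tag)  ## how many dialogue elements received no tag
head(presidential_debates_2012$dialogue[no_tag])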

The frequent_terms function can be used again to understand common features of the untagged data.

untagged %>%
    frequent_terms()

We may see a common term such as the word right and want to see what other terms collocate with it. Using a regular expression that searches for multiple terms can improve a model's accuracy and ability to discriminate. Using search_term in combination with frequent_terms can be a powerful way to see which words tend to collocate. Here I pass a regex for right (\\bright) to search_term. This pulls up the text that contains this term. I then use frequent_terms to see what words frequently occur with the word right. We notice the word people tends to occur with right.

untagged %>%
    search_term("\\bright") %>%
    frequent_terms(10, stopwords = "right")

The search_term_collocations function provides a convenient wrapper for search_term + frequent_terms which also removes the search term from the output.

untagged %>%
    search_term_collocations("\\bright", n=10)

This is an exploratory act. Finding the right combination of features that occur together requires lots of recursive noticing, trialing, testing, reading, interpreting, and deciding. After noticing above that the terms people and course appear with the term right, we will want to see these text elements. We can use a grouped-or expression with colo to build a regular expression that will search for any text elements that contain these two terms anywhere. colo is more powerful than initially shown here; I demonstrate further functionality below. Here is the regex produced.

colo("\\bright", "(people|course)")

This is extremely powerful when used inside of search_term as the text containing this regular expression will be returned along with the coverage proportion on the uncovered data.

search_term(untagged, colo("\\bright", "(people|course)"))

We notice right away that the phrase right course appears often. We can create a search with just this expression.

Note that the decision to include a regular expression in the model is up to the researcher. We must guard against over-fitting the model, making it not transferable to new, similar contexts.

search_term(untagged, "right course")

Based on the frequent_terms output above, the word jobs also seems important. Again, we use the search_term + frequent_terms combo to extract words collocating with jobs.

search_term_collocations(untagged, "jobs", n=15)

As stated above, colo is a powerful search tool as it can take multiple regular expressions as well as allowing for multiple negations (i.e., find x but not if y). To include multiple negations use a grouped-or regex as shown below.

## Where do `jobs` and `create` collocate?
search_term(untagged, colo("jobs", "create")) 
## Where do `jobs`, `create`,  and the word `not` collocate?
search_term(untagged, colo("jobs", "create", "(not|'nt)")) 
## Where do `jobs` and `create` collocate without a `not` word?
search_term(untagged, colo("jobs", "create", not = "(not|'nt)")) 
## Where do `jobs`, `romney`, and `create` collocate?
search_term(untagged, colo("jobs", "create", "romney")) 

Here is one more example with colo for the words jobs and overseas. The user may want to quickly test and then transfer the regex created by colo to the regular expression list. By setting options(termco.copy2clip = TRUE) the user globally sets colo to use the clipr package to copy the regex to the clipboard for a better workflow.

search_term(untagged, colo("jobs", "overseas")) 

The researcher uses an iterative process to continue to build the regular expression list. The term_count function builds the matrix of counts to further test the model. The use of (a) coverage, (b) as_terms + plot_counts, and (c) as_terms + plot_freq will allow for continued testing of model functioning.

Improving Discrimination

It is often desirable to improve discrimination. While the bar plot highlighting the distribution of the number of tags is useful, it only indicates whether there is a problem, not where the problem lies. The tag_co_occurrence function produces a list of data.frames and matrices that aid in understanding how to improve discrimination. This list is useful, but the plot method provides an improved visual view of the co-occurrences of tags.

The network plot on the left shows the strength of relationships between tags, while the plot on the right shows the average number of other tags that co-occur with each regex tag. In this particular case the plot combo is not complex because of the limited number of regex tags. Note that the edge strength is relative to all other edges; it has to be considered in the context of the bar/dot plot on the right, which shows the average number of other tags that co-occur with each regex tag. As the number of tags increases, the plot increases in complexity. The unconnected nodes and shorter bars represent the tags that provide the best discriminatory power, whereas the other tags have the potential to be redundant.

tag_co_occurrence(model) %>%
    plot(min.edge.cutoff = .01)

Another way to view the overlapping complexity and relationships between tags is an UpSet plot. The plot_upset function wraps UpSetR::upset and handles term_count objects directly. UpSet plots are complex and require some study of the method in order to interpret the results (http://caleydo.org/tools/upset). The time invested in learning this plot type can pay off, as the technique scales to the kinds of data sets that termco outputs. This tool can be useful for understanding overlap and thus improving discrimination.

plot_upset(model) 

Categorizing/Tagging

The classify function enables the researcher to apply $n$ tags to each text element. Depending on the text and the regular expression list's ability, multiple tags may be applied to a text. The n argument sets the maximum number of tags, though the function does not guarantee this many (or any) tags will be assigned.

Here I show the head of the returned vector (if n > 1 a list may be returned) as well as a table and plot of the counts. Use n = Inf to return all tags.

classify(model) %>%
    head()
classify(model) %>%
    unlist() %>%
    table()
classify(model) %>%
    unlist() %>%
    plot_counts() + xlab("Tags")

Evaluation: Accuracy

Pre Coded Data

The evaluate function is a more formal method of evaluation than validate_model. It tests a model's accuracy, precision, and recall using macro and micro averages of the confusion matrices for each tag, as outlined by Dan Jurafsky & Chris Manning. The function requires a known, human-coded sample. In the example below I randomly generate a "known human-coded tag" vector; obviously, this is for demonstration purposes. The function outputs a pretty printing of a list. Note that if a larger, known tagged data set is available the user may want to strongly consider machine learning models (see: RTextTools).

This minimal example will provide insight into the way the evaluate scores behave:

known <- list(1:3, 3, NA, 4:5, 2:4, 5, integer(0))
tagged <- list(1:3, 3, 4, 5:4, c(2, 4:3), 5, integer(0))
evaluate(tagged, known)

Below we create fake "known" tags to test evaluate with real data (though the comparison is fabricated).

mod1 <- presidential_debates_2012 %>%
    with(term_count(dialogue, TRUE, discourse_markers)) %>%
    classify()

fake_known <- mod1
set.seed(1)
fake_known[sample(1:length(fake_known), 300)] <- "random noise"

evaluate(mod1, fake_known)

Post Coding Data

It is often useful to validate a model less formally via human evaluation, checking that text is being tagged as expected. This approach is more formative and less rigorous than evaluate, and is intended for assessing model functioning in order to improve it. The validate_model function provides an interactive interface for a single evaluator to sample n tags and corresponding texts and assess the accuracy of the tag against the text. The assign_validation_task function generates external file(s) for n coders, providing redundancy of code assessments. This may be of use in Mechanical Turk type applications. The example below demonstrates validate_model's print/summary and plot outputs.

## interactive validation (prompts the evaluator for input)
validated <- model %>%
    validate_model()

## a previously saved validation object is loaded here for demonstration
data(validated)

After validate_model has been run, the print/summary and plot methods provide an accuracy for each tag and a confidence level (note that the confidence band is highly affected by the number of samples per tag).

validated
plot(validated)

These examples give guidance on how to use the tools in the termco package to build an expert rules, regular expression text classification model.


