desc <- suppressWarnings(readLines("DESCRIPTION")) regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)" loc <- grep(regex, desc) ver <- gsub(regex, "\\2", desc[loc]) verbadge <- sprintf('<a href="https://img.shields.io/badge/Version-%s-orange.svg"><img src="https://img.shields.io/badge/Version-%s-orange.svg" alt="Version"/></a></p>', ver, ver) verbage <- '' library(dplyr);library(termco);library(knitr)
knit_hooks$set(htmlcap = function(before, options, envir) { if(!before) { paste('<p class="caption"><b><em>',options$htmlcap,"</em></b></p>",sep="") } }) knitr::opts_knit$set(self.contained = TRUE, cache = FALSE) knitr::opts_chunk$set(fig.path = "tools/figure/")
termco is a suite of functions used to count and find terms and substrings in strings. The tools can be used to build an expert rules, regular expression based text classification model. The package wraps the data.table and stringi packages to create fast data frame counts of regular expression terms and substrings.
The main function of termco is term_count
. It is used to extract regex term counts by grouping variable(s) as well as to generate classification models.
Most of the functions count, search, plot terms, and covert between output types, while a few remaining functions are used to train, test and interpret models. Additionally, the probe_
family of function generate lists of function calls or plots for given search terms. The table below describes the functions, category of use, and their description:
| Function | Use Category | Description |
|------------------------------|--------------|---------------------------------------------------|
| term_count
| count | Count regex term occurrence; modeling |
| token_count
| count | Count fixed token occurrence; modeling |
| frequent_terms
/all_words
| count | Frequent terms |
| important_terms
| count | Important terms |
| hierarchical_coverage_term
| count | Unique coverage of a text vector by terms |
| hierarchical_coverage_regex
| count | Unique coverage of a text vector by regex |
| frequent_ngrams
| count | Weighted frequent ngram (2 & 3) collocations |
| word_count
| count | Count words |
| term_before
/term_after
| count | Frequency of words before/after a regex term |
| term_first
| count | Frequency of words at the begining of strings |
| colo
| search | Regex output to find term collocations |
| search_term
| search | Search for regex terms |
| match_word
| search | Extract words from a text matching a regular expression |
| search_term_collocations
| search | Wrapper for search_term
+ frequent_terms
|
| classification_project
| modeling | Make a classification modeling project template |
| classification_template
| modeling | Make a classification analysis script template |
| as_dtm
/as_tdm
| modeling | Coerce term_count
object into tm::DocumentTermMatrix
/tm::TermDocumentMatrix
|
| split_data
| modeling | Split data into train
& test
sets |
| evaluate
| modeling | Check accuracy of model against human coder |
| classify
| modeling | Assign n tags to text from a model |
| get_text
| modeling | Get the original text for model tags |
| coverage
| modeling | Coverage for term_count
or search_term
object |
| uncovered
/get_uncovered
| modeling | Get the uncovered text from a model |
| mutate_counts
| modeling | Apply normalizing function to term count columns |
| select_counts
| modeling | Select columns without stripping count classes |
| tag_co_occurrence
| modeling | Explore co-occurrence of tags from a model |
| validate_model
/assign_validation_task
| modeling | Human validation of a term_count
model |
| read_term_list
| read/write | Read a term list from an external file |
| write_term_list
| read/write | Write a term list to an external file |
| term_list_template
| read/write | Write a term list template to an external file |
| as_count
| convert | Strip pretty printing from term_count
object |
| as_terms
| convert | Convert a count matrix to list of term vectors |
| as_term_list
| convert | Convert a vector of terms into a named term list |
| weight
| convert | Weight a term_count
object proportion/percent |
| plot_ca
| plot | Plot term_count
object as 3-D correspondence analysis map |
| plot_counts
| plot | Horizontal bar plot of group counts |
| plot_freq
| plot | Vertical bar plot of frequencies of counts |
| plot_cum_percent
| plot | Plot frequent_terms
object as cumulative percent |
| probe_list
| probe | Generate list of search_term
function calls |
| probe_colo_list
| probe | Generate list of search_term_collocations
function calls |
| probe_colo_plot_list
| probe | Generate list of search_term_collocationss
+ plot
function calls |
| probe_colo_plot
| probe | Plot probe_colo_plot_list
directly |
To download the development version of termco:
Download the zip ball or tar ball, decompress and run R CMD INSTALL
on it, or use the pacman package to install the development version:
if (!require("pacman")) install.packages("pacman") pacman::p_load_gh( "trinker/gofastr", "trinker/termco" )
You are welcome to: submit suggestions and bug-reports at: https://github.com/trinker/termco/issues send a pull request on: https://github.com/trinker/termco/ * compose a friendly e-mail to: tyler.rinker@gmail.com
The following examples demonstrate some of the functionality of termco.
if (!require("pacman")) install.packages("pacman") pacman::p_load(dplyr, ggplot2, termco) data(presidential_debates_2012)
discoure_markers <- list( response_cries = c("\\boh", "\\bah", "aha", "ouch", "yuk"), back_channels = c("uh[- ]huh", "uhuh", "yeah"), summons = "hey", justification = "because" ) counts <- presidential_debates_2012 %>% with(term_count(dialogue, grouping.var = list(person, time), discoure_markers)) counts
print(counts, pretty = FALSE) print(counts, zero.replace = "_")
plot(counts) plot(counts, labels=TRUE) plot_ca(counts, FALSE)
termco wraps the quanteda package to examine important ngram collocations. quanteda's collocation
function provides measures of: "lambda"
, "z"
, and "frequency"
to examine the strength of relationship between ngrams. termco adds stopword removal, min/max character filtering, and stemming to quanteda's collocation
as well as a generic plot
method.
x <- presidential_debates_2012[["dialogue"]] frequent_ngrams(x) frequent_ngrams(x, gram.length = 3) frequent_ngrams(x, order.by = "lambda")
plot(frequent_ngrams(x)) plot(frequent_ngrams(x), drop.redundant.yaxis.text = FALSE) plot(frequent_ngrams(x, gram.length = 3)) plot(frequent_ngrams(x, order.by = "lambda"))
Regular expression counts can be useful features in machine learning models. The tm package's DocumentTermMatrix
is a popular data structure for machine learning in R. The as_dtm
and as_tdm
functions are useful for coercing the count data.table
structure of a term_count
object into a DocumentTermMatrix
/TermDocumentMatrix
. The result can be combined with token/word only DocumentTermMatrix
structures using cbind
& rbind
.
as_dtm(markers) cosine_distance <- function (x, ...) { x <- t(slam::as.simple_triplet_matrix(x)) stats::as.dist(1 - slam::crossprod_simple_triplet_matrix(x)/(sqrt(slam::col_sums(x^2) %*% t(slam::col_sums(x^2))))) } mod <- hclust(cosine_distance(as_dtm(markers))) plot(mod) rect.hclust(mod, k = 5, border = "red") (clusters <- cutree(mod, 5))
Machine learning models of classification are great when you have known tags to train with because the model scales. Qualitative, expert based human coding is terrific for when you have no tagged data. However, when you have a larger, untagged data set the machine learning approaches have no outcome to learn from and the data is too large to classify by hand. One solution is to use a expert rules, regular expression approach that is somewhere between machine learning and hand coding. This is one solution for tagging larger, untagged data sets. Additionally, when each text element contains larger chunks of text, unsupervised clustering type algorithms such as k-means, non-negative matrix factorization, hierarchical clustering, or topic modeling may be of use for creating clusters that could be interpreted and treated as categories.
This example section highlights the types of function combinations and order for a typical expert rules classification. This task typically involves the combined use of available literature, close examinations of term usage within text, and researcher experience. Building a classifier model requires the researcher to build a list of regular expressions that map to a category or tag. Below I outline minimal work flow for classification.
Note that the user may want to begin with a classification model template that contains subdirectories and files for a classification project. The classification_project
generates this template with a pre-populated 'classification.R' script that can guide the user through the modeling process. The directory tree looks like the following:
template | | .Rproj | +---models | categories.R | +---data +---output +---plots +---reports \---scripts 01_data_cleaning.R 02_classification.R
if (!require("pacman")) install.packages("pacman") pacman::p_load(dplyr, ggplot2, termco) data(presidential_debates_2012)
Many classification techniques require the data to be split into a training and test set to allow the researcher to observe how a model will perform on a new data set. This also prevents over-fitting the data. The split_data
function allows easy splitting of data.frame
or vector
data by integer or proportion. The function returns a named list of the data set into a train
and test
set. The printed view is a truncated version of the returned list with |...
indicating there are additional observations.
set.seed(111) (pres_deb_split <- split_data(presidential_debates_2012, .75))
The training set can be accessed via pres_deb_split$train
; likewise, the test set can be accessed by way of pres_deb_split$test
.
Here I show splitting by integer.
split_data(presidential_debates_2012, 100)
set.seed(112) (pres_deb_split <- split_data(presidential_debates_2012, 100))
I could have trained on the training set and tested on the testing set in the following examples around modeling but have chosen not to for simplicity.
In order to build the named list of regular expressions that map to a category/tag the researcher must understand the terms (particularly information salient terms) in context. The understanding of term use helps the researcher to begin to build a mental model of the topics being used in a fashion similar to qualitative coding techniques. Broad categories will begin to coalesce as word use is elucidated. It forms the initial names of the "named list of regular expressions". Of course building the regular expressions in the regex model building step will allow the researcher to see new ways in which terms are used as well as new important terms. This in turn will reshape, remove, and add names to the "named list of regular expressions". This recursive process is captured in the model below.
A common task in building a model is to understand the most frequent words while excluding less information rich function words. The frequnt_terms
function produces an ordered data frame of counts. The researcher can exclude stop words and limit the terms to contain n characters between set thresholds. The output is ordered by most to least frequent n terms but can be rearranged alphabetically.
presidential_debates_2012 %>% with(frequent_terms(dialogue)) presidential_debates_2012 %>% with(frequent_terms(dialogue, 40)) %>% plot()
A cumulative percent can give a different view of the term usage. The plot_cum_percent
function converts a frequent_terms
output into a cumulative percent plot. Additionally, frequent_ngrams
+ plot
can give insight into the frequently occurring ngrams.
presidential_debates_2012 %>% with(frequent_terms(dialogue, 40)) %>% plot_cum_percent()
It may also be helpful to view the unique contribution of terms on the coverage excluding all elements from the match vector that were previously matched by another term. The hierarchical_coverage_term
and accompanying plot
method allows for hierarchical exploration of the unique coverage of terms.
terms <- presidential_debates_2012 %>% with(frequent_terms(dialogue, 30)) %>% `[[`("term") presidential_debates_2012 %>% with(hierarchical_coverage_term(dialogue, terms)) presidential_debates_2012 %>% with(hierarchical_coverage_term(dialogue, terms)) %>% plot(use.terms = TRUE)
Much of the exploration of terms in context in effort to build the named list of regular expressions that map to a category/tag involves recursive views of frequent terms in context. The probe
family of functions can generate lists of function calls (and copy them to the clipboard for easy transfer) allowing the user to circulate through term lists generated from other termco tools such as frequent_terms
. This is meant to standardize and speed up the process.
The first probe_
tool makes a list of function calls for search_term
using a term list. Here I show just 10 terms from frequent_terms
. This can be pasted into a script and then run line by line to explore the frequent terms in context.
presidential_debates_2012 %>% with(frequent_terms(dialogue, 10)) %>% select(term) %>% unlist() %>% probe_list("presidential_debates_2012$dialogue")
The next probe_
function generates a list of search_term_collocations
function calls (search_term_collocations
wraps search_term
with frequent_terms
and eliminates the search term from the output). This allows the user to systematically explore the words that frequently collocate with the original terms.
presidential_debates_2012 %>% with(frequent_terms(dialogue, 5)) %>% select(term) %>% unlist() %>% probe_colo_list("presidential_debates_2012$dialogue")
As search_term_collocations
has a plot
method the user may wish to generate function calls similar to probe_colo_list
but wrapped with plot
for a visual exploration of the data. The probe_colo_plot_list
makes a list of such function calls, whereas the probe_colo_plot
plots the output directly to a single external .pdf file.
presidential_debates_2012 %>% with(frequent_terms(dialogue, 5)) %>% select(term) %>% unlist() %>% probe_colo_plot_list("presidential_debates_2012$dialogue")
The plots can be generated externally with the probe_colo_plot
function which makes multi-page .pdf of frequent terms bar plots; one plot for each term.
presidential_debates_2012 %>% with(frequent_terms(dialogue, 5)) %>% select(term) %>% unlist() %>% probe_colo_plot("presidential_debates_2012$dialogue")
It may also be useful to view top min-max scaled tf-idf weighted terms to allow the more information rich terms to bubble to the top. The important_terms
function allows the user to do exactly this. The function works similar to term_count
but with an information weight.
presidential_debates_2012 %>% with(important_terms(dialogue, 10))
To build a model the researcher created a named list of regular expressions that map to a category/tag. This is fed to the term_count
function. term_count
allows for aggregation by grouping variables but for building the model we usually want to get observation level counts. Set grouping.var = TRUE
to generate an id
column of 1 through number of observation which gives the researcher the observation level counts.
discoure_markers <- list( response_cries = c("\\boh", "\\bah", "aha", "ouch", "yuk"), back_channels = c("uh[- ]huh", "uhuh", "yeah"), summons = "hey", justification = "because" ) model <- presidential_debates_2012 %>% with(term_count(dialogue, grouping.var = TRUE, discoure_markers)) model
In building a classifier the researcher is typically concerned with coverage, discrimination, and accuracy. The first two are easier to obtain while accuracy is not possible to compute without a comparison sample of expertly tagged data.
We want our model to be assigning tags to as many of the text elements as possible. The coverage
function can provide an understanding of what percent of the data is tagged. Our model has relatively low coverage, indicating the regular expression model needs to be improved.
model %>% coverage()
Understanding how well our model discriminates is important as well. We want the model to cover as close to 100% of the data as possible, but likely want fewer tags assigned to each element. If the model is tagging many tags to each element it is not able to discriminate well. The as_terms
+ plot_freq
function provides a visual representation of the model's ability to discriminate. The output is a bar plot showing the distribution of the number of tags at the element level. The goal is to have a larger density at $1$ tag. Note that the plot also gives a view of coverage, as the zero bar shows the frequency of elements that could not be tagged. Our model has a larger distribution of $1$ tag compared to the $> 1$ tag distributions, though the coverage is very poor. As the number of tags increases the ability of the model to discriminate typically lessens. There is often a trade off between model coverage and discrimination.
model %>% as_terms() %>% plot_freq(size=3) + xlab("Number of Tags")
We may also want to see the distribution of the tags as well. The combination of as_terms
+ plot_counts
gives the distribution of the tags. In our model the majority of tags are applied to the summons category.
model %>% as_terms() %>% plot_counts() + xlab("Tags")
The model does not have very good coverage. To improve this the researcher will want to look at the data with no coverage to try to build additional regular expressions and categories. This requires understanding language, noticing additional features of the data with no coverage that may map to categories, and building regular expressions to model these features. This section will outline some of the tools that can be used to detect features and build regular expressions to model these language features.
We first want to view the untagged data. The uncovered
function provides a logical vector that can be used to extract the text with no tags.
untagged <- get_uncovered(model) head(untagged)
The frequent_terms
function can be used again to understand common features of the untagged data.
untagged %>% frequent_terms()
We may see a common term such as the word right and want to see what other terms collocate with it. Using a regular expression that searches for multiple terms can improve a model's accuracy and ability to discriminate. Using search_term
in combination with frequent_terms
can be a powerful way to see which words tend to collocate. Here I pass a regex for right (\\bright
) to search_term
. This pulls up the text that contains this term. I then use frequent_terms
to see what words frequently occur with the word right. We notice the word people tends to occur with right.
untagged %>% search_term("\\bright") %>% frequent_terms(10, stopwords = "right")
The search_term_collocations
function provides a convenient wrapper for search_term
+ frequent_terms
which also removes the search term from the output.
untagged %>% search_term_collocations("\\bright", n=10)
This is an exploratory act. Finding the right combination of features that occur together requires lots of recursive noticing, trialling, testing, reading, interpreting, and deciding. After we noticed that the terms people and course appear with the term right above we will want to see these text elements. We can use a grouped-or expression with colo
to build a regular expression that will search for any text elements that contain these two terms anywhere. colo
is more powerful than initially shown here; I demonstrate further functionality below. Here is the regex produced.
colo("\\bright", "(people|course)")
This is extremely powerful when used inside of search_term
as the text containing this regular expression will be returned along with the coverage proportion on the uncovered data.
search_term(untagged, colo("\\bright", "(people|course)"))
We notice right away that the phrase right course appears often. We can create a search with just this expression.
Note that the decision to include a regular expression in the model is up to the researcher. We must guard against over-fitting the model, making it not transferable to new, similar contexts.
search_term(untagged, "right course")
Based on the frequent_terms
output above, the word jobs also seems important. Again, we use the search_term
+ frequent_terms
combo to extract words collocating with jobs.
search_term_collocations(untagged, "jobs", n=15)
As stated above, colo
is a powerful search tool as it can take multiple regular expressions as well as allowing for multiple negations (i.e., find x but not if y). To include multiple negations use a grouped-or regex as shown below.
## Where do `jobs` and `create` collocate? search_term(untagged, colo("jobs", "create")) ## Where do `jobs`, `create`, and the word `not` collocate? search_term(untagged, colo("jobs", "create", "(not|'nt)")) ## Where do `jobs` and`create` collocate without a `not` word? search_term(untagged, colo("jobs", "create", not = "(not|'nt)")) ## Where do `jobs`, `romney`, and `create` collocate? search_term(untagged, colo("jobs", "create", "romney"))
Here is one more example with colo
for the words jobs and overseas. The user may want to quickly test and then transfer the regex created by colo
to the regular expression list. By setting options(termco.copy2clip = TRUE)
the user globally sets colo
to use the clipr package to copy the regex to the clipboard for better work flow.
search_term(untagged, colo("jobs", "overseas"))
The researcher uses an iterative process to continue to build the regular expression list. The term_count
function builds the matrix of counts to further test the model. The use of (a) coverage
, (b) as_terms
+ plot_counts
, and (c) as_terms
+ freq_counts
will allow for continued testing of model functioning.
It is often desirable to improve discrimination. While the bar plot highlighting the distribution of the number of tags is useful, it only indicates if there is a problem, not where the problem lies. The tag_co_occurrence
function produces a list of data.frame
and matrices
that aide in understanding how to improve discrimination. This list is useful, but the plot
method provides an improved visual view of the co-occurrences of tags.
The network plot on the left shows the strength of relationships between tags, while the plot on the right shows the average number of other tags that co-occur with each regex tag. In this particular case the plot combo is not complex because of the limited number of regex tags. Note that the edge strength is relative to all other edges. The strength has to be considered in the context of the average number of other tags that co-occur with each regex tag bar/dot plot on the right. As the number of tags increases the plot increases in complexity. The unconnected nodes and shorter bars represent the tags that provide the best discriminatory power, whereas the other tags have the potential to be redundant.
tag_co_occurrence(model) %>% plot(min.edge.cutoff = .01)
Another way to view the overlapping complexity and relationships between tags is to use an Upset plot. The plot_upset
function wraps UpSetR::upset
and is made to handle term_count
objects directly. Upset plots are complex and require study of the method in order to interpret the results (http://caleydo.org/tools/upset). The time invested in learning this plot type can be very fruitful in utilizing a technique that scales to the types of data sets that termco outputs. This tool can be useful in order to understand overlap and thus improve discrimination.
plot_upset(model)
The classify
function enables the researcher to apply $n$ tags to each text element. Depending on the text and the regular expression list's ability, multiple tags may be applied to a text. The n
argument allows the maximum number of tags to be set though the function does not guarantee this many (or any) tags will be assigned.
Here I show the head
of the returned vector (if n
> 1 a list
may be returned) as well as a table
and plot of the counts. Use n = Inf
to return all tags.
classify(model) %>% head() classify(model) %>% unlist() %>% table() classify(model) %>% unlist() %>% plot_counts() + xlab("Tags")
The evaluate
function is a more formal method of evaluation than validate_model
. The evaluate
function yields a test a model's accuracy, precision, and recall using macro and micro averages of the confusion matrices for each tag as outlined by Dan Jurafsky & Chris Manning. The function requires a known, human coded sample. In the example below I randomly generate "known human coded tagged" vector. Obviously, this is for demonstration purposes. The model outputs a pretty printing of a list. Note that if a larger, known tagging set of data is available the user may want to strongly consider machine learning models (see: RTextTools).
This minimal example will provide insight into the way the evaluate scores behave:
known <- list(1:3, 3, NA, 4:5, 2:4, 5, integer(0)) tagged <- list(1:3, 3, 4, 5:4, c(2, 4:3), 5, integer(0)) evaluate(tagged, known)
Below we create fake "known" tags to test evaluate
with real data (though the comparison is fabricated).
mod1 <- presidential_debates_2012 %>% with(term_count(dialogue, TRUE, discoure_markers)) %>% classify() fake_known <- mod1 set.seed(1) fake_known[sample(1:length(fake_known), 300)] <- "random noise" evaluate(mod1, fake_known)
It is often useful to less formally, validate a model via human evaluation; checking that text is being tagged as expected. This approach is more formative and less rigorous than evaluate
, intended to be used to assess model functioning in order to improve it. The validate_model
provides an interactive interface for a single evaluator to sample n tags and corresponding texts and assess the accuracy of the tag to the text. The assign_validation_task
generates an external file(s) for n coders for redundancy of code assessments. This may be of use in Mechanical Turk type applications. The example below demonstrates validate_model
's print
/summary
and plot
outputs.
validated <- model %>% validate_model()
data(validated)
After validate_model
has been run the print
/summary
and plot
provides an accuracy of each tag and a confidence level (note that the confidence band is highly affected by the number of samples per tag).
validated
plot(validated)
These examples give guidance on how to use the tools in the termco package to build an expert rules, regular expression text classification model.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.