knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

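This vignette outlines the intended use of the textclass package in its current form. It covers the provided data and the functions term_frequency, tidyDTM, plot_topics, optimal_topics, normalize_metrics, plot_optimal_topics, and rank_topics.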

Provided Data

This package includes the AFICAdata dataset, a data frame of all Information Technology contracts for installation support from the Air Force Installation Contracting Agency (AFICA).

term_frequency

The term_frequency function plots the frequency of terms or n-grams, where n is the number of consecutive terms in each n-gram. For example, n = 3 plots the frequency of three-word sequences.

library(textclass)
term_frequency(AFICAdata, n=3)
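
To make the idea of n-gram frequency concrete, the following is a minimal base R sketch of counting three-word sequences. It is not the implementation of term_frequency; the toy documents and the helper make_ngrams are hypothetical and exist only for illustration.

```r
# Toy stand-in for contract descriptions (hypothetical data, not AFICAdata)
docs <- c("network support services for base",
          "network support services renewal",
          "help desk support services")

# Split each document into lowercase word tokens
tokens <- strsplit(tolower(docs), "\\s+")

# Build all runs of n consecutive tokens within one document
# (make_ngrams is a hypothetical helper, not part of textclass)
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

n <- 3
ngrams <- unlist(lapply(tokens, make_ngrams, n = n))

# Count n-grams and sort from most to least frequent
freq <- sort(table(ngrams), decreasing = TRUE)
print(freq)
```

Here "network support services" is the only trigram that occurs in more than one document, so it would sit at the top of a frequency plot.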

tidyDTM

The tidyDTM function constructs a document-term matrix (DTM) from the data at a given sparsity threshold.

#tidyDTM(data, sparsity)
dtm <- tidyDTM(AFICAdata, 0.98)
dtm
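
The role of the sparsity argument can be sketched in base R. This example assumes that sparsity acts as an upper bound on the share of documents in which a term is absent, similar in spirit to removeSparseTerms in the tm package; whether tidyDTM works exactly this way is an assumption, and all names below are hypothetical.

```r
# Toy documents (hypothetical data, not AFICAdata)
docs <- c("network support services",
          "network maintenance services",
          "network help desk",
          "software license renewal")

tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: rows are documents, columns are terms, cells are counts
dtm_toy <- t(vapply(tokens,
                    function(w) as.numeric(table(factor(w, levels = vocab))),
                    numeric(length(vocab))))
colnames(dtm_toy) <- vocab

# Sparsity of a term = share of documents in which it never occurs
term_sparsity <- colMeans(dtm_toy == 0)

# Keep only terms whose sparsity is below the threshold (assumed rule)
max_sparsity <- 0.6
dtm_small <- dtm_toy[, term_sparsity < max_sparsity, drop = FALSE]
print(dtm_small)
```

A threshold close to 1, such as the 0.98 used above, keeps almost every term; a lower threshold keeps only terms that appear in many documents.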

plot_topics

The plot_topics function plots the topics from a fitted Latent Dirichlet Allocation (LDA) model. The plot shows, for each topic, the terms most strongly associated with it, as measured by beta (the per-topic term probability). LDA is an external function, not part of textclass, that the user calls to fit the model and to set the number of topics k.

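# LDA() is not defined in textclass; it is assumed here to be topicmodels::LDA
library(topicmodels)
lda <- LDA(dtm, k = 4)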
plot_topics(lda)
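
As a rough illustration of what "most associated terms" means, the sketch below picks the highest-beta terms per topic from a small hand-written beta matrix. The matrix values and the helper top_terms are hypothetical; this is not how plot_topics is implemented, only the selection step it visualizes.

```r
# Hypothetical per-topic term probabilities (beta); each row sums to 1
beta <- rbind(
  topic1 = c(network = 0.50, server = 0.30, license = 0.15, desk = 0.05),
  topic2 = c(network = 0.05, server = 0.10, license = 0.55, desk = 0.30)
)

# For each topic, return the top_n terms with the largest beta
# (top_terms is a hypothetical helper, not part of textclass)
top_terms <- function(beta, top_n = 2) {
  lapply(seq_len(nrow(beta)), function(k) {
    head(sort(beta[k, ], decreasing = TRUE), top_n)
  })
}

result <- top_terms(beta)
names(result) <- rownames(beta)
print(result)
```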

optimal_topics, normalize_metrics, and plot_optimal_topics

The functions optimal_topics, normalize_metrics, and plot_optimal_topics allow the user to evaluate the topic structure for every number of topics k from 2 to 30. The metrics used are those proposed by Cao and Deveaud: the Cao metric should be minimized, while the Deveaud metric should be maximized. Running optimal_topics is slow, particularly on machines with fewer than four cores, so the package includes pre-calculated results in topic_data.

#returns raw values for optimal topic analysis
#values <- optimal_topics(dtm)
values <- topic_data

#normalizes the values
norm_values <- normalize_metrics(values)
#plots the optimal topic analysis metrics
plot_optimal_topics(norm_values)
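
Normalizing puts both metrics on a common scale so they can be compared on one plot. One common way to do this is min-max scaling, sketched below. That normalize_metrics uses min-max scaling is an assumption, and the column names and values are hypothetical rather than taken from topic_data.

```r
# Hypothetical raw metric values for k = 2..6 (not taken from topic_data)
raw <- data.frame(
  topics  = 2:6,
  Cao     = c(0.40, 0.25, 0.10, 0.15, 0.30),
  Deveaud = c(1.0, 1.8, 2.2, 2.0, 1.2)
)

# Rescale a numeric vector linearly so its minimum is 0 and its maximum is 1
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

scaled <- raw
scaled$Cao <- min_max(raw$Cao)
scaled$Deveaud <- min_max(raw$Deveaud)
print(scaled)
```

After scaling, the k with the lowest Cao value maps to 0 and the k with the highest Deveaud value maps to 1, which makes the two curves directly comparable.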

Because the Cao metric is minimized and the Deveaud metric is maximized at the optimum, the two metrics may not agree on a single best k. The rank_topics function returns a table that ranks the candidate numbers of topics by the best balance between the two metrics, so the analyst can assess every plausible optimum.

#creates a table of rank ordered topic numbers
rank_topics(norm_values)
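
One simple way to express "best balance" is to score each k by the normalized Deveaud value minus the normalized Cao value and sort by that score. The sketch below uses this rule as an assumption; the actual scoring used by rank_topics may differ, and the data and column names are hypothetical.

```r
# Hypothetical normalized metrics on a 0-1 scale (not package output)
norm_toy <- data.frame(
  topics  = 2:6,
  Cao     = c(1.00, 0.50, 0.00, 0.20, 0.70),
  Deveaud = c(0.00, 0.60, 1.00, 0.90, 0.20)
)

# Balance score (assumed rule): high Deveaud and low Cao both raise the score
norm_toy$score <- norm_toy$Deveaud - norm_toy$Cao

# Order candidate topic numbers from best to worst balance
ranked <- norm_toy[order(norm_toy$score, decreasing = TRUE), ]
ranked$rank <- seq_len(nrow(ranked))
print(ranked)
```

In this toy table k = 4 ranks first because it has both the lowest Cao and the highest Deveaud value, with k = 5 as the next candidate.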


williamcsevier/textclass documentation built on May 26, 2019, 5:36 a.m.