
textclass

Charter Sevier


This package provides a tool that automates common text analytics and presents the results as interpretable graphical output.

How to load:

devtools::install_github("williamcsevier/textclass", dependencies = TRUE, build_vignettes = TRUE)

library(textclass)

Accompanying this package is a Shiny App for AFICA Information Technology contract text analysis:

https://charterapps.shinyapps.io/afica_topic_modeling_and_text_analysis/

vignette("textclass") provides detailed information on how the package is intended to be used.

Description

The package will be able to accept dataframes or tibbles containing text data and perform various text analysis procedures, with options to return the raw data for further manipulation, or ggplot graphics of the text analysis methods. This allows the user to easily analyze their text data without having to do any of the required preprocessing. In addition, even the most novice of data scientists can portray their text analysis results cleanly, with limited knowledge of R.

Specifically, this package gives the user options to output the most frequent terms, n-gram analysis, and topic modeling analysis. For topic modeling, the user chooses how many topics to model, and the package outputs the terms most associated with each topic, as well as word-word correlation and document-topic association.
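To make the n-gram idea concrete, here is a minimal base R sketch that counts bigrams in a toy corpus. It is an illustration only and is not how textclass implements n-gram analysis; the sample text and all variable names are hypothetical.

```r
# Toy corpus (hypothetical data, not from the package).
text <- c("cloud hosting services contract", "cloud hosting support")

# Lower-case and split each document into words.
tokens <- strsplit(tolower(text), "\\s+")

# n is the number of consecutive words in each n-gram (2 = bigrams).
n <- 2

# Slide a window of n words across each document and paste the words together.
ngrams <- unlist(lapply(tokens, function(tok) {
  if (length(tok) < n) return(character(0))
  starts <- seq_len(length(tok) - n + 1)
  vapply(starts, function(i) paste(tok[i:(i + n - 1)], collapse = " "),
         character(1))
}))

# Count how often each n-gram occurs, most frequent first.
ngram_freq <- sort(table(ngrams), decreasing = TRUE)
print(ngram_freq)
```

Changing `n` to 3 would count trigrams instead; a bar plot of `ngram_freq` is the kind of output described above.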

A Shiny app was developed to accompany my thesis research, so that AFICA analysts can carry out exploratory text analysis.

End-user

The typical end user for whom this analytic is being developed is anyone who wishes to present their text data in an interpretable fashion with limited understanding of how text analytics operates. However, this tool could be equally useful to a seasoned data scientist who wishes to quickly accomplish exploratory analysis of text data while also extracting useful raw data for follow-on text analysis outside the scope of this tool.

Prerequisite Knowledge

The end-user will have to know how to import a .csv file into R as a dataframe and how to install and load an R package from CRAN and GitHub. The package will output default products (term frequency, n-gram analysis) without user-specified parameters, but a user with more knowledge of text analysis could choose specific parameters for their output and increase the number of analysis products generated from their text data.
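As a rough guide to the first prerequisite, the sketch below writes a small .csv file to a temporary location and reads it back as a dataframe with base R. The file contents and the column names `id` and `description` are made up for illustration; the column layout textclass expects is described in the vignette, not here.

```r
# Create a small example .csv in a temporary file (hypothetical data).
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,description",
    "1,Network support services",
    "2,Software license renewal"),
  csv_path
)

# Read the file into a dataframe, keeping text columns as character strings.
contracts <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the structure: one row per document, one column holding the text.
str(contracts)
```

With real data, `csv_path` would simply be the path to the user's own file.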

Term Dictionary

Term Frequency: The frequency of terms throughout the corpus (the collection of documents)

Term Frequency-Inverse Document Frequency (tf-idf): The frequency of a term in a document, weighted down by the number of documents in which the term appears.
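The two definitions above can be worked through by hand in base R. This sketch assumes the common formulation tf = (term count in document) / (words in document) and idf = ln(number of documents / number of documents containing the term); whether textclass uses exactly this weighting is an assumption. The toy corpus and variable names are hypothetical.

```r
# Toy corpus: three short "documents" (hypothetical data).
docs <- c(
  d1 = "network support contract for network hardware",
  d2 = "software license contract",
  d3 = "hardware maintenance support"
)

# Lower-case and split each document into words.
tokens <- strsplit(tolower(docs), "\\s+")

# Term frequency across the whole corpus, most frequent first.
corpus_tf <- sort(table(unlist(tokens)), decreasing = TRUE)

# Per-document term counts as a document-by-term matrix.
vocab <- sort(unique(unlist(tokens)))
counts <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

# tf: share of each document's words taken up by each term.
tf <- counts / rowSums(counts)

# idf: log of (number of documents / number of documents containing the term).
idf <- log(nrow(counts) / colSums(counts > 0))

# tf-idf: multiply each term's tf column by that term's idf.
tf_idf <- sweep(tf, 2, idf, `*`)
print(round(tf_idf, 3))
```

Under this formulation a term that appears in every document has idf = ln(1) = 0, so its tf-idf is zero no matter how often it occurs.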

Latent Dirichlet Allocation (LDA): Algorithm for topic modeling. It estimates both the mixture of topics in each document and the mixture of words in each topic simultaneously. The following variables are generated by the LDA model:

    Beta: per-topic-per-word probability

    Gamma: per-document-per-topic probability
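For readers who want to see what Beta and Gamma look like in tidy form, the following sketch fits a two-topic LDA model with the topicmodels package and extracts both matrices with tidytext. It assumes both packages are installed, and it uses the AssociatedPress dataset bundled with topicmodels as stand-in data. It does not show textclass's own functions, and the choice of 100 documents, `k = 2`, and the seed is arbitrary.

```r
library(topicmodels)
library(tidytext)

# Example document-term matrix shipped with topicmodels (stand-in data).
data("AssociatedPress", package = "topicmodels")

# Fit LDA with a user-chosen number of topics; a fixed seed keeps runs repeatable.
ap_lda <- LDA(AssociatedPress[1:100, ], k = 2, control = list(seed = 1234))

# Beta: one row per topic-term pair, giving the probability of the term in the topic.
beta <- tidy(ap_lda, matrix = "beta")

# Gamma: one row per document-topic pair, giving the share of the document in the topic.
gamma <- tidy(ap_lda, matrix = "gamma")

head(beta)
head(gamma)
```

Within each topic the beta values sum to 1 across terms, and within each document the gamma values sum to 1 across topics.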

Existing R-Packages required

This analytic utilizes the tidytext package for tidy-form representation of LDA model outputs.

How to access package?

The end-user will access textclass through GitHub using devtools: williamcsevier/textclass

There are no security concerns.

The package uses the ggplot2 package for graphical output.

| Description | Priority | Status | Value | Inputs | Outputs | Application | Achievable? | Current Version? |
|---|---|---|---|---|---|---|---|---|
| Term Frequency | 1 | in-work | Provides most/least frequent terms | Text dataframe, most frequent (T or F), and top number of terms portrayed | ggplot bar plot of terms and their frequencies | Analysis of most frequent terms in corpus | Yes | Yes |
| n-gram Analysis | 2 | in-work | Provides most/least frequent n-grams | Number of n-grams to be portrayed, and the value of n (number of words in each n-gram) | ggplot bar plot of n-grams and their frequencies | Analysis of the frequencies with which words are used in conjunction | Yes | Yes |
| Topic Modeling | 3 | in-work | Provides modeled topics based on an LDA model for a user-specified number of topics | The number of topics the user wishes to model, and how many of the most associated words to portray | Faceted ggplot of the words most associated with each topic | Facilitates interpretation of the modeled topics based on their most associated words | Yes | Yes |

williamcsevier/textclass documentation built on May 26, 2019, 5:36 a.m.