knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

rbow

R-CMD-check

rbow allows you to analyze texts with a view towards relevant terms, their contexts and associations among each other. It can can run word frequency in context analyses on multiple texts and dictionaries.

rbow currently has implementations for three main types of analysis:

  1. Deductive approach: allows you to analyze frequency of co-occurrence of two sets of terms (phenomena and descriptors). You may wish to use this feature in order to test hypotheses as to the manner in which phenomena are commonly described.

  2. Inductive approach: allows you to analyze frequency of terms associated with a set of terms. You may wish to use this feature for exploratory analysis about how phenomena are commonly described

  3. Term Frequency Inverse Document Frequency (TF-IDF) Analysis. You may wish to use this feature to identify texts that are most relevant to a chosen set of terms.

rbow further enables you to compute bootstrapped confidence intervals for the frequency measures derived from analysis 1, and plot your results. It also includes some utility functions for text cleaning and stemming.

Installation

You can install the development version of rbow from GitHub with:

devtools::install_github("till-tietz/rbow")

Usage

The following example, while not at all an interesting/principled analysis, will nonetheless hopefully give a decent overview of the types of analyses rbow supports and you may wish to perform.

library(rbow)
library(tidyverse)

#lets get some data to analyze. we'll be using the corpus of jane austen novels from 
#janeaustenr (because why not)

#install.packages("janeaustenr")
library(janeaustenr)

books <- janeaustenr::austen_books()

#lets transform our data such that we have a data.frame of book names and each book's text 
#as a single character string 
books <- books%>%
  dplyr::group_by(book)%>%
  dplyr::mutate(text = paste(text, collapse = " "))%>%
  dplyr::slice(.,1)%>%
  dplyr::ungroup()%>%
  dplyr::mutate_at(.,2, as.character())%>%
  tibble::column_to_rownames(., var = "book")

#as rbow operates on lists we'll transform our data.frame into a list 
books <- setNames(split(books, seq(nrow(books))), rownames(books))

#now we'll transform each book's text from a character string to a character vector
#(i.e we'll tokenize each book)
books <- lapply(books, function(x)strsplit(x[,1]," ")[[1]])

#you may wish to run rbow's utility functions for text cleaning and stemming at this point
#clean_text will turn your tokens to lower case, remove stopwords, symbols, numbers etc. 
#stem_texts stems your tokens
books <- rbow::clean_text(texts = books, rm_stopwords = TRUE, stopwords_language = "en")
books <- rbow::stem_texts(texts = books, language = "english")

#now let's define some phenomena we want to analyze. we might wish to know whether 
#men and women are described/regarded differently in jane austen's work.
#we'll define a list of terms capturing men and women
#you can supply your own regex and set own_regex to TRUE in the analysis functions
#or let rbow construct a default regex for you (* overrides the word end boundary)
phenomena <- list(female = c("mrs","ms","miss","she","her","lady"),
                  male = c("mr","sir","he","him","lord"))

phenomena <- rbow::stem_texts(texts = phenomena, language = "english")

#let's create a set of descriptor terms to deductively test some hypothesis about how descriptions of men and women differ in jane austen's work 
descriptors <- list(positive = c("ador*","affection*","appreciat*","cheer*","content*","deligh*","ecsta*","enjoy*","fondness","glad","happy","hope","joy","love","loves","lovin"),
                    anxiety = c("araid","anxi*","apprehens*","doom","dread*","fear*","fright","nervous","panic","paranoi*","petrif*","phobi*","scare*","scary","terrifi*","terrify*"),
                    anger = c("aggravat*","anger","angr*","annoy*","appall*","contempt","despis*","frustrat","fury","furious","hate*","mad","resent"),
                    sadness = c("aline","anguish*","apath*","bitter","crushed","depress*","despair","disappoint*","grief","griev*","heartbreak*","helpless","hopeless","loss","sad","melanchol*","sorrow"))

You can now analyze how frequently positive, anxious, angry or sad descriptors occur within some window around male or female words i.e. whether male or female words are relatively more frequently associated with these four descriptors.

bow_analysis <- rbow::bow_analysis(corpus = books, phenomenon = phenomena, descriptors = descriptors,
                                   window = 10, per_occurrence = TRUE, own_regex = FALSE)

You can create bootstrap confidence intervals for these estimates like this

future::plan("multisession")
cis <- rbow::bow_ci(bow_analysis_output = bow_analysis, bootstraps = 1000,
                    alpha = 0.95, window = 10, per_occurrence = TRUE, 
                    bootstrap_terms = TRUE)

and create a simple ci plot

plot_data <- rbow::create_plot_data(bstrap_output = cis)

# we now have a data frame of ggplot ready results for each text
# to plot the estimates and cis for text_1 simply call 

rbow::ci_plot(plot_data = plot_data[[1]])

# you can plot subsets of phenomena and descriptors like this 

plot_data <- rbow::create_plot_data(bstrap_output = cis,
                                    phenomena = c("female"),
                                    descriptors = c("positive","anxiety"))

rbow::ci_plot(plot_data = plot_data[[1]])

You may also wish to explore how female and male words are commonly described in Jane Austen's work inductively. dfm_analysis caputres the most frequently used terms within some window of your phenomena terms.

dfm <- rbow::dfm_analysis(corpus = books, phenomenon = phenomena, window = 10,
                          n_terms = 10, own_regex = FALSE)
head(dfm[[1]])

You can extract words that are unique to the context of your phenomena terms by computing tf-idf instead of raw frequencies

dfm <- rbow::dfm_analysis(corpus = books, phenomenon = phenomena, window = 10,
                          n_terms = 10, tf_idf = TRUE ,own_regex = FALSE)
head(dfm[[1]])
````

If you additionally only want to consider certain types of words (i.e. adjectives or adverbs) in your frequency analysis you can do the following

```r
#look at types of words to filter by 
unique(tidytext::parts_of_speech[,"pos"])


#display only adjectives and adverbs
dfm <- rbow::dfm_analysis(corpus = books, phenomenon = phenomena, window = 10, n_terms = 10,
                                  tf_idf = TRUE, filter_ps = TRUE, ps = c("Adjective","Adverb"),
                                  own_regex = FALSE)
head(dfm[[1]])

You can finally subset the output of dfm_analysis by another dictionary/set of terms

dfm <- rbow::dfm_analysis(corpus = books, phenomenon = phenomena, window = 10, n_terms = 10,
                                  tf_idf = TRUE, filter_ps = TRUE, ps = c("Adjective","Adverb"),
                                  filter_dictionary = descriptors[[4]], own_regex = FALSE)
head(dfm[[4]])

If you simply wish to find out which text is most relevant to a certain dictionary you can use rbow's implementation of tf-idf

pride_and_prejudice_names <- c("bingley","bennet","darcy")

tf_idf <- rbow::tf_idf(corpus = books, terms = pride_and_prejudice_names)

head(tf_idf)


till-tietz/rbow documentation built on Oct. 21, 2021, 9:16 p.m.