rbow allows you to analyze texts with a view towards relevant terms, their contexts and associations among each other. It can run word frequency in context analyses on multiple texts and dictionaries.
rbow currently has implementations for three main types of analysis:
1. Deductive approach: allows you to analyze the frequency of co-occurrence of two sets of terms (phenomena and descriptors). You may wish to use this feature to test hypotheses about the manner in which phenomena are commonly described.
2. Inductive approach: allows you to analyze the frequency of terms associated with a set of terms. You may wish to use this feature for exploratory analysis of how phenomena are commonly described.
3. Term Frequency-Inverse Document Frequency (TF-IDF) analysis: You may wish to use this feature to identify the texts that are most relevant to a chosen set of terms.
rbow further enables you to compute bootstrapped confidence intervals for the frequency measures derived from analysis 1, and plot your results. It also includes some utility functions for text cleaning and stemming.
You can install the development version of rbow from GitHub with:
```r
# install.packages("devtools")
devtools::install_github("till-tietz/rbow")
```
The following example, while not at all an interesting or principled analysis, will nonetheless hopefully give a decent overview of the types of analyses rbow supports and that you may wish to perform.
```r
library(rbow)
library(tidyverse)

# let's get some data to analyze. we'll be using the corpus of jane austen novels from
# janeaustenr (because why not)
# install.packages("janeaustenr")
library(janeaustenr)

books <- janeaustenr::austen_books()

# let's transform our data such that we have a data.frame of book names and each book's text
# as a single character string
books <- books %>%
  dplyr::group_by(book) %>%
  dplyr::mutate(text = paste(text, collapse = " ")) %>%
  dplyr::slice(., 1) %>%
  dplyr::ungroup() %>%
  dplyr::mutate_at(., 2, as.character) %>%
  tibble::column_to_rownames(., var = "book")

# as rbow operates on lists we'll transform our data.frame into a list
books <- setNames(split(books, seq(nrow(books))), rownames(books))

# now we'll transform each book's text from a character string to a character vector
# (i.e. we'll tokenize each book)
books <- lapply(books, function(x) strsplit(x[, 1], " ")[[1]])

# you may wish to run rbow's utility functions for text cleaning and stemming at this point
# clean_text turns your tokens to lower case and removes stopwords, symbols, numbers etc.
# stem_texts stems your tokens
books <- rbow::clean_text(texts = books, rm_stopwords = TRUE, stopwords_language = "en")
books <- rbow::stem_texts(texts = books, language = "english")

# now let's define some phenomena we want to analyze. we might wish to know whether
# men and women are described/regarded differently in jane austen's work.
# we'll define a list of terms capturing men and women
# you can supply your own regex and set own_regex to TRUE in the analysis functions
# or let rbow construct a default regex for you (* overrides the word end boundary)
phenomena <- list(female = c("mrs", "ms", "miss", "she", "her", "lady"),
                  male = c("mr", "sir", "he", "him", "lord"))

phenomena <- rbow::stem_texts(texts = phenomena, language = "english")

# let's create a set of descriptor terms to deductively test some hypotheses about how
# descriptions of men and women differ in jane austen's work
descriptors <- list(
  positive = c("ador*", "affection*", "appreciat*", "cheer*", "content*", "deligh*", "ecsta*",
               "enjoy*", "fondness", "glad", "happy", "hope", "joy", "love", "loves", "lovin"),
  anxiety = c("afraid", "anxi*", "apprehens*", "doom", "dread*", "fear*", "fright", "nervous",
              "panic", "paranoi*", "petrif*", "phobi*", "scare*", "scary", "terrifi*", "terrify*"),
  anger = c("aggravat*", "anger", "angr*", "annoy*", "appall*", "contempt", "despis*",
            "frustrat*", "fury", "furious", "hate*", "mad", "resent"),
  sadness = c("alone", "anguish*", "apath*", "bitter", "crushed", "depress*", "despair",
              "disappoint*", "grief", "griev*", "heartbreak*", "helpless", "hopeless", "loss",
              "sad", "melanchol*", "sorrow")
)
```
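At this point it can be useful to check that the preprocessing did what you expect. A quick look at the tokens with plain base R (nothing rbow-specific):

```r
# first 20 cleaned and stemmed tokens of the first book
head(books[[1]], 20)

# the stemmed phenomenon terms
phenomena
```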
You can now analyze how frequently positive, anxious, angry or sad descriptors occur within some window around male or female words, i.e. whether male or female words are relatively more frequently associated with these four descriptors.
```r
bow_analysis <- rbow::bow_analysis(corpus = books,
                                   phenomenon = phenomena,
                                   descriptors = descriptors,
                                   window = 10,
                                   per_occurrence = TRUE,
                                   own_regex = FALSE)
```
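The exact shape of the output is easiest to see by inspecting it directly; a generic check (assuming bow_analysis returns a list with one element per text, as the plotting functions below suggest):

```r
# overview of the result for the first book without printing everything
str(bow_analysis[[1]], max.level = 2)
```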
You can create bootstrap confidence intervals for these estimates like this:
```r
future::plan("multisession")

cis <- rbow::bow_ci(bow_analysis_output = bow_analysis,
                    bootstraps = 1000,
                    alpha = 0.95,
                    window = 10,
                    per_occurrence = TRUE,
                    bootstrap_terms = TRUE)
```
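The future::plan("multisession") call sets up parallel workers for the bootstrap; this is standard future usage rather than part of rbow. If you want to go back to sequential evaluation afterwards:

```r
# reset to sequential evaluation once the bootstrap has finished
future::plan("sequential")
```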
and create a simple CI plot:
```r
plot_data <- rbow::create_plot_data(bstrap_output = cis)

# we now have a data.frame of ggplot-ready results for each text
# to plot the estimates and cis for the first text simply call
rbow::ci_plot(plot_data = plot_data[[1]])

# you can plot subsets of phenomena and descriptors like this
plot_data <- rbow::create_plot_data(bstrap_output = cis,
                                    phenomena = c("female"),
                                    descriptors = c("positive", "anxiety"))

rbow::ci_plot(plot_data = plot_data[[1]])
```
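Assuming ci_plot returns a ggplot object (an assumption based on the ggplot-ready plot data, not something documented here), you can store and save the plot like any other ggplot:

```r
p <- rbow::ci_plot(plot_data = plot_data[[1]])

# save to disk; file name and dimensions are arbitrary choices
ggplot2::ggsave("female_male_descriptors.png", plot = p, width = 8, height = 5)
```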
You may also wish to explore inductively how female and male words are commonly described in Jane Austen's work. dfm_analysis captures the most frequently used terms within some window around your phenomena terms.
```r
dfm <- rbow::dfm_analysis(corpus = books,
                          phenomenon = phenomena,
                          window = 10,
                          n_terms = 10,
                          own_regex = FALSE)

head(dfm[[1]])
```
You can extract words that are unique to the context of your phenomena terms by computing tf-idf instead of raw frequencies:
```r
dfm <- rbow::dfm_analysis(corpus = books,
                          phenomenon = phenomena,
                          window = 10,
                          n_terms = 10,
                          tf_idf = TRUE,
                          own_regex = FALSE)

head(dfm[[1]])
```

If you additionally only want to consider certain types of words (e.g. adjectives or adverbs) in your frequency analysis, you can do the following:

```r
# look at the types of words you can filter by
unique(tidytext::parts_of_speech[, "pos"])

# display only adjectives and adverbs
dfm <- rbow::dfm_analysis(corpus = books,
                          phenomenon = phenomena,
                          window = 10,
                          n_terms = 10,
                          tf_idf = TRUE,
                          filter_ps = TRUE,
                          ps = c("Adjective", "Adverb"),
                          own_regex = FALSE)

head(dfm[[1]])
```
Finally, you can subset the output of dfm_analysis by another dictionary/set of terms:
```r
dfm <- rbow::dfm_analysis(corpus = books,
                          phenomenon = phenomena,
                          window = 10,
                          n_terms = 10,
                          tf_idf = TRUE,
                          filter_ps = TRUE,
                          ps = c("Adjective", "Adverb"),
                          filter_dictionary = descriptors[[4]],
                          own_regex = FALSE)

head(dfm[[4]])
```
If you simply wish to find out which text is most relevant to a certain dictionary, you can use rbow's implementation of tf-idf:
```r
pride_and_prejudice_names <- c("bingley", "bennet", "darcy")

tf_idf <- rbow::tf_idf(corpus = books, terms = pride_and_prejudice_names)

head(tf_idf)
```
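Assuming tf_idf returns one score per text (check the output of head(tf_idf) for the actual column name; "tf_idf" below is a guess), you can rank the books by relevance with ordinary data frame operations:

```r
# order texts from most to least relevant; replace "tf_idf" with the actual score column
tf_idf[order(tf_idf$tf_idf, decreasing = TRUE), ]
```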