rbow allows you to analyze texts with a view towards relevant terms, their contexts and their associations with each other. It can run word-frequency-in-context analyses on multiple texts and dictionaries.
rbow currently has implementations for three main types of analysis:
1. Deductive approach: allows you to analyze frequency of co-occurrence of two sets of terms (phenomena and descriptors). You may wish to use this feature in order to test hypotheses as to the manner in which phenomena are commonly described.
2. Inductive approach: allows you to analyze frequency of terms associated with a set of terms. You may wish to use this feature for exploratory analysis about how phenomena are commonly described.
3. Term Frequency Inverse Document Frequency (TF-IDF) Analysis. You may wish to use this feature to identify texts that are most relevant to a chosen set of terms.
rbow further enables you to compute bootstrapped confidence intervals for the frequency measures derived from analysis 1, and plot your results. It also includes some utility functions for text cleaning and stemming.
You can install the development version of rbow from GitHub with:
devtools::install_github("till-tietz/rbow")
The following example is not meant as an interesting or principled analysis, but it gives an overview of the types of analyses rbow supports and that you may wish to perform.
library(rbow)
library(tidyverse)
#let's get some data to analyze. we'll be using the corpus of jane austen novels from
#janeaustenr (because why not)
#install.packages("janeaustenr")
library(janeaustenr)
books <- janeaustenr::austen_books()
#let's transform our data such that we have a data.frame of book names and each book's text
#as a single character string
books <- books%>%
dplyr::group_by(book)%>%
dplyr::mutate(text = paste(text, collapse = " "))%>%
dplyr::slice(.,1)%>%
dplyr::ungroup()%>%
dplyr::mutate_at(.,2, as.character)%>%
tibble::column_to_rownames(., var = "book")
#as rbow operates on lists we'll transform our data.frame into a list
books <- setNames(split(books, seq(nrow(books))), rownames(books))
#now we'll transform each book's text from a character string to a character vector
#(i.e we'll tokenize each book)
books <- lapply(books, function(x)strsplit(x[,1]," ")[[1]])
#you may wish to run rbow's utility functions for text cleaning and stemming at this point
#clean_text will turn your tokens to lower case, remove stopwords, symbols, numbers etc.
#stem_texts stems your tokens
books <- rbow::clean_text(texts = books, rm_stopwords = TRUE, stopwords_language = "en")
books <- rbow::stem_texts(texts = books, language = "english")
#now let's define some phenomena we want to analyze. we might wish to know whether
#men and women are described/regarded differently in jane austen's work.
#we'll define a list of terms capturing men and women
#you can supply your own regex and set own_regex to TRUE in the analysis functions
#or let rbow construct a default regex for you (* overrides the word end boundary)
phenomena <- list(female = c("mrs","ms","miss","she","her","lady"),
male = c("mr","sir","he","him","lord"))
phenomena <- rbow::stem_texts(texts = phenomena, language = "english")
#let's create a set of descriptor terms to deductively test some hypothesis about how descriptions of men and women differ in jane austen's work
descriptors <- list(positive = c("ador*","affection*","appreciat*","cheer*","content*","deligh*","ecsta*","enjoy*","fondness","glad","happy","hope","joy","love","loves","lovin"),
anxiety = c("afraid","anxi*","apprehens*","doom","dread*","fear*","fright","nervous","panic","paranoi*","petrif*","phobi*","scare*","scary","terrifi*","terrify*"),
anger = c("aggravat*","anger","angr*","annoy*","appall*","contempt","despis*","frustrat","fury","furious","hate*","mad","resent"),
sadness = c("aline","anguish*","apath*","bitter","crushed","depress*","despair","disappoint*","grief","griev*","heartbreak*","helpless","hopeless","loss","sad","melanchol*","sorrow"))
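The comments above note that a trailing * overrides the word end boundary in the default regex. The exact pattern rbow builds is not shown in this document, so the following is only a sketch of one plausible construction. `term_to_regex` is a hypothetical helper introduced for illustration, not an rbow function.

```r
# Hypothetical helper (not part of rbow): turn a dictionary term into a regex.
# Assumption: a trailing "*" means "match any word that starts with this stem".
term_to_regex <- function(term) {
  if (grepl("\\*$", term)) {
    # drop the "*" and leave the end of the word open
    paste0("\\b", sub("\\*$", "", term))
  } else {
    # no "*": require a word boundary on both sides
    paste0("\\b", term, "\\b")
  }
}

tokens <- c("cheerful", "cheer", "glad", "gladly")

# wildcard term: matches every token that starts with "cheer"
wildcard_hits <- grepl(term_to_regex("cheer*"), tokens, perl = TRUE)

# exact term: matches "glad" but not "gladly"
exact_hits <- grepl(term_to_regex("glad"), tokens, perl = TRUE)
```

Under this reading, "cheer*" would pick up both "cheer" and "cheerful", while "glad" without a wildcard would only match the whole word.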
You can now analyze how frequently positive, anxious, angry or sad descriptors occur within a window around male or female words, i.e. whether male or female words are relatively more often associated with each of these four descriptor sets.
bow_analysis <- rbow::bow_analysis(corpus = books, phenomenon = phenomena, descriptors = descriptors,
window = 10, per_occurrence = TRUE, own_regex = FALSE)
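To make the idea of a window concrete, here is a base R sketch that counts descriptor hits within a fixed number of tokens around each phenomenon hit. It is an illustration under stated assumptions, not rbow's implementation: `count_in_window` is a hypothetical helper, matching is exact rather than regex based, and reading `per_occurrence` as "divide by the number of phenomenon hits" is an assumption.

```r
# Hypothetical helper (not part of rbow): count descriptor tokens that fall
# within `window` tokens on either side of each phenomenon token.
count_in_window <- function(tokens, phenomenon, descriptors, window = 2,
                            per_occurrence = TRUE) {
  hits <- which(tokens %in% phenomenon)
  if (length(hits) == 0) return(0)
  n <- vapply(hits, function(i) {
    # positions inside the window, excluding the phenomenon token itself
    idx <- setdiff(max(1, i - window):min(length(tokens), i + window), i)
    sum(tokens[idx] %in% descriptors)
  }, integer(1))
  total <- sum(n)
  # assumed meaning of per_occurrence: average count per phenomenon hit
  if (per_occurrence) total / length(hits) else total
}

# invented toy tokens, not taken from the novels
tokens <- c("she", "was", "glad", "and", "he", "was", "sad", "but", "she", "smiled")
female_positive <- count_in_window(tokens, phenomenon = "she",
                                   descriptors = c("glad", "joy"))
female_positive
```

In this toy example "she" occurs twice and only the first occurrence has "glad" within two tokens, so the per-occurrence value is 0.5.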
You can create bootstrap confidence intervals for these estimates like this:
future::plan("multisession")
cis <- rbow::bow_ci(bow_analysis_output = bow_analysis, bootstraps = 1000,
alpha = 0.95, window = 10, per_occurrence = TRUE,
bootstrap_terms = TRUE)
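For readers unfamiliar with bootstrapping, the general idea can be sketched in base R as a percentile bootstrap of a mean. This document does not describe what `bow_ci` resamples internally, so the sketch below is a generic illustration rather than a description of that function. The counts are invented, and the 95% level is chosen here only to mirror the value passed to `alpha` above.

```r
set.seed(1)  # make the resampling reproducible

# Invented per-occurrence descriptor counts (one value per phenomenon hit)
counts <- c(0, 1, 0, 2, 0, 0, 1, 3, 0, 1)

# Resample the counts with replacement and recompute the mean each time
boot_means <- replicate(1000, mean(sample(counts, replace = TRUE)))

# Percentile interval: the 2.5% and 97.5% quantiles of the bootstrap means
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

The interval summarizes how much the estimated frequency would vary if the same text were resampled many times.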
You can then create a simple confidence interval plot:
plot_data <- rbow::create_plot_data(bstrap_output = cis)
# we now have a data frame of ggplot ready results for each text
# to plot the estimates and cis for text_1 simply call
rbow::ci_plot(plot_data = plot_data[[1]])
# you can plot subsets of phenomena and descriptors like this
plot_data <- rbow::create_plot_data(bstrap_output = cis,
phenomena = c("female"),
descriptors = c("positive","anxiety"))
rbow::ci_plot(plot_data = plot_data[[1]])
You may also wish to explore inductively how female and male words are commonly described in Jane Austen’s work. dfm_analysis captures the most frequently used terms within some window of your phenomena terms.
dfm <- rbow::dfm_analysis(corpus = books, phenomenon = phenomena, window = 10,
n_terms = 10, own_regex = FALSE)
head(dfm[[1]])
#> $female
#> Var1 Freq
#> 1 dashwood 333
#> 2 jen 320
#> 3 s 308
#> 4 elinor 192
#> 5 mrs 182
#> 6 middleton 168
#> 7 said 159
#> 8 mariann 153
#> 9 ferrar 123
#> 10 sister 111
#>
#> $male
#> Var1 Freq
#> 1 john 139
#> 2 s 116
#> 3 mrs 68
#> 4 palmer 63
#> 5 said 63
#> 6 ferrar 56
#> 7 dashwood 55
#> 8 elinor 53
#> 9 willoughbi 51
#> 10 know 47
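The inductive approach can be pictured as collecting every token that falls inside the window around a phenomenon term and tabulating the results. The base R sketch below illustrates that idea on invented tokens. `top_context_terms` is a hypothetical helper using exact matching, and it is not how `dfm_analysis` is necessarily implemented.

```r
# Hypothetical helper (not part of rbow): most frequent tokens found within
# `window` tokens of any phenomenon token.
top_context_terms <- function(tokens, phenomenon, window = 2, n_terms = 3) {
  hits <- which(tokens %in% phenomenon)
  if (length(hits) == 0) {
    return(data.frame(context = character(0), Freq = integer(0)))
  }
  # gather the neighbouring tokens of every hit
  context <- unlist(lapply(hits, function(i) {
    idx <- setdiff(max(1, i - window):min(length(tokens), i + window), i)
    tokens[idx]
  }))
  # tabulate and keep the n_terms most frequent context tokens
  freq <- as.data.frame(table(context), stringsAsFactors = FALSE)
  head(freq[order(-freq$Freq), ], n_terms)
}

# invented toy tokens
tokens <- c("miss", "bennet", "said", "that", "miss", "bennet", "was", "well")
top_terms <- top_context_terms(tokens, phenomenon = "miss")
top_terms
```

As in the output above, names and very common words tend to dominate raw frequency counts, which is the motivation for the tf-idf weighting discussed next.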
You can extract words that are distinctive of the context of your phenomena terms by computing tf-idf instead of raw frequencies:
dfm <- rbow::dfm_analysis(corpus = books, phenomenon = phenomena, window = 10,
n_terms = 10, tf_idf = TRUE ,own_regex = FALSE)
head(dfm[[1]])
#> $female
#> Var1 tf-idf
#> 1115 jen 0.0120285845
#> 1321 mrs 0.0068412574
#> 1291 miss 0.0039468793
#> 1145 ladi 0.0031950928
#> 1313 morton 0.0009021438
#> 645 elder 0.0004886612
#> 673 enforc 0.0002631253
#> 1102 introduct 0.0001879466
#> 1286 mirth 0.0001879466
#> 1872 spark 0.0001879466
#>
#> $male
#> Var1 tf-idf
#> 843 mr 0.0042394323
#> 1171 sir 0.0022257020
#> 983 pratt 0.0013778155
#> 766 lord 0.0007419007
#> 260 conjur 0.0003179574
#> 612 henri 0.0003179574
#> 47 allus 0.0002119716
#> 223 clerk 0.0002119716
#> 421 em 0.0002119716
#> 547 fuss 0.0002119716
If you additionally want to consider only certain types of words (e.g. adjectives or adverbs) in your frequency analysis, you can do the following:
#look at types of words to filter by
unique(tidytext::parts_of_speech[,"pos"])
#> # A tibble: 14 x 1
#> pos
#> <chr>
#> 1 Adjective
#> 2 Noun
#> 3 <NA>
#> 4 Plural
#> 5 Adverb
#> 6 Preposition
#> 7 Verb (transitive)
#> 8 Verb (usu participle)
#> 9 Verb (intransitive)
#> 10 Interjection
#> 11 Noun Phrase
#> 12 Conjunction
#> 13 Definite Article
#> 14 Pronoun
#display only adjectives and adverbs
dfm <- rbow::dfm_analysis(corpus = books, phenomenon = phenomena, window = 10, n_terms = 10,
tf_idf = TRUE, filter_ps = TRUE, ps = c("Adjective","Adverb"),
own_regex = FALSE)
head(dfm[[1]])
#> $female
#> Var1 tf-idf
#> 645 elder 4.886612e-04
#> 1590 putrid 1.127680e-04
#> 2216 westward 1.127680e-04
#> 1063 infect 7.517865e-05
#> 1270 merrier 7.517865e-05
#> 1537 prettiest 7.517865e-05
#> 2036 throughout 7.517865e-05
#> 2245 withdrawn 7.517865e-05
#> 97 amiss 3.758933e-05
#> 262 brave 3.758933e-05
#>
#> $male
#> Var1 tf-idf
#> 82 arch 0.0001059858
#> 145 bigger 0.0001059858
#> 149 black 0.0001059858
#> 163 brave 0.0001059858
#> 362 disinterested 0.0001059858
#> 540 friendliest 0.0001059858
#> 616 hinder 0.0001059858
#> 778 mad 0.0001059858
#> 815 mid 0.0001059858
#> 939 pearl 0.0001059858
Finally, you can subset the output of dfm_analysis by another dictionary or set of terms:
dfm <- rbow::dfm_analysis(corpus = books, phenomenon = phenomena, window = 10, n_terms = 10,
tf_idf = TRUE, filter_ps = TRUE, ps = c("Adjective","Adverb"),
filter_dictionary = descriptors[[4]], own_regex = FALSE)
head(dfm[[4]])
#> $female
#> Var1 tf-idf
#> 246 bitter 0
#> 1015 grievous 0
#> 1913 sad 0
#> NA <NA> NA
#> NA.1 <NA> NA
#> NA.2 <NA> NA
#> NA.3 <NA> NA
#> NA.4 <NA> NA
#> NA.5 <NA> NA
#> NA.6 <NA> NA
#>
#> $male
#> Var1 tf-idf
#> 983 grievous 0.0001128904
#> 239 bitter 0.0000000000
#> 1040 helpless 0.0000000000
#> 1878 sad 0.0000000000
#> NA <NA> NA
#> NA.1 <NA> NA
#> NA.2 <NA> NA
#> NA.3 <NA> NA
#> NA.4 <NA> NA
#> NA.5 <NA> NA
If you simply wish to find out which text is most relevant to a certain dictionary, you can use rbow’s implementation of tf-idf:
pride_and_prejudice_names <- c("bingley","bennet","darcy")
tf_idf <- rbow::tf_idf(corpus = books, terms = pride_and_prejudice_names)
head(tf_idf)
#>                   doc      tf.idf
#> 2   Pride & Prejudice 0.004277573
#> 1 Sense & Sensibility 0.000000000
#> 3      Mansfield Park 0.000000000
#> 4                Emma 0.000000000
#> 5    Northanger Abbey 0.000000000
#> 6          Persuasion 0.000000000
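The scores above follow the usual tf-idf logic: a text scores highly when the terms are frequent in it and rare in the other texts. As a rough sketch of that logic in base R, the example below uses a common textbook weighting on an invented three-document corpus. Treating the whole dictionary as one pseudo-term is an assumption made here for brevity, and rbow's exact weighting may differ.

```r
# invented toy corpus: a named list of token vectors
toy_corpus <- list(
  doc_a = c("darcy", "met", "bennet", "and", "darcy", "left"),
  doc_b = c("emma", "met", "knightley"),
  doc_c = c("anne", "left", "bath")
)
terms <- c("darcy", "bennet")

# Term frequency: share of each document's tokens that match the dictionary
tf <- vapply(toy_corpus, function(tokens) mean(tokens %in% terms), numeric(1))

# Inverse document frequency: log(number of documents / documents with a match)
# (assumes at least one document contains a dictionary term)
idf <- log(length(toy_corpus) / sum(tf > 0))

# Highest scoring document first
scores <- sort(tf * idf, decreasing = TRUE)
scores
```

Only `doc_a` contains the dictionary terms, so it is the only document with a non-zero score, mirroring the pattern in the Pride & Prejudice result.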