knitr::opts_chunk$set(
  tidy = TRUE,
  tidy.opts = list(blank = FALSE, width.cutoff = 70),
  cache = TRUE
)

This session

  1. Basic text manipulation tools
    • REGEX
    • stringr
  2. Dictionary-based approach to text analysis
    • tm
    • tidytext
  3. Topic-based approach to text analysis
    • lda

REGEX

| REGEX  | Match(es)   | Notes                          |
|--------|-------------|--------------------------------|
| z      | spoon7      | No matches                     |
| .      | s p o o n 7 | Each character matches         |
| [sp]   | s p oon7    | s and p are matched separately |
| sp     | sp oon7     | sp is matched as a whole       |
| sp\|on | sp o on 7   | Means "or"                     |
| [^sp]  | sp o o n 7  | ^ negates the set [sp]         |
| [m-q]  | s p o o n 7 | A range of characters          |
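
The matches in the table can be checked directly in R; here is a minimal sketch using base R's gregexpr() and regmatches() (the stringr equivalents appear later in this session):

s <- 'spoon7'
regmatches(s, gregexpr('z', s))[[1]]      # character(0): no matches
regmatches(s, gregexpr('.', s))[[1]]      # "s" "p" "o" "o" "n" "7"
regmatches(s, gregexpr('[sp]', s))[[1]]   # "s" "p"
regmatches(s, gregexpr('sp|on', s))[[1]]  # "sp" "on"
regmatches(s, gregexpr('[m-q]', s))[[1]]  # "p" "o" "o" "n"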

Repetitions

| REGEX  | Definition          |
|--------|---------------------|
| a*     | Zero or more a's    |
| a+     | One or more a's     |
| a?     | Zero or one a's     |
| a{5}   | Exactly 5 a's       |
| a{3,}  | 3 or more a's       |
| a{1,3} | Between 1 and 3 a's |

| REGEX  | Definition                             |
|--------|----------------------------------------|
| a+?    | As few a's as possible                 |
| a{2,}? | At least 2 a's, but as few as possible |
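
Greedy vs. lazy matching can be seen with base R as well (a small sketch; perl = TRUE ensures the lazy ? modifier is recognised):

x <- 'aaaa'
regmatches(x, regexpr('a+', x, perl = TRUE))      # "aaaa" -- greedy: as many as possible
regmatches(x, regexpr('a+?', x, perl = TRUE))     # "a"    -- lazy: as few as possible
regmatches(x, regexpr('a{2,}?', x, perl = TRUE))  # "aa"   -- at least 2, but no more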

Special characters

| Character | REGEX | Character | REGEX |
|-----------|-------|-----------|-------|
| tab       | \t    | ^         | \^    |
| new line  | \n    | $         | \$    |
| .         | \.    | [         | \[    |
| \         | \\    | ]         | \]    |
| +         | \+    | (         | \(    |
| *         | \*    | )         | \)    |
| ?         | \?    | /         | \/    |
| {         | \{    | }         | \}    |

cat('\\n')  # prints \n -- the string holds a backslash followed by n, not a newline
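
Because the backslash is also R's string escape character, regex escapes have to be doubled in R source. A short illustration with base R's grepl():

grepl('.', 'price: $5')    # TRUE  -- unescaped '.' matches any character
grepl('\\.', 'price: $5')  # FALSE -- '\\.' matches only a literal period
grepl('\\$', 'price: $5')  # TRUE  -- '\\$' matches the literal dollar sign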

stringr

library(stringr)
s <- 'This is the string to be matched'
r <- 'm'
str_detect(s, r)  # TRUE: 'm' appears somewhere in s

library(stringr)
s <- 'This is the string to be matched'
r <- ' [^ ] '     # a single non-space character surrounded by spaces
str_detect(s, r)  # FALSE: s contains no one-character words
library(stringr)
s <- 'This is the string to be matched'
r <- ' [^ ]+ '    # one or more non-space characters surrounded by spaces
str_detect(s, r)  # TRUE

s <- 'This is the string to be matched'
r <- ' [^ ]+ '
str_match(s, r)[[1]]  # first match: " is "
r <- '(is ){2}'
str_match(s, r)[[1]]  # "is is " -- the group repeated exactly twice
r <- '(is )+?'
str_match(s, r)[[1]]  # "is " -- lazy repetition matches the group as few times as possible

Task

library(stringr)
s <- 'This is hi$ fist'
str_match_all(s, 'is') %>% unlist        # "is" "is" "is"
str_match_all(s, 'i[s\\$]') %>% unlist   # "is" "is" "i$" "is"
str_match_all(s, '[hf]ist?') %>% unlist  # "his" "fist"

Text mining tools

Text mining with tm

  1. Collect -- Organize texts from many sources and formats into a single repository
  2. Clean -- Pre-process the texts to make them easier to work with
  3. Transform -- Generate numerical quantities based on the underlying texts for use in subsequent analysis
  4. Subsequent analysis steps...

The tm package


ufo_text <- 
  reports %>%
  filter(!str_detect(report, '^MADAR')) %>%  # drop reports that begin with 'MADAR'
  .$report

library(tm)
ufo_corp <- Corpus(VectorSource(ufo_text)) %>%  # create corpus
  tm_map(content_transformer(tolower)) %>%      # lower case
  tm_map(removePunctuation) %>%                 # remove punctuation
  tm_map(removeWords, stopwords('english')) %>% # remove common words
  tm_map(stemDocument) %>%                      # reduce to word stems
  tm_map(stripWhitespace)                       # remove extra space 
ufo_corp
ufo_text[1557] %>% str_wrap %>% cat  # an example raw report, for comparison with the cleaned corpus

tdm <- ufo_corp %>% TermDocumentMatrix
# the TDM is a sparse (simple triplet) matrix: i, j, and v hold the term index,
# document index, and count of each non-zero cell
tdm_long <- with(tdm, tibble(term = i, document = j, count = v)) 
tdm_long

inspect(tdm[1:10, 1:6])

inspect(tdm[535:545, 1:6])

tdm_long <- tdm_long %>% mutate(term = factor(term, labels = Terms(tdm)))
tdm_long %>% group_by(term) %>% summarise(count = sum(count)) %>% 
  arrange(desc(count))  

tdm %>% tidytext::tidy()

tdm %>% findFreqTerms(400)  # terms occurring at least 400 times across the corpus

tdm %>% findAssocs("shape", .25) # cor(x,y) >= .25
tdm %>% findAssocs("sky", .25)

More about tidytext

library(tidytext)
tidy_reports <- tibble(report = ufo_text) %>% 
  mutate(id = row_number()) %>%
  unnest_tokens(word, report)
tidy_reports

Removing stop words with tidytext

cleaned_reports <- 
  tidy_reports %>%
  anti_join(get_stopwords(), by = 'word')
cleaned_reports %>%
  count(word, sort = TRUE)

"Bag-of-words" approaches to text tagging

Scoring a term-document matrix

* Sums occurrences of terms within each document (very quickly)
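
A minimal sketch of this idea (the names below, such as term_scores, are hypothetical): align a numeric score for each term with the rows of the TDM, then let matrix algebra sum the scores over each document.

m <- as.matrix(tdm)                       # terms x documents count matrix (fine at this corpus size)
term_scores <- c(light = 1, sky = 1)      # hypothetical per-term scores
score_vec <- term_scores[rownames(m)]     # one score per term, NA where no score exists
score_vec[is.na(score_vec)] <- 0
doc_scores <- as.numeric(crossprod(m, score_vec))  # t(m) %*% score_vec: one total per document
head(sort(doc_scores, decreasing = TRUE))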

Sentiment

Word lists

word_scores_url <- 'http://crr.ugent.be/papers/Ratings_Warriner_et_al.csv'
(words <- readr::read_csv(word_scores_url) %>% 
    select(Word, valence = V.Mean.Sum, arousal = A.Mean.Sum, dominance = D.Mean.Sum))

words %>% gather(variable, value, -Word) %>% 
  group_by(variable) %>% 
  filter(percent_rank(value) > .99) %>% 
  ungroup %>% 
  arrange(Word)

Arousal scores for word stems

(arousal_stems <- words %>% select(Word, arousal) %>% 
   mutate(term = stemDocument(Word)))

Merge scores and TDM

tdm_long %>% inner_join(arousal_stems)

UFO report with highest arousal

d <-
  tdm_long %>% 
  inner_join(arousal_stems) %>% 
  group_by(document) %>% 
  summarise(score = sum(arousal)) %>% 
  arrange(desc(score)) %>% 
  top_n(1) %>% 
  getElement('document')
ufo_text[d] %>% 
  str_wrap(width = 70) %>%
  cat()

UFO reports with "lowest" arousal

d <-
  tdm_long %>% 
  inner_join(arousal_stems) %>% 
  group_by(document) %>% 
  summarise(score = -sum(arousal)) %>% 
  arrange(desc(score)) %>% 
  top_n(5) %>% 
  getElement('document')
ufo_text[d] %>%
  str_wrap(width = 70, exdent = 2) %>%
  str_c('* ', ., collapse = '\n') %>%
  cat()

tidytext approach (without stemming)

cleaned_reports
scored_reports <- 
  words %>% select(word = Word, arousal) %>%
  inner_join(cleaned_reports) %>%
  group_by(id) %>%
  summarise(arousal = sum(arousal))

Bottom 5 based on sum of arousal score

scored_reports %>% top_n(5, -arousal) %>% getElement('id') %>% ufo_text[.] %>% .[1:5] %>% 
  str_trunc(width = 250) %>%
  str_wrap(width = 70, exdent = 2) %>% 
  str_c("* ", .) %>% 
  cat(sep = "\n")

Top 5 based on sum of arousal score

scored_reports %>% top_n(5, arousal) %>% getElement('id') %>% ufo_text[.] %>% .[1:5] %>% 
  str_trunc(width = 250) %>%
  str_wrap(width = 70, exdent = 2) %>% 
  str_c("* ", .) %>% 
  cat(sep = "\n")

Sentences instead of words

d0 <- tibble(report = ufo_text) %>% mutate(id = row_number())
d1 <- d0 %>%
  unnest_tokens(sentence, report, token = "sentences")
d1

d2 <- 
  d1 %>% group_by(id) %>% mutate(sentence_id = row_number()) %>%
  unnest_tokens(word, sentence, token = "words") %>% ungroup()
d2

NRC sentiments

http://sentiment.nrc.ca/lexicons-for-research/

surprise <- 
  readr::read_tsv('~/Dropbox_RSM/Datasets/NRC-Sentiment-Emotion-Lexicons/NRC-Sentiment-Emotion-Lexicons/NRC-Emotion-Lexicon-v0.92/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt', col_names = c('word', 'affect', 'association')) %>%
  filter(affect %in% c('surprise') & association == 1) %>%
  transmute(word, surprise = 1L)
head(surprise)
top_5_surprise <- 
  d2 %>%
  left_join(surprise, by = 'word') %>%
  mutate(surprise = coalesce(surprise, 0L)) %>%
  group_by(id, sentence_id) %>%
  summarise(score = mean(surprise)) %>% ungroup() %>%
  group_by(id) %>%
  summarise(score = median(score)) %>%
  inner_join(d0, by = 'id') %>%
  top_n(5, score) %>%
  arrange(desc(score))

Top 5 based on surprise

top_5_surprise %>%
  getElement('report') %>%
  # str_trunc(width = 250) %>%
  str_wrap(width = 70, exdent = 2) %>% 
  str_c("* ", .) %>% 
  cat(sep = "\n")

Latent Dirichlet allocation

txt <- c('Abe is funny',
         'Abe is fat',
         'Bob is fat',
         'Cam is fat',
         'Cam is funny',
         'Cam is short',
         'Dee is funny')
library(lda)
lc <- 
  txt %>% 
  removeWords(stopwords()) %>% 
  stripWhitespace %>% 
  lexicalize  # create a special structure used by the lda package
K <- 4  # number of topics to detect
set.seed(1234)
f1 <- lda.collapsed.gibbs.sampler(lc$documents, K, lc$vocab, 500, .5, .1)  # 500 iterations, alpha = .5, eta = .1
(top.words <- top.topic.words(f1$topics, 1, by.score = TRUE))
K <- 2  # number of topics to detect
set.seed(1234)
f1 <- lda.collapsed.gibbs.sampler(lc$documents, K, lc$vocab, 500, .5, .1)
(top.words <- top.topic.words(f1$topics, 1, by.score = TRUE))
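
To see how the sampler assigned the toy sentences, inspect document_sums, which the sampler also returns: rows are topics, columns are documents, and each entry counts the words in that document assigned to that topic.

f1$document_sums                        # topics x documents word-assignment counts
apply(f1$document_sums, 2, which.max)   # most heavily loaded topic for each sentence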


lda

library(lda)
lda_corp <-
  ufo_corp %>% 
  plyr::laply(as.character) %>% 
  str_replace('(^| )[0-9]( |$)', ' [number] ') %>% 
  str_replace('^ ', '') %>% 
  lexicalize
K <- 14  # number of topics to detect
set.seed(1234)
fit <-
  lda.collapsed.gibbs.sampler(lda_corp$documents, K, lda_corp$vocab, 500, .5, .1)
(top.words <- top.topic.words(fit$topics, 4, by.score = TRUE))

get_topic_names <- function(fit) {
  # code copied from lda::top.topic.words:
  normalized.topics <- fit$topics / (rowSums(fit$topics) + 1e-05)
  scores <- apply(normalized.topics, 2, function(x) x * (log(x + 1e-05) - sum(log(x + 1e-05)) / length(x)))
  # end
  cutoff <- min(apply(scores, 1, function(x) max(x)))
  scores %>% 
    as.data.frame %>% 
    mutate(f = 1:nrow(.)) %>% 
    gather(key, value, -f) %>% 
    filter(value >= cutoff) %>% 
    group_by(f) %>% 
    arrange(f, desc(value)) %>% 
    summarise(name = paste(key, collapse = ' ')) %>% 
    getElement('name')
}
get_topic_names(fit)

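The chunk that defined the exemp_tab() helper did not survive conversion. The sketch below is a hypothetical reconstruction, assuming the helper tabulates each topic's top words alongside one exemplary report, built from lda's top.topic.words() and top.topic.documents():

exemp_tab <- function(fit, n_words = 5) {
  # hypothetical reconstruction of the lost helper
  top_words <- apply(top.topic.words(fit$topics, n_words, by.score = TRUE),
                     2, paste, collapse = ' ')
  exemplars <- top.topic.documents(fit$document_sums, num.documents = 1)[1, ]
  tibble(topic = seq_along(top_words),
         top_words = top_words,
         exemplar = str_trunc(ufo_text[exemplars], 80))
}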

K <- 5
set.seed(1234)
fit <- lda.collapsed.gibbs.sampler(lda_corp$documents, K, lda_corp$vocab, 500, .5, .1)
exemp_tab(fit) 

ufo_corp_with_stops_no_stemming <- Corpus(VectorSource(ufo_text)) %>%  # create corpus
  tm_map(content_transformer(tolower)) %>%      # lower case
  tm_map(removePunctuation) %>%                 # remove punctuation
  tm_map(stripWhitespace)                       # remove extra space 
lda_corp_with_stops_no_stemming <- 
  ufo_corp_with_stops_no_stemming %>% 
  plyr::laply( as.character ) %>% 
  str_replace('(^| )[0-9]( |$)', ' [number] ') %>% 
  str_replace('^ ','') %>% 
  lexicalize

K <- 10
set.seed(1234)
fit <- lda.collapsed.gibbs.sampler(lda_corp_with_stops_no_stemming$documents,
    K, lda_corp_with_stops_no_stemming$vocab, 500, .5, .1)
exemp_tab(fit) 

ufo_corp <- Corpus(VectorSource(ufo_text)) %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removePunctuation) %>% 
  tm_map(removeWords, stopwords('english')) %>%
  # tm_map(stemDocument) %>%
  tm_map(stripWhitespace) 

lda_corp <- ufo_corp %>% plyr::laply( as.character ) %>% 
  str_replace('(^| )[0-9]( |$)', ' [number] ') %>% str_replace('^ ','') %>% lexicalize

K <- 10 
set.seed(1234)
fit <- lda.collapsed.gibbs.sampler(lda_corp$documents, K, lda_corp$vocab, 500, .5, .1)
exemp_tab(fit) 

The topicmodels package

library(topicmodels)
control <- list(seed = 1, iter = 1500, burnin = 500)
dtm <- DocumentTermMatrix(ufo_corp)
set.seed(1234)
fit2 <- LDA(dtm, k = 10, method = 'Gibbs', control = control)

Output from topicmodels

post <- posterior(fit2)
output <- 
  post$terms %>% 
  as_tibble %>%
  mutate(topic = row_number()) %>%
  gather(term, value, -topic) %>%
  group_by(topic) %>%
  top_n(5, value) %>%
  ungroup %>%
  arrange(topic, desc(value), term) %>%
  group_by(topic) %>%
  summarise(terms = paste(term, collapse = ', '))

output
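
The document-level view is available too: topicmodels' topics() returns the most likely topic for each report, and posterior() gives the full document-topic distribution (a quick check using fit2 from above):

head(topics(fit2), 10)         # most likely topic for the first ten reports
posterior(fit2)$topics[1:5, ]  # document-topic probabilities for the first five reports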

Task


