class: center, inverse
background-image: url("img/dsieur_pkgs.png")
background-size: contain

# The tidytext package to analyze social media data
"the process of automatically classifying and extracting meaningful information from unstructured text. It involves detecting and interpreting trends and patterns to obtain relevant insights from data" *
.footnote[ *Someone on Google]
.pull-left[
]
.pull-right[
]
We'll use tweet data included in the dataedu package:

```r
raw_tweets <- dataedu::tt_tweets

glimpse(raw_tweets)
```
You can also collect fresh tweets yourself with the rtweet package:

```r
library(rtweet)

# limited to the last 7 days on the free tier
tt <- search_tweets("#TidyTuesday", n = 10000, include_rts = FALSE)
```
Learn more about the rtweet package!
.pull-left[
Right now, our data is 4,418 x 90!

]

.pull-right[

We only need the status_id and text columns:

```r
library(tidyverse)

tweets <- raw_tweets %>%
  # filter for English tweets
  filter(lang == "en") %>%
  select(status_id, text) %>%
  # convert the ID field to character
  mutate(status_id = as.character(status_id))
```

]
A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. This one-token-per-row structure is in contrast to the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix.
To do this, we use the tidytext::unnest_tokens() function.
```r
library(tidytext)

tokens <- tweets %>%
  unnest_tokens(output = word, input = text)

dim(tokens)
#> [1] 131233      2
```
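To see what unnest_tokens() actually does, here is a minimal sketch on a hypothetical two-tweet tibble (invented for illustration, not the real data): each tweet becomes one row per word, lowercased, with punctuation and the # symbol stripped.

```r
library(dplyr)
library(tidytext)

# hypothetical mini dataset with the same columns as our tweets
toy_tweets <- tibble(
  status_id = c("1", "2"),
  text = c("Tidy Tuesday is fun!", "Love #RStats")
)

# one token (word) per row; other columns are carried along
toy_tokens <- toy_tweets %>%
  unnest_tokens(output = word, input = text)

toy_tokens
```

Note that "Tidy" becomes "tidy" and "#RStats" becomes "rstats" — tokenization normalizes the text for us.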
.pull-left[
]
.pull-right[ There are a lot of unnecessary words included, like "a" or "the". These are called stop words.
Thankfully, we can filter those words out by using the stop_words dataset included in the tidytext package.
```r
data(stop_words)

tokens <- tokens %>%
  # note that the variable name "word" was
  # chosen intentionally so that joining
  # would be simpler
  anti_join(stop_words, by = "word")
```

]
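As a quick sanity check, here is the same anti_join() pattern on a hypothetical four-word token list: anti_join() keeps only the rows of the left table whose word does not appear in stop_words.

```r
library(dplyr)
library(tidytext)

# invented token list mixing stop words with content words
toy_tokens <- tibble(word = c("the", "and", "ggplot", "sentiment"))

# drop every row whose word appears in the stop_words dataset
kept <- toy_tokens %>%
  anti_join(stop_words, by = "word")

kept
```

"the" and "and" are filtered out; the content words survive.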
This part is fairly simple. We can use methods from dplyr we already know!
.pull-left[
```r
tokens %>%
  count(word, sort = TRUE)
```
Conclusion: “tidytuesday” represents only 6% of all words ] .pull-right[
```r
tokens %>%
  count(word, sort = TRUE) %>%
  # n as a percent of total words
  mutate(percent = n / sum(n) * 100)
```
]
When human readers approach a text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust.
+ Find the status_ids for tweets that contain positive words and that contain the word dataviz
+ Filter the status_ids for only those with dataviz mentioned
Lexicon: the vocabulary of a person, language, or branch of knowledge
Three general-purpose sentiment lexicons:

+ AFINN, from Finn Årup Nielsen
+ bing, from Bing Liu and collaborators
+ nrc, from Saif Mohammad and Peter Turney

Note: All three of these lexicons are based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment.
```r
library(textdata)

# choosing to use the nrc lexicon
get_sentiments("nrc")
```
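Since get_sentiments() prompts for a one-time interactive download, here is an invented four-word miniature lexicon in the same two-column (word, sentiment) shape, just to show the structure we'll be joining against:

```r
library(dplyr)

# hypothetical mini-lexicon; the real nrc lexicon has
# thousands of rows in exactly this shape
mini_lexicon <- tibble(
  word      = c("love", "fun", "awful", "great"),
  sentiment = c("positive", "positive", "negative", "positive")
)

# keeping only the positive words, as we do with nrc below
mini_positive <- mini_lexicon %>%
  filter(sentiment == "positive")

mini_positive
```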
.pull-left[
```r
nrc_pos <- get_sentiments("nrc") %>%
  filter(sentiment == "positive")

# match to tokens
pos_tokens_count <- tokens %>%
  inner_join(nrc_pos, by = "word") %>%
  # total appearances of positive words
  count(word, sort = TRUE)
```
] .pull-right[
```r
library(dataedu)

pos_tokens_count %>%
  # only words that appear 75 times or more
  filter(n >= 75) %>%
  ggplot(aes(x = reorder(word, -n), y = n)) +
  geom_bar(stat = "identity", fill = dataedu_colors("darkblue")) +
  labs(
    title = "Count of Words Associated with Positivity",
    subtitle = "Tweets with the hashtag #tidytuesday",
    caption = "Data: Twitter and NRC",
    x = "",
    y = "Count"
  ) +
  theme_dataedu()
```

]
Dataviz Words (Step 5)

How many tweets that mention dataviz also had at least one positive word?

Step 5: Create a vector of unique status_ids that contain positive words and that contain the word dataviz
```r
dv_tokens <- tokens %>%
  filter(word == "dataviz") %>%
  # there are several duplicate status ids
  distinct(status_id)
```

```r
pos_tokens <- tokens %>%
  filter(word %in% nrc_pos$word) %>%
  # there are several duplicate status ids
  distinct(status_id)
```
.pull-left[

Filter the tweets for those that mention dataviz and see if that id also contained a positive word:

```r
dv_pos <- tweets %>%
  # only tweets that have a dataviz status_id
  filter(status_id %in% dv_tokens$status_id) %>%
  # is the status_id in our vector of positive-word ids?
  mutate(positive = if_else(status_id %in% pos_tokens$status_id, 1, 0))
```
] .pull-right[
```r
dv_pos %>%
  count(positive) %>%
  mutate(perc = n / sum(n))
```

]
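The flag-then-summarize pattern is easier to see on a hypothetical handful of ids (invented for illustration): three tweets mention dataviz, and two of those ids also appear in the positive-word set.

```r
library(dplyr)

# hypothetical id vectors standing in for dv_tokens and pos_tokens
dv_ids  <- c("a", "b", "c")
pos_ids <- c("b", "c", "d")

flag_summary <- tibble(status_id = dv_ids) %>%
  # 1 if this id also contains a positive word, else 0
  mutate(positive = if_else(status_id %in% pos_ids, 1, 0)) %>%
  count(positive) %>%
  # n as a proportion of all dataviz tweets
  mutate(perc = n / sum(n))

flag_summary
```

Two of the three dataviz ids are positive, so perc for positive == 1 is 2/3.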
.pull-left[ Sometimes it's helpful to pick a random selection of tweets to see how people are talking about #TidyTuesday.
To do this, we will filter our tweet data on just positive words, not just those that have dataviz in them.
```r
pos_tweets <- tweets %>%
  # filter only on tweets with a positive word
  mutate(positive = if_else(status_id %in% pos_tokens$status_id, 1, 0)) %>%
  filter(positive == 1)
```
To get a random sample, we will use the set.seed() and dplyr::sample_n() functions.
Wait...set.seed()
??
+ sets the starting point for R's random number generator
+ ensures reproducible results
]
.pull-right[
```r
set.seed(123)

pos_tweets %>%
  sample_n(size = 10)
```

]
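Here is a base-R sketch of why set.seed() matters: re-running the same "random" draw after setting the same seed reproduces it exactly.

```r
# same seed, same draw: the sample is reproducible
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)
#> [1] TRUE
```

Without the second set.seed(123) call, the two draws would almost certainly differ.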
.pull-left[ + Text Mining with R
+ rtweet Documentation
]
.pull-right[
+ Twitter Developer Portal
]