knitr::opts_chunk$set(
  # code chunk options
  echo = TRUE,
  eval = TRUE,
  warning = FALSE,
  message = FALSE,
  cache = FALSE,
  exercise = TRUE,
  exercise.completion = TRUE,
  # figs
  fig.align = "center",
  fig.height = 4,
  fig.width = 5.5
)
library(learnr)
library(learn2scrape)
In this tutorial, you will learn how to download data from the Twitter Streaming API using the rtweet package.
We will use the following R packages:
# to access the Twitter APIs
library(rtweet)
# data wrangling
library(jsonlite)
library(dplyr)
library(tidyr)
library(stringr)
Make sure that you have your Twitter API credentials ready:
# ToDo: specify path to your secrets JSON file
fp <- file.path(...)
credentials <- fromJSON(fp)
token <- do.call(create_token, credentials)
credentials <- fromJSON(system.file("extdata", "tw_credentials.json", package = "learn2scrape"))
token <- do.call(create_token, credentials)
Note: If you don't have credentials yet, first go through the steps described in the tutorial "103-twitter-setup" in the learn2scrape package: learnr::run_tutorial("103-twitter-setup", package = "learn2scrape")
To collect tweets as they are sent out, we can use the stream_tweets() function. By default, stream_tweets() downloads a random sample of all publicly available tweets.
It has the following parameters:
file_name indicates the file (path) where the tweets will be written on your local system.
timeout is the number of seconds that the connection will remain open. If you set it to FALSE, it will stream indefinitely until the rate limit is reached.
parse specifies whether the tweets should be parsed from JSON. By default, it is TRUE. If you collect larger amounts of data, your script will run faster if you set parse = FALSE, because this skips the JSON parsing step before writing to file_name.

Now, we can collect tweets for 5 seconds.
# collect for 5 seconds
resp <- stream_tweets(file_name = "tweets.json", timeout = 5, parse = FALSE)

# read from disk and parse
tweets <- parse_stream("tweets.json")

# inspect
nrow(tweets) # number of downloaded tweets
range(tweets$created_at) # time range of downloaded tweets
Note: We write the JSON to a file called "tweets.json" because in learnr tutorials, each code chunk has its own temporary directory that is deleted (including its contents) after execution. Outside of such a tutorial, however, you can pass any file path constructed with file.path().
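For illustration, a minimal sketch of what that could look like; the file name and location below are just placeholders, pick whatever suits your project.

# write the raw stream to a file in the current working directory ("my_tweets.json" is a placeholder)
out_file <- file.path(getwd(), "my_tweets.json")
resp <- stream_tweets(file_name = out_file, timeout = 5, parse = FALSE)
tweets <- parse_stream(out_file)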
There are multiple ways to use the streaming API (sketched briefly below):

Sampling: a small random sample of all publicly available tweets (that is what we did above!)
Filtering: via a search-like query (up to 400 keywords)
Tracking: via a vector of user IDs (up to 5,000 user IDs)
Location: via geo coordinates
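To make the mapping onto the q argument concrete, here is a minimal sketch of all four variants; the keywords, user IDs, and coordinates are placeholders, and the 5-second timeouts are only for illustration.

# random sample of all public tweets (the default, q = "")
sampled <- stream_tweets(q = "", timeout = 5)

# filter by keywords (comma-separated search terms)
filtered <- stream_tweets(q = "news,politics", timeout = 5)

# track tweets by specific accounts (comma-separated user IDs; the IDs here are made up)
tracked <- stream_tweets(q = "2899773086,818910970567344128", timeout = 5)

# tweets from a geographic bounding box (longitude/latitude of the southwest and northeast corners)
located <- stream_tweets(q = c(-125, 26, -65, 49), timeout = 5)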
To filter by keyword, we pass our search term to the query parameter q:
tweets <- stream_tweets(q = "news", timeout = 5)
nrow(tweets)
head(tweets$text)
We could also provide a list of users (user IDs or screen names). However, this makes much more sense when looking at timelines and searching for previous tweets. We will do this in the next exercise.
The second example shows how to collect tweets filtered by geographic location instead. In other words, we can define a geographical bounding box and collect only the tweets that are sent from that area. We can then inspect the collected tweets directly in R.
For example, imagine we want to collect tweets from the United States. The way to do it is to find two pairs of coordinates (longitude and latitude) that indicate the southwest corner AND the northeast corner. Note the reverse order: it's not (lat, long), but (long, lat)!
In the case of the US, it would be approx. (-125, 26) and (-65, 49).
How do we find these coordinates? We can use https://getlatlong.net/. (If you have a Google Maps API key, you can also use the lookup_coords() function built into rtweet.)
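A brief sketch of that route, assuming rtweet's built-in coordinates for "usa" (no API key needed for this particular query) and that the returned coords object is accepted by stream_tweets() in place of the numeric vector:

usa_coords <- lookup_coords("usa") # bounding box for the United States, built into rtweet
usa_tweets <- stream_tweets(q = usa_coords, timeout = 5)

With the explicit bounding box, the call looks like this: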
usa_tweets <- stream_tweets(q = c(-125, 26, -65, 49), timeout = 5)
nrow(usa_tweets)
head(usa_tweets$text)
Note that tweets can carry different types of geographic information: some of it comes from geo-located tweets, some from tweets with place information.
rtweet has a function called lat_lng() that uses whatever geographic information is available to construct latitude and longitude variables. We will work with whatever is available.
usa_tweets <- stream_tweets(q = c(-125, 26, -65, 49), timeout = 5)
usa_tweets <- lat_lng(usa_tweets)

# plot lat and lng points onto state map
maps::map("state", lwd = .25)
with(usa_tweets, points(lng, lat, pch = 20, cex = .75, col = rgb(0, .3, .7, .75)))
We now do some basic text analysis. This is not the focus of this class and you might want to do this differently, depending on which package you usually work with.
For example, we can ask: what are the most popular hashtags at the moment? We will use regular expressions to extract hashtags.
The function str_extract_all() in the stringr package extracts one or several matches from a character vector. The pattern "#\w+" is a regular expression. Specifically, it matches a hashtag symbol followed by one or more uninterrupted word characters, that is, digits, (upper- or lowercase) Latin letters, and underscores. Since str_extract_all() returns a list of character vectors (one list element per input character value), we have to unlist the return object. Finally, to get the most popular hashtags, we tabulate the resulting vector and sort the counts in decreasing order.
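As a quick illustration of the pattern on a made-up example string:

str_extract_all("Check out #rstats and #TextAnalysis!", "#\\w+")
# returns a list with one element: the character vector c("#rstats", "#TextAnalysis")

Applied to freshly collected tweets, the full analysis looks like this: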
# collect for 5 seconds
resp <- stream_tweets(file_name = "tweets.json", timeout = 5, parse = FALSE)

# read from disk and parse
tweets <- parse_stream("tweets.json")

# extract hashtags
ht <- str_extract_all(tweets$text, "#\\w+")
ht <- unlist(ht)

# tabulate 6 most frequent ones
head(sort(table(ht), decreasing = TRUE))
Similar analyses could be implemented for the following questions:
Who are the most frequently mentioned users?
We again use a regular expression and str_extract_all(). Our search pattern is similar, but it starts with an @ so that we find mentions, and this time we spell out the allowed characters explicitly: digits, Latin letters, and underscores.
# collect for 5 seconds
resp <- stream_tweets(file_name = "tweets.json", timeout = 5, parse = FALSE)

# read from disk and parse
tweets <- parse_stream("tweets.json")

# extract mentions
mentions <- str_extract_all(tweets$text, '@[0-9_A-Za-z]+')
mentions <- unlist(mentions)

# report 10 most frequently mentioned accounts
head(sort(table(mentions), decreasing = TRUE), n = 10)
How many tweets mention Joe Biden?
We try to detect tweets that mention either 'Biden' or 'biden' using str_detect() and sum them up.
# collect for 5 seconds
resp <- stream_tweets(file_name = "tweets.json", timeout = 5, parse = FALSE)

# read from disk and parse
tweets <- parse_stream("tweets.json")

# count number of times the terms 'biden'/'Biden' occur
sum(str_detect(tweets$text, "[Bb]iden"))
These are toy examples, but for large files with tweets in JSON format, there might be faster ways to parse the data. For example, the jsonlite package specializes in parsing JSON data.
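For instance, since the streaming API returns newline-delimited JSON, a file written with parse = FALSE can presumably also be read with jsonlite's stream_in(), which parses such files chunk by chunk. A minimal sketch (note that the resulting data frame keeps Twitter's raw, partly nested field names rather than rtweet's column layout):

library(jsonlite)

# read the newline-delimited JSON written by stream_tweets(..., parse = FALSE)
raw_tweets <- stream_in(file("tweets.json"), verbose = FALSE)
nrow(raw_tweets)
names(raw_tweets) # raw Twitter field names, e.g. created_at, text, user, ...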