knitr::opts_chunk$set(
  # code chunk options
  echo = TRUE,
  eval = TRUE,
  warning = FALSE,
  message = FALSE,
  cache = FALSE,
  exercise = TRUE,
  exercise.completion = TRUE,
  # figs
  fig.align = "center",
  fig.height = 4,
  fig.width = 5.5
)
library(learnr)
library(learn2scrape)
In this tutorial, you will learn how to download data from the Twitter Streaming API using the rtweet package.
We will use the following R packages:
# to access the Twitter APIs
library(rtweet)
# data wrangling
library(jsonlite)
library(dplyr)
library(tidyr)
library(stringr)
Make sure that you have your Twitter API credentials ready:
# ToDo: specify path to your secrets JSON file
fp <- file.path(...)
credentials <- fromJSON(fp)
token <- do.call(create_token, credentials)
credentials <- fromJSON(system.file("extdata", "tw_credentials.json", package = "learn2scrape"))
token <- do.call(create_token, credentials)
Note: If you don't have credentials yet, first go through the steps described in the tutorial "103-twitter-setup" in the learn2scrape package: learnr::run_tutorial("103-twitter-setup", package = "learn2scrape")
To collect tweets as they are sent out, we can use the stream_tweets() function. By default, stream_tweets() downloads a random sample of all publicly available tweets.
It has the following parameters:
file_name indicates the file (path) where the tweets will be written on your local system.
timeout is the number of seconds that the connection will remain open. If you set it to FALSE, it will stream indefinitely until the rate limit is reached.
parse specifies whether the tweets should be parsed from JSON. By default, it is TRUE. If you collect larger amounts of data, your script will run faster if you set parse = FALSE, because this skips the JSON parsing step before writing to file_name.

Now, we can collect tweets for 5 seconds.
# collect for 5 seconds
resp <- stream_tweets(file_name = "tweets.json", timeout = 5, parse = FALSE)

# read from disk and parse
tweets <- parse_stream("tweets.json")

# inspect
nrow(tweets) # number of downloaded tweets
range(tweets$created_at) # time range of downloaded tweets
Note: We write the JSON to a file called "tweets.json" because in learnr tutorials, each code chunk has its own temporary directory that is deleted (including its contents) after execution. Outside of such a tutorial, however, you can pass any file path constructed with file.path().
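For illustration, a minimal sketch of what that could look like; the file name and location below are just placeholders, pick whatever suits your project.

# write the raw stream to a file in the current working directory ("my_tweets.json" is a placeholder)
out_file <- file.path(getwd(), "my_tweets.json")
resp <- stream_tweets(file_name = out_file, timeout = 5, parse = FALSE)
tweets <- parse_stream(out_file)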
There are multiple ways to use the streaming API (sketched briefly below):

Sampling: a small random sample of all publicly available tweets (that is what we did above!)
Filtering: via a search-like query (up to 400 keywords)
Tracking: via a vector of user IDs (up to 5,000 user IDs)
Location: via geo coordinates
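To make the mapping onto the q argument concrete, here is a minimal sketch of all four variants; the keywords, user IDs, and coordinates are placeholders, and the 5-second timeouts are only for illustration.

# random sample of all public tweets (the default, q = "")
sampled <- stream_tweets(q = "", timeout = 5)

# filter by keywords (comma-separated search terms)
filtered <- stream_tweets(q = "news,politics", timeout = 5)

# track tweets by specific accounts (comma-separated user IDs; the IDs here are made up)
tracked <- stream_tweets(q = "2899773086,818910970567344128", timeout = 5)

# tweets from a geographic bounding box (longitude/latitude of the southwest and northeast corners)
located <- stream_tweets(q = c(-125, 26, -65, 49), timeout = 5)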
To filter by keyword, we pass our search term to the query parameter q:
tweets <- stream_tweets(q = "news", timeout = 5)
nrow(tweets)
head(tweets$text)
We could also provide a list of users (user IDs or screen names). However, this makes much more sense when looking at timelines and searching for previous tweets. We will do this in the next exercise.
The second example shows how to collect tweets filtered by geographic location instead. In other words, we can define a geographical bounding box and collect only the tweets that are sent from that area. We can then inspect the collected tweets directly in R.
For example, imagine we want to collect tweets from the United States. The way to do it is to find two pairs of coordinates (longitude and latitude) that indicate the southwest corner AND the northeast corner. Note the reverse order: it's not (lat, long), but (long, lat)!
In the case of the US, it would be approx. (-125, 26) and (-65, 49).
How do we find these coordinates? We can use https://getlatlong.net/. (If you have a Google Maps API key, you can also use the lookup_coords() function built into rtweet.)
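A brief sketch of that route, assuming rtweet's built-in coordinates for "usa" (no API key needed for this particular query) and that the returned coords object is accepted by stream_tweets() in place of the numeric vector:

usa_coords <- lookup_coords("usa") # bounding box for the United States, built into rtweet
usa_tweets <- stream_tweets(q = usa_coords, timeout = 5)

With the explicit bounding box, the call looks like this: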
usa_tweets <- stream_tweets(q = c(-125, 26, -65, 49), timeout = 5)
nrow(usa_tweets)
head(usa_tweets$text)
Note that tweets can carry different types of geographic information: some of it comes from geo-located tweets, some from tweets with place information.
rtweet has a function called lat_lng() that uses whatever geographic information is available to construct latitude and longitude variables. We will work with whatever is available.
usa_tweets <- stream_tweets(q = c(-125, 26, -65, 49), timeout = 5)
usa_tweets <- lat_lng(usa_tweets)

# plot lat and lng points onto state map
maps::map("state", lwd = .25)
with(usa_tweets, points(lng, lat, pch = 20, cex = .75, col = rgb(0, .3, .7, .75)))
We now do some basic text analysis. This is not the focus of this class and you might want to do this differently, depending on which package you usually work with.
For example, we can ask: what are the most popular hashtags at the moment? We will use regular expressions to extract hashtags.
The function str_extract_all() in the stringr package extracts one or several matches from a character vector. The pattern "#\w+" is a regular expression. Specifically, it matches a hashtag symbol followed by one or more uninterrupted word characters, that is, digits, (upper- or lowercase) Latin letters, and underscores. Since str_extract_all() returns a list of character vectors (one list element per input character value), we have to unlist the return object. Finally, to get the most popular hashtags, we tabulate the resulting vector and sort the counts in decreasing order.
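As a quick illustration of the pattern on a made-up example string:

str_extract_all("Check out #rstats and #TextAnalysis!", "#\\w+")
# returns a list with one element: the character vector c("#rstats", "#TextAnalysis")

Applied to freshly collected tweets, the full analysis looks like this: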
# collect for 5 seconds
resp <- stream_tweets(file_name = "tweets.json", timeout = 5, parse = FALSE)

# read from disk and parse
tweets <- parse_stream("tweets.json")

# extract hashtags
ht <- str_extract_all(tweets$text, "#\\w+")
ht <- unlist(ht)

# tabulate 6 most frequent ones
head(sort(table(ht), decreasing = TRUE))
Similar analyses could be implemented for the following questions:
Who are the most frequently mentioned users?
We again use a regular expression and str_extract_all(). Our search pattern is similar, but it starts with an @ so that we find mentions, and this time we spell out the allowed characters explicitly: digits, Latin letters, and underscores.
# collect for 5 seconds
resp <- stream_tweets(file_name = "tweets.json", timeout = 5, parse = FALSE)

# read from disk and parse
tweets <- parse_stream("tweets.json")

# extract mentions
mentions <- str_extract_all(tweets$text, '@[0-9_A-Za-z]+')
mentions <- unlist(mentions)

# report 10 most frequently mentioned accounts
head(sort(table(mentions), decreasing = TRUE), n = 10)
How many tweets mention Joe Biden?
We try to detect tweets that mention either 'Biden' or 'biden' using str_detect() and sum them up.
# collect for 5 seconds
resp <- stream_tweets(file_name = "tweets.json", timeout = 5, parse = FALSE)

# read from disk and parse
tweets <- parse_stream("tweets.json")

# count number of times the terms 'biden'/'Biden' occur
sum(str_detect(tweets$text, "[Bb]iden"))
These are toy examples, but for large files with tweets in JSON format, there might be faster ways to parse the data. For example, the jsonlite package specializes in parsing JSON data.
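For instance, since the streaming API returns newline-delimited JSON, a file written with parse = FALSE can presumably also be read with jsonlite's stream_in(), which parses such files chunk by chunk. A minimal sketch (note that the resulting data frame keeps Twitter's raw, partly nested field names rather than rtweet's column layout):

library(jsonlite)

# read the newline-delimited JSON written by stream_tweets(..., parse = FALSE)
raw_tweets <- stream_in(file("tweets.json"), verbose = FALSE)
nrow(raw_tweets)
names(raw_tweets) # raw Twitter field names, e.g. created_at, text, user, ...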