knitr::opts_chunk$set(
  # collapse = TRUE,
  fig.align = "center",
  comment = "#>",
  fig.path = "man/figures/",
  message = FALSE,
  warning = FALSE
)

options(width = 150)


Introduction

{tweetio}'s goal is to enable safe, efficient I/O and transformation of Twitter data. Whether the data came from the Twitter API, a database dump, or some other source, {tweetio}'s job is to get them into R and ready for analysis.

{tweetio} is not a competitor to {rtweet}: it is not interested in collecting Twitter data. That said, it definitely attempts to complement {rtweet} by emulating its data frame schema because...

  1. It's incredibly easy to use.
  2. It's far more efficient to analyze than a key-value structure that mirrors the raw data.
  3. It'd be a waste not to maximize compatibility with tools built specifically around {rtweet}'s data frames.

Installation

You'll need a C++ compiler. If you're using Windows, that means Rtools.

if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")

remotes::install_github("knapply/tweetio")

Usage

library(tweetio)

{tweetio} uses {data.table} internally for performance and stability reasons, but if you're a {tidyverse} fan who's accustomed to dealing with tibbles, you can set an option so that tibbles are always returned.

Because tibbles have an incredibly informative and user-friendly print() method, we'll set the option for these examples. Note that if the {tibble} package is not installed, this option is ignored.

options(tweetio.as_tibble = TRUE)

You can check on all available {tweetio} options using tweetio_options().

tweetio_options()

Simple Example

First, we'll save a stream of tweets using rtweet::stream_tweets().

temp_file <- tempfile(fileext = ".json")
rtweet::stream_tweets(timeout = 15, parse = FALSE,
                      file_name = temp_file)

We can then pass the file path to tweetio::read_tweets() to efficiently parse the data into an {rtweet}-style data frame.

tiny_rtweet_stream <- read_tweets(temp_file)
tiny_rtweet_stream

Performance

rtweet::parse_stream() is perfectly sufficient for smaller files (as long as the streamed data are valid JSON), but tweetio::read_tweets() is much faster.

small_rtweet_stream <- "inst/example-data/api-stream-small.json.gz"

res <- bench::mark(
  rtweet = rtweet::parse_stream(small_rtweet_stream),
  tweetio = tweetio::read_tweets(small_rtweet_stream),
  check = FALSE,
  filter_gc = FALSE
)

res[, 1:9]

With bigger files, using rtweet::parse_stream() is no longer realistic, especially if some of the JSON lines are invalid.

# big_tweet_stream_path <- "~/ufc-tweet-stream.json.gz"
# 
# raw_lines <- readLines(big_tweet_stream_path)
# valid_lines <- purrr::map_lgl(raw_lines,
#                               ~ jsonify::validate_json(.x) && jsonlite::validate(.x))
# writeLines(raw_lines[valid_lines][1:100000], "inst/example-data/ufc-tweet-stream.json")
# R.utils::gzip("inst/example-data/ufc-tweet-stream.json",
#               "inst/example-data/ufc-tweet-stream.json.gz",
#               remove = FALSE, overwrite = TRUE)
big_tweet_stream_path <- "inst/example-data/ufc-tweet-stream.json.gz"

temp_file <- tempfile(fileext = ".json")
R.utils::gunzip(big_tweet_stream_path, destname = temp_file, remove = FALSE)

c(`compressed MB` = file.size(big_tweet_stream_path) / 1e6,
  `decompressed MB` = file.size(temp_file) / 1e6)
res <- bench::mark(
  rtweet = rtweet_df <- rtweet::parse_stream(big_tweet_stream_path),
  tweetio = tweetio_df <- tweetio::read_tweets(big_tweet_stream_path),
  filter_gc = FALSE,
  check = FALSE,
  iterations = 1
)

res[, 1:9]

Not only is tweetio::read_tweets() more efficient in time and memory usage, it's able to successfully parse much more of the data.

# compare the dimensions of the two parsed data frames
`rownames<-`(
  vapply(list(tweetio_df = tweetio_df, rtweet_df = rtweet_df), dim, integer(2L)),
  c("nrow", "ncol")
)

Data Dumps

A common practice for handling social media data at scale is to store them in search engine databases like Elasticsearch, but it's (unfortunately) possible that you'll need to work with data dumps.

I've encountered two flavors of these dumps (which may arrive as GZIP files or ZIP archives):

  1. .jsonl: newline-delimited JSON
  2. .json: the complete contents of a database dump packed in a JSON array

Working with these dumps has three unfortunate consequences:

  1. Packages that were purpose-built to work directly with {rtweet}'s data frames can't play along with your data.
  2. You're going to waste most of your time (and memory) getting data into R that you're not going to use.
  3. The data are very tedious to restructure in R (lists of lists of lists of lists of lists...).

{tweetio} solves this by parsing everything and building the data frames at the C++ level, including handling GZIP files and ZIP archives for you.
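
For example, a compressed dump can be read exactly like an API stream. A minimal sketch (the file path here is hypothetical):

# hypothetical dump file; GZIP/ZIP decompression is handled for you
dump_df <- read_tweets("elasticsearch-dump.jsonl.gz")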

Spatial Tweets

If you have {sf} installed, you can use as_tweet_sf() to keep only those tweets that contain valid bounding box polygons or points.

tweet_sf <- as_tweet_sf(tweetio_df)
tweet_sf[, "geometry"]

There are currently four columns that can potentially hold spatial geometries:

  1. "bbox_coords"
  2. "quoted_bbox_coords"
  3. "retweet_bbox_coords"
  4. "geo_coords"

You can select which one to use to build your sf object via the geom_col= parameter (default: "bbox_coords").

as_tweet_sf(tweetio_df,
            geom_col = "quoted_bbox_coords")[, "geometry"]

You can also build all the supported bounding boxes by setting geom_col= to "all".

all_bboxes <- as_tweet_sf(tweetio_df, geom_col = "all")
all_bboxes[, c("which_geom", "geometry")]

From there, you can easily use the data like any other {sf} object.

library(ggplot2)

world <- rnaturalearth::ne_countries(returnclass = "sf")
world <- world[world$continent != "Antarctica", ]

ggplot(all_bboxes) +
  geom_sf(fill = "white", color = "lightgray", data = world) +
  geom_sf(aes(fill = which_geom, color = which_geom), 
          alpha = 0.15, size = 1, show.legend = TRUE) +
  coord_sf(crs = 3857) +
  scale_fill_viridis_d() +
  scale_color_viridis_d() +
  theme(legend.title = element_blank(), legend.position = "top",
        panel.background = element_rect(fill = "#daf3ff"))

Tweet Networks

If you want to analyze tweet networks and have {igraph} or {network} installed, you can get started immediately using tweetio::as_tweet_igraph() or tweetio::as_tweet_network().

tweet_df <- tweetio_df[1:1e4, ]

as_tweet_igraph(tweet_df)
as_tweet_network(tweet_df)

If you want to take advantage of all the metadata available, you can set all_status_data and/or all_user_data to TRUE.

as_tweet_igraph(tweet_df,
                all_user_data = TRUE, all_status_data = TRUE)
as_tweet_network(tweet_df,
                 all_user_data = TRUE, all_status_data = TRUE)

Two-Mode Networks

You can also build two-mode networks by setting target_class to "hashtag", "url", or "media".

Bipartite graphs are always returned as undirected, as the quick check below confirms.

Users to Hashtags

as_tweet_igraph(tweet_df, target_class = "hashtag")
as_tweet_network(tweet_df, target_class = "hashtag")
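
A quick sanity check with {igraph} (a minimal sketch; it assumes {igraph} is installed):

g2 <- as_tweet_igraph(tweet_df, target_class = "hashtag")
igraph::is_directed(g2)  # FALSE: bipartite graphs come back undirected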

Users to URLs

as_tweet_igraph(tweet_df, target_class = "url")
as_tweet_network(tweet_df, target_class = "url")

Users to Media

as_tweet_igraph(tweet_df, target_class = "media")
as_tweet_network(tweet_df, target_class = "media")

<proto_net>

You're not stuck going directly to <igraph>s or <network>s, though. Under the hood, as_tweet_igraph() and as_tweet_network() use as_proto_net() to build a <proto_net>: a list of edge and node data frames.

as_proto_net(tweetio_df,
             all_status_data = TRUE, all_user_data = TRUE)
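
From there you can roll your own graph objects. A minimal sketch, assuming the returned list's elements are named edges and nodes:

proto <- as_proto_net(tweetio_df)

# hand the edge and node data frames straight to {igraph} ourselves
g <- igraph::graph_from_data_frame(
  d = proto$edges, vertices = proto$nodes
)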

Progress

Supported Data Inputs

Supported Data Outputs

Structures

Shout Outs

The {rtweet} package spoils R users rotten, in the best possible way. The underlying data carpentry is so seamless that the user doesn't need to know anything about the horrors of Twitter data, which is pretty amazing. If you use {rtweet}, you probably owe Michael Kearney some citations.

{tweetio} uses a combination of C++ (via {Rcpp}), the rapidjson C++ library (made available by {rapidjsonr}), {jsonify} for an R-level interface to rapidjson, {RcppProgress}, and R's not-so-secret super weapon: {data.table}.

Major inspiration was taken from {ndjson}, particularly its use of Gzstream.

Environment

sessionInfo()

