{tweetio}'s goal is to enable safe, efficient I/O and transformation of Twitter data. Whether the data came from the Twitter API, a database dump, or some other source, {tweetio}'s job is to get them into R and ready for analysis.
{tweetio} is not a competitor to {rtweet}: it is not interested in collecting Twitter data. That said, it attempts to complement {rtweet} by emulating the schema of its data frames.

You'll need a C++ compiler. If you're using Windows, that means Rtools.
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("knapply/tweetio")
library(tweetio)
{tweetio} uses {data.table} internally for performance and stability reasons, but if you're a {tidyverse} fan who's accustomed to dealing with tibbles, you can set an option so that tibbles are always returned.

Because tibbles have an incredibly informative and user-friendly print() method, we'll set the option for these examples. Note that if the {tibble} package is not installed, this option is ignored.
options(tweetio.as_tibble = TRUE)
You can check on all available {tweetio} options using tweetio_options().
tweetio_options()
First, we'll save a stream of tweets using rtweet::stream_tweets().
temp_file <- tempfile(fileext = ".json")
rtweet::stream_tweets(timeout = 15, parse = FALSE, file_name = temp_file)
We can then pass the file path to tweetio::read_tweets() to efficiently parse the data into an {rtweet}-style data frame.
tiny_rtweet_stream <- read_tweets(temp_file)
tiny_rtweet_stream
rtweet::parse_stream() is totally sufficient for smaller files (as long as the returned data are valid JSON), but tweetio::read_tweets() is much faster.
small_rtweet_stream <- "inst/example-data/api-stream-small.json.gz"

res <- bench::mark(
  rtweet = rtweet::parse_stream(small_rtweet_stream),
  tweetio = tweetio::read_tweets(small_rtweet_stream),
  check = FALSE, filter_gc = FALSE
)

res[, 1:9]
With bigger files, using rtweet::parse_stream() is no longer realistic, especially if the JSON is invalid.
# big_tweet_stream_path <- "~/ufc-tweet-stream.json.gz"
#
# raw_lines <- readLines(big_tweet_stream_path)
# valid_lines <- purrr::map_lgl(raw_lines,
#                               ~ jsonify::validate_json(.x) && jsonlite::validate(.x))
# writeLines(raw_lines[valid_lines][1:100000], "inst/example-data/ufc-tweet-stream.json")
# R.utils::gzip("inst/example-data/ufc-tweet-stream.json",
#               "inst/example-data/ufc-tweet-stream.json.gz",
#               remove = FALSE, overwrite = TRUE)
big_tweet_stream_path <- "inst/example-data/ufc-tweet-stream.json.gz"

temp_file <- tempfile(fileext = ".json")
R.utils::gunzip(big_tweet_stream_path, destname = temp_file, remove = FALSE)

c(`compressed MB` = file.size(big_tweet_stream_path) / 1e6,
  `decompressed MB` = file.size(temp_file) / 1e6)
res <- bench::mark(
  rtweet = rtweet_df <- rtweet::parse_stream(big_tweet_stream_path),
  tweetio = tweetio_df <- tweetio::read_tweets(big_tweet_stream_path),
  filter_gc = FALSE, check = FALSE, iterations = 1
)

res[, 1:9]
Not only is tweetio::read_tweets() more efficient in time and memory usage, it's also able to successfully parse much more of the data.
`rownames<-`(
  vapply(list(tweetio_df = tweetio_df, rtweet_df = rtweet_df), dim, integer(2L)),
  c("nrow", "ncol")
)
A common practice for handling social media data at scale is to store them in search engine databases like Elasticsearch, but it's (unfortunately) possible that you'll need to work with data dumps.
I've encountered two flavors of these schemas, which may arrive as GZIP files or ZIP archives.
This has unfortunate consequences, chief among them that {rtweet}'s data frames can't play along with your data.

{tweetio} solves this by parsing everything and building the data frames at the C++ level, including handling GZIP files and ZIP archives for you.
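As a sketch of what that looks like in practice (the dump path below is hypothetical), a compressed dump is passed to read_tweets() as-is, with no manual decompression step:

```r
library(tweetio)

# Hypothetical path to a vendor data dump. Because decompression happens
# at the C++ level, the GZIP (or ZIP) file is handed to read_tweets() directly.
gz_dump <- "dumps/tweets.json.gz"
tweets  <- read_tweets(gz_dump)
```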
If you have {sf} installed, you can use as_tweet_sf() to keep only those tweets that contain valid bounding box polygons or points.
tweet_sf <- as_tweet_sf(tweetio_df)
tweet_sf[, "geometry"]
There are currently four columns that can potentially hold spatial geometries:

- "bbox_coords"
- "quoted_bbox_coords"
- "retweet_bbox_coords"
- "geo_coords"
You can select which one to use to build your sf object by modifying the geom_col= parameter (default: "bbox_coords").
as_tweet_sf(tweetio_df, geom_col = "quoted_bbox_coords")[, "geometry"]
You can also build all the supported bounding boxes by setting geom_col= to "all".
all_bboxes <- as_tweet_sf(tweetio_df, geom_col = "all")
all_bboxes[, c("which_geom", "geometry")]
From there, you can easily use the data like any other {sf} object.
library(ggplot2)

world <- rnaturalearth::ne_countries(returnclass = "sf")
world <- world[world$continent != "Antarctica", ]

ggplot(all_bboxes) +
  geom_sf(fill = "white", color = "lightgray", data = world) +
  geom_sf(aes(fill = which_geom, color = which_geom),
          alpha = 0.15, size = 1, show.legend = TRUE) +
  coord_sf(crs = 3857) +
  scale_fill_viridis_d() +
  scale_color_viridis_d() +
  theme(legend.title = element_blank(),
        legend.position = "top",
        panel.background = element_rect(fill = "#daf3ff"))
If you want to analyze tweet networks and have {igraph} or {network} installed, you can get started immediately using tweetio::as_tweet_igraph() or tweetio::as_tweet_network().
tweet_df <- tweetio_df[1:1e4, ]

as_tweet_igraph(tweet_df)
as_tweet_network(tweet_df)
If you want to take advantage of all the metadata available, you can set all_status_data and/or all_user_data to TRUE.
as_tweet_igraph(tweet_df, all_user_data = TRUE, all_status_data = TRUE)
as_tweet_network(tweet_df, all_user_data = TRUE, all_status_data = TRUE)
You can also build two-mode networks by specifying target_class as "hashtag"s, "url"s, or "media".
- <igraph>s will be set as bipartite following {igraph}'s convention of a logical vertex attribute specifying each partition. Accounts are always TRUE.
- <network>s will be set as bipartite following {network}'s convention of ordering the "actors" first and setting the network-level "bipartite" attribute to the number of "actors". Accounts are always the "actors".
- If bipartite, the returned objects are always set as undirected.
as_tweet_igraph(tweet_df, target_class = "hashtag") as_tweet_network(tweet_df, target_class = "hashtag")
as_tweet_igraph(tweet_df, target_class = "url") as_tweet_network(tweet_df, target_class = "url")
as_tweet_igraph(tweet_df, target_class = "media") as_tweet_network(tweet_df, target_class = "media")
<proto_net>

You're not stuck with going directly to <igraph>s or <network>s, though. Under the hood, as_tweet_igraph() and as_tweet_network() use as_proto_net() to build a <proto_net>, a list of edge and node data frames.
as_proto_net(tweetio_df, all_status_data = TRUE, all_user_data = TRUE)
- {rtweet}-style data frames
- {sf}
- {igraph}
- {network}
The {rtweet} package spoils R users rotten, in the best possible way. The underlying data carpentry is so seamless that the user doesn't need to know anything about the horrors of Twitter data, which is pretty amazing. If you use {rtweet}, you probably owe Michael Kearney some citations.
{tweetio} uses a combination of C++ via {Rcpp}, the rapidjson C++ library (made available by {rapidjsonr}), {jsonify} for an R-level interface to rapidjson, {RcppProgress}, and R's not-so-secret super weapon: {data.table}.
Major inspiration was taken from {ndjson}, particularly its use of Gzstream.
sessionInfo()