bind_tweets: Bind information stored as JSON files

View source: R/bind_tweets.R

bind_tweetsR Documentation

Bind information stored as JSON files

Description

This function binds information stored as JSON files. The experimental function convert_json converts individual JSON files into either "raw" or "tidy" format.

Usage

bind_tweets(
  data_path,
  user = FALSE,
  verbose = TRUE,
  output_format = NA,
  vars = c("text", "user", "tweet_metrics", "user_metrics", "hashtags", "ext_urls",
    "mentions", "annotations", "context_annotations"),
  quoted_variables = FALSE
)

convert_json(
  data_file,
  output_format = "tidy",
  vars = c("text", "user", "tweet_metrics", "user_metrics", "hashtags", "ext_urls",
    "mentions", "annotations", "context_annotations"),
  quoted_variables = F
)

Arguments

data_path

string, file path to directory of stored tweets data saved as data_id.json and users_id.json

user

If FALSE, this function binds JSON files into a data frame containing tweets; data frame containing user information otherwise. Ignore if output_format is not NA

verbose

If FALSE, messages are suppressed

output_format

[Experimental] string, if it is not NA, this function return an unprocessed data.frame containing either tweets or user information. Currently, this function supports the following format(s)

  • "raw"List of data frames; Note: not all data frames are in Boyce-Codd 3rd Normal Form

  • "tidy"Tidy format; all essential columns are available

  • "tidy2"Tidy format; additional variables (see vars) are available. Untruncates retweet text and adds indicators for retweets, quotes and replies. Automatically drops duplicated tweets. Handling of quoted tweets can be specified (see quoted_variables)

vars

[Experimental] vector of strings, determining the variables provided by the tidy2 format. Can be any (or all) of the following:

  • "text"Text of the tweet, including language classification, indicator of sensitive content and (if applicable) sourcetweet text

  • "user"Information on the user in addition to their ID

  • "tweet_metrics"Tweet metrics, specifically the like, retweet and quote counts

  • "user_metrics"User metrics, specifically their tweet, list, follower and following counts

  • "hashtags"Hashtags contained in the tweet. Untrunctated for retweets

  • "ext_urls"Shortened and expanded URLs contained in the tweet, excluding those internal to Twitter (e.g. retweet URLs). Includes additional data provided by Twitter, such as the unwound URL, their title and description (if available). Untrunctated for retweets

  • "mentions"Mentioned usernames and their IDs, excluding retweeted users. Untrunctated for retweets. Note that quoted users are only mentioned here if explicitly named in the tweet text. This was usually the case with older versions of Twitter, but is no longer the standard behaviour. Extracting mentions allows the usernames of the RT authors (rather than only their ID) to be preserved

  • "annotations"Annotations provided by Twitter, including their probability and type. Basically Named Entities. See https://developer.twitter.com/en/docs/twitter-api/annotations/overview for details

  • "context_annotations"Context annotations provided by Twitter, including additional data on their domains. See https://developer.twitter.com/en/docs/twitter-api/annotations/overview for details

quoted_variables

[Experimental] Should additional vars be returned for the quoted tweet? Defaults to FALSE. TRUE returns additional "_quoted" var-columns containing the vars (mentions, hashtags, etc.) of the quoted tweet in addition to the actual tweet's data

data_file

string, a single file path to a JSON file; or a vector of file paths to JSON files of stored tweets data saved as data_id.json

Details

By default, bind_tweets binds into a data frame containing tweets (from data_id.json files).

If users is TRUE, it binds into a data frame containing user information (from users_id.json).

For the "tidy" and "tidy2" format, parallel processing with furrr is supported. In order to enable parallel processing, workers need to be set manually through future::plan(). See examples

Note that output of the tidy2 vars returns results of the Twitter API, rather than from tweet text. Therefore, certain variables, especially context annotations and quoted_variables, may not be present in older data.

Value

a data.frame containing either tweets or user information

Examples

## Not run: 
# bind json files in the directory "data" into a data frame containing tweets
bind_tweets(data_path = "data/")

# bind json files in the directory "data" into a data frame containing user information
bind_tweets(data_path = "data/", user = TRUE)

# bind json files in the directory "data" into a "tidy" data frame / tibble
bind_tweets(data_path = "data/", user = TRUE, output_format = "tidy")

# bind json files in the directory "data" into a "tidy2" data frame / tibble, get hashtags and
# URLs for both original and quoted tweets
bind_tweets(data_path = "data/", user = TRUE, output_format = "tidy2", 
            vars = c("hashtags", "ext_urls"),
            quoted_variables = T)
            
# bind json files in the directory "data" into a "tidy2" data frame / tibble with parallel computing
## set up a multisession
future::plan("multisession")
## run the function - note that no additional arguments are required
bind_tweets(data_path = "data/", user = TRUE, output_format = "tidy2")
## Shut down parallel workers
future::plan("sequential")            

## End(Not run)

cjbarrie/academictwitteR documentation built on Dec. 20, 2024, 6:03 p.m.