scrape_tweet_ids: Scrape Tweet IDs from screen

View source: R/screen_scrape_tweets.R

Description

Given a Twitter account screen name or ID and start and end dates, the function screen-scrapes the IDs of historical tweets in that time range and returns them in a data frame. By default, the scraped IDs are also written to disk (write.out = TRUE).

Usage

scrape_tweet_ids(tw.account, remdr, since.date, until.date,
  date.interval = "month", max.tweets.pi = 10000, write.out = TRUE,
  write.out.path,
  write.out.name = sprintf("tw_user_%s_tweet_ids_%s.json", tw.account,
  paste0(since.date, "_to_", until.date)), sleep = 0.5,
  .scroll.sleep = 0.75, verbose = TRUE)
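
As a quick illustration of the default write.out.name, the sprintf() expression from the usage above can be evaluated on its own. The account name and dates below are made-up placeholder values, not taken from the package.

```r
# Placeholder inputs (hypothetical values, for illustration only)
tw.account <- "some_account"
since.date <- "2019-01-01"
until.date <- "2019-03-31"

# Same expression as the write.out.name default in the usage
write.out.name <- sprintf(
  "tw_user_%s_tweet_ids_%s.json",
  tw.account,
  paste0(since.date, "_to_", until.date)
)

print(write.out.name)
```

With these placeholder inputs the file name is tw_user_some_account_tweet_ids_2019-01-01_to_2019-03-31.json.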

Arguments

tw.account

a scalar character vector, specifying a Twitter screen name or account ID

remdr

an active RSelenium remoteDriver object (check remdr$getStatus() to see if the driver is running.)

since.date

creation date of the oldest tweets to get. Only accepts dates in format '%Y-%m-%d' (year-month-day: 'YYYY-mm-dd').

until.date

creation date of the most recent (youngest) tweets to get. Only accepts dates in format '%Y-%m-%d' (year-month-day: 'YYYY-mm-dd').

date.interval

date interval passed to 'by' argument of seq.Date. Defaults to 'month'.

max.tweets.pi

maximum number of tweets to load per interval. Defaults to 10'000. (See the Details section.)

write.out

logical. Write out tweet IDs as JSON to disk? If TRUE (the default), a JSON file named write.out.name will be written to the directory write.out.path. If FALSE, write.out.path and write.out.name will be ignored.

write.out.path

Write-out path (directory in which to write the file of scraped IDs). Will be ignored if write.out = FALSE.

write.out.name

JSON file name. Defaults to 'tw_user_<tw.account>_tweet_ids_<since.date>_to_<until.date>.json'. Will be ignored if write.out = FALSE.

sleep

Seconds to pause between date ranges when iterating over the date intervals defined by since.date, until.date and date.interval. Defaults to .5 seconds.

.scroll.sleep

Seconds to pause between scrolls when scrolling for more tweets. Defaults to .75 seconds. (See section 'Scroll sleep' for details.)

verbose

logical. Print out status messages?
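
To make the interplay of since.date, until.date and date.interval more concrete, here is a minimal sketch of how a date range could be split into intervals with seq.Date. Only the use of date.interval as the 'by' argument of seq.Date is documented above. The way interval ends are derived here (each interval ends where the next one starts, and the last one ends at until.date) is an assumption for illustration and may differ from what the function does internally.

```r
# Example inputs in the required '%Y-%m-%d' format (hypothetical values)
since.date <- as.Date("2019-01-01", format = "%Y-%m-%d")
until.date <- as.Date("2019-04-15", format = "%Y-%m-%d")
date.interval <- "month"

# Interval start dates: date.interval is used as the 'by' argument of seq.Date
starts <- seq(since.date, until.date, by = date.interval)

# Assumption: an interval ends where the next one starts,
# and the last interval ends at until.date
ends <- c(starts[-1], until.date)

intervals <- data.frame(since = starts, until = ends)
print(intervals)
```

Under these assumptions the range is covered by four intervals, and max.tweets.pi would apply to each of them separately.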

Details

Note that the maximum number of tweets loaded per date interval (max.tweets.pi) needs to be adapted to the date interval. Each scroll loads 20 new tweets, and by default there is a pause of .75 seconds between scrolls. This means that waiting for 10'000 tweets to load in a single date interval takes at most ((10000/20) * .75)/60 = 6.25 minutes.
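
The arithmetic above can be wrapped in a small helper to try other settings. This is a sketch only: max_load_minutes is a hypothetical name, not part of the package, and the estimate covers scroll pauses alone. It ignores page load time and the sleep between date ranges.

```r
# Hypothetical helper (not part of the package): upper bound on the
# scroll-pause time per date interval, in minutes
max_load_minutes <- function(max.tweets.pi = 10000,
                             tweets.per.scroll = 20,
                             scroll.sleep = 0.75) {
  # Number of scrolls needed to load max.tweets.pi tweets
  n.scrolls <- ceiling(max.tweets.pi / tweets.per.scroll)
  # Total pause time in seconds, converted to minutes
  (n.scrolls * scroll.sleep) / 60
}

default.minutes <- max_load_minutes()
print(default.minutes)

# A smaller per-interval cap shortens the worst case accordingly
small.minutes <- max_load_minutes(max.tweets.pi = 2000)
print(small.minutes)
```

With the defaults this reproduces the 6.25 minutes stated above. With max.tweets.pi = 2000 the bound drops to 1.25 minutes per interval.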

Value

A tibble data frame. The data frame is empty if an error occurs or if no tweet IDs were scraped in the given time range. Otherwise it has columns 'account' (<chr>), 'since' (<date>), 'until' (<date>) and 'tweet_id' (<chr>), with one row per tweet.
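
The following sketch shows one way to handle the return value. It builds a mock result with the documented columns instead of calling scrape_tweet_ids, so the account name and tweet IDs are invented, and a base data.frame stands in for the tibble to keep the example free of dependencies.

```r
# Mock result with the documented columns (all values are made up)
ids <- data.frame(
  account  = "some_account",
  since    = as.Date(c("2019-01-01", "2019-01-01", "2019-02-01")),
  until    = as.Date(c("2019-02-01", "2019-02-01", "2019-03-01")),
  tweet_id = c("1080000000000000001", "1080000000000000002",
               "1091000000000000003"),
  stringsAsFactors = FALSE
)

if (nrow(ids) == 0) {
  # An empty data frame signals an error or no tweets in the range
  message("No tweet IDs were scraped.")
} else {
  # Count scraped tweet IDs per date interval
  per.interval <- aggregate(tweet_id ~ since + until, data = ids, FUN = length)
  print(per.interval)
}
```

Checking nrow() first covers both documented empty cases, since an error and an empty time range both yield an empty data frame.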

Scroll sleep

Argument .scroll.sleep determines how much time the Twitter timeline has to fully load after each scroll. WARNING: Setting low values (<.75 seconds) risks missing tweet IDs, because the scraping process can abort prematurely when the scroll sleep is too short. The default setting of .75 seconds is a minimum, even with a fast internet connection.

haukelicht/twscrape documentation built on Jan. 29, 2020, 3:23 p.m.