get_reddit_content: Scrape Content from Multiple URLs and Combine into a Single...
In geoffwlamb/redditr: Reddit Content Scraper


This is the flagship function of redditr. It is designed to handle the process of constructing a query, generating URLs that point to content, and importing that content into R. If lower-level control over that workflow is needed, please see construct_pushshift_url and import_reddit_content_from_url.
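To make the "construct a query, generate URLs" step more concrete, here is a minimal sketch of how named query arguments could be turned into a single Pushshift-style URL. The helper name `build_search_url`, the base URL, and the `size` parameter name are assumptions made for illustration; this is not the code of construct_pushshift_url, whose actual signature and behavior may differ.

```r
# Hypothetical helper, not part of redditr: builds a Pushshift-style search URL.
# The base URL and the `size` parameter name are assumptions for illustration.
build_search_url <- function(content_type = "comment", size = 500, ...) {
  params <- c(list(size = size), list(...))
  # Percent-encode each value so spaces and quotes survive in the query string
  encoded <- vapply(
    params,
    function(value) utils::URLencode(as.character(value), reserved = TRUE),
    character(1)
  )
  query <- paste0(names(encoded), "=", encoded, collapse = "&")
  paste0("https://api.pushshift.io/reddit/search/", content_type, "/?", query)
}

url <- build_search_url("comment", size = 100, subreddit = "rstats")
print(url)
```

Under these assumptions, each extra named argument simply becomes one more `name=value` pair in the query string, which is roughly the role the `...` arguments play in get_reddit_content.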

get_reddit_content(content_type = "comment", result_limit = 500, timeout = 10, ...)

`content_type`	A string naming the type of content to query. The Pushshift API supports "comment" and "submission". Defaults to "comment"; the value is passed to construct_pushshift_url.
`result_limit`	An integer giving the maximum number of results to return. Defaults to 500, which is the maximum number of results that a single Pushshift API query URL can return. Keep your available system resources and the potential burden on other servers in mind when deciding how many rows you need.
`timeout`	An integer giving the maximum time, in seconds, to allow for retrieving content from a single URL. Defaults to 10. When `result_limit` is over 500, the timeout resets after every 500 results that have been returned successfully.
`...`	Additional arguments passed to construct_pushshift_url and used to build the API query.
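Because a single query URL returns at most 500 results, a larger `result_limit` has to be served by several queries. The sketch below shows one way such a limit could be split into per-query sizes. The helper `split_result_limit` is hypothetical and only illustrates the arithmetic; it is not how redditr is known to implement it.

```r
# Hypothetical helper: split a requested result_limit into per-query sizes,
# assuming at most 500 results per query as stated in the argument description.
split_result_limit <- function(result_limit, max_per_query = 500) {
  full_batches <- result_limit %/% max_per_query
  remainder <- result_limit %% max_per_query
  sizes <- rep(max_per_query, full_batches)
  if (remainder > 0) sizes <- c(sizes, remainder)
  sizes
}

batch_sizes <- split_result_limit(1200)
print(batch_sizes)
```

If the timeout applies to each batch separately, as the `timeout` description suggests, a request split into three batches with `timeout = 10` could wait up to roughly 30 seconds in total.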

A data.frame containing the content imported for your query.
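When a query spans several URLs, the per-URL results need to end up in one data.frame. A minimal sketch of that combining step in base R follows. The two small data.frames and their column names are made-up stand-ins for illustration, and the real function may combine results differently.

```r
# Hypothetical per-query results, standing in for what each URL might return
batch_one <- data.frame(
  author = c("a", "b"),
  body = c("first", "second"),
  stringsAsFactors = FALSE
)
batch_two <- data.frame(author = "c", body = "third", stringsAsFactors = FALSE)

# Row-bind all batches into a single data.frame
combined <- do.call(rbind, list(batch_one, batch_two))
print(nrow(combined))
```

Note that `rbind` requires every batch to have the same columns, so a real implementation would likely need to handle batches whose fields differ.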

# basic examples ----

# get the 500 most recent reddit comments available from the api
recent_comments <- get_reddit_content()

# get 500 most recent posts
recent_posts <- get_reddit_content(content_type = "submission")

# get more than 500 comments
many_recent_comments <- get_reddit_content(
  content_type = "comment",
  result_limit = 1000
)

# wait longer than default 10 seconds per query
patient_query <- get_reddit_content(
  content_type = "comment",
  timeout = 20
)


# search term examples ----

# get comments containing the string "data science"
# note the double quotes inside the single quotes
data_science_comments <- get_reddit_content(
  content_type = "comment",
  q = '"data science"'
)

# get comments containing the string "data" AND the (separate) string "science"
data_and_science_comments <- get_reddit_content(
  content_type = "comment",
  q = "data+science"
)

# get comments containing the string "data" OR the (separate) string "science"
data_or_science_comments <- get_reddit_content(
  content_type = "comment",
  q = "data|science"
)

# get comments containing the string "data" but NOT the string "science"
# based on some light testing, the parentheses are needed on the non-negated part
# "(data)-science" and "(data)-(science)" do the same thing
# "data-(science)" does NOT
data_not_science_comments <- get_reddit_content(
  content_type = "comment",
  q = "(data)-science"
)

# a more complex query: "data science" or "machine learning" without "statistics"
ds_or_ml_without_stats <- get_reddit_content(
  content_type = "comment",
  q = '("data science"|"machine learning")-statistics'
)


# time-based examples ----

# get comments before a specific date
comments_before_christmas <- get_reddit_content(
  content_type = "comment",
  before = date_to_api("2018-12-25 00:00:00", tz = "EST")
)

# get comments after a specific date
comments_after_christmas <- get_reddit_content(
  content_type = "comment",
  after = date_to_api("2018-12-25 23:59:59", tz = "EST")
)


# other pushshift api parameter examples ----

# get posts from a specific subreddit
rstats_posts <- get_reddit_content(
  content_type = "submission",
  subreddit = "rstats"
)

# get comments from a specific user
hadley_comments <- get_reddit_content(
  content_type = "comment",
  author = "hadley"
)

# get posts that have received a particular amount of karma
good_posts <- get_reddit_content(
  content_type = "submission",
  score = ">1000"
)

# combine parameters
data_science_posts_on_rstats_before_christmas <- get_reddit_content(
  content_type = "submission",
  result_limit = 100,
  q = "data science",
  subreddit = "rstats",
  before = date_to_api("2018-12-25 00:00:00", tz = "EST")
)
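The time-based examples above pass dates through date_to_api before handing them to `before` and `after`. Assuming the API expects Unix epoch seconds, which is an assumption here rather than something this page states, the conversion could look roughly like the base R sketch below. The helper `to_epoch_seconds` is hypothetical and is not the implementation of date_to_api.

```r
# Assumption: the API expects Unix epoch seconds; this is not redditr's code.
to_epoch_seconds <- function(datetime, tz = "UTC") {
  # as.POSIXct interprets the string in the given time zone;
  # as.numeric then yields seconds since 1970-01-01 00:00:00 UTC
  as.numeric(as.POSIXct(datetime, tz = tz))
}

christmas_utc <- to_epoch_seconds("2018-12-25 00:00:00", tz = "UTC")
print(christmas_utc)
```

Passing a different `tz` shifts the resulting epoch value by that zone's UTC offset, which is why the examples specify the time zone explicitly.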