get_reddit_content: Scrape Content from Multiple URLs and Combine into a Single...


Description

This is the flagship function of redditr. It is designed to handle the process of constructing a query, generating URLs that point to content, and importing that content into R. If lower-level control over that workflow is needed, please see construct_pushshift_url and import_reddit_content_from_url.

Usage

get_reddit_content(content_type = "comment", result_limit = 500,
  timeout = 10, ...)

Arguments

content_type

A string containing the type of content you want to query. The pushshift api supports the following options: "comment" and "submission". Defaults to "comment". The value is passed to construct_pushshift_url.

result_limit

An integer representing the maximum number of results to return. Defaults to 500, which is the maximum number of results that can be returned in a single pushshift api query URL. Please keep in mind your available system resources and any potential burden on other servers when determining the number of rows you need.

timeout

An integer representing the maximum amount of time to allow for retrieving content from a single URL. Defaults to 10 seconds. When result_limit is over 500, the timeout resets for every 500 results that have been returned successfully.
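As a rough illustration of how result_limit and timeout interact, the sketch below splits a requested result count into per-query batches, assuming the 500-result cap described above. plan_batches is a hypothetical helper written for this example; it is not part of redditr and may not match how the package pages through results internally.

```r
# Hypothetical helper, not part of redditr: split a requested result count
# into per-query batch sizes, assuming a cap of 500 results per query.
plan_batches <- function(result_limit, max_per_query = 500L) {
  stopifnot(result_limit >= 1, max_per_query >= 1)
  n_full <- result_limit %/% max_per_query    # number of full-size queries
  remainder <- result_limit %% max_per_query  # size of the final partial query
  c(rep(max_per_query, n_full), if (remainder > 0) remainder)
}

# 1200 results would need three queries: 500, 500, 200
plan_batches(1200)
```

Under this reading, a call with result_limit = 1200 and timeout = 10 would allow up to 10 seconds for each of the three queries rather than 10 seconds overall.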

...

Additional arguments to pass to construct_pushshift_url that are used to build the api query.
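To make the role of ... more concrete, here is a minimal sketch of how named arguments could be turned into an api query URL. build_query_url is a hypothetical function, and the endpoint shown is an assumption made for illustration; construct_pushshift_url may build its URLs differently.

```r
# Hypothetical sketch, not redditr's implementation: turn named arguments
# into a URL query string. The endpoint below is an assumption.
build_query_url <- function(content_type = "comment", ...) {
  params <- list(...)
  base_url <- paste0(
    "https://api.pushshift.io/reddit/search/", content_type, "/"
  )
  if (length(params) == 0) return(base_url)
  # Percent-encode each value so it is safe to place in a URL
  encoded <- vapply(
    params,
    function(value) utils::URLencode(as.character(value), reserved = TRUE),
    character(1)
  )
  paste0(base_url, "?", paste0(names(params), "=", encoded, collapse = "&"))
}

build_query_url("submission", subreddit = "rstats", author = "hadley")
```

Each name = value pair passed through ... becomes one key=value entry in the query string, which is why parameter names in the examples below mirror the api's own parameter names.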

Value

A data.frame containing the content returned by your query.

Examples

# basic examples ----

# get 500 most recent reddit comments available from api
recent_comments <- get_reddit_content()

# get 500 most recent posts
recent_posts <- get_reddit_content(content_type = "submission")

# get more than 500 comments
many_recent_comments <- get_reddit_content(
  content_type = "comment",
  result_limit = 1000
)

# wait longer than default 10 seconds per query
patient_query <- get_reddit_content(
  content_type = "comment",
  timeout = 20
)


# search term examples ----

# get comments containing the string "data science"
# note the double quotes inside the single quotes
data_science_comments <- get_reddit_content(
  content_type = "comment",
  q = '"data science"'
)

# get comments containing the string "data" AND the (separate) string "science"
data_and_science_comments <- get_reddit_content(
  content_type = "comment",
  q = "data+science"
)

# get comments containing the string "data" OR the (separate) string "science"
data_or_science_comments <- get_reddit_content(
  content_type = "comment",
  q = "data|science"
)

# get comments containing the string "data" but NOT the string "science"
# based on some light testing, the parentheses are needed on the non-negated part
# "(data)-science" and "(data)-(science)" do the same thing
# "data-(science)" does NOT
data_not_science_comments <- get_reddit_content(
  content_type = "comment",
  q = "(data)-science"
)

# a more complex query: "data science" or "machine learning" without "statistics"
ds_or_ml_without_stats <- get_reddit_content(
  content_type = "comment",
  q = '("data science"|"machine learning")-statistics'
)


# time-based examples ----

# get comments before a specific date
comments_before_christmas <- get_reddit_content(
  content_type = "comment",
  before = date_to_api("2018-12-25 00:00:00", tz = "EST")
)

# get comments after a specific date
comments_after_christmas <- get_reddit_content(
  content_type = "comment",
  after = date_to_api("2018-12-25 23:59:59", tz = "EST")
)


# other pushshift api parameter examples ----

# get posts from a specific subreddit
rstats_posts <- get_reddit_content(
  content_type = "submission",
  subreddit = "rstats"
)

# get comments from a specific user
hadley_comments <- get_reddit_content(
  content_type = "comment",
  author = "hadley"
)

# get posts that have received a particular amount of karma
good_posts <- get_reddit_content(
  content_type = "submission",
  score = ">1000"
)

# combine parameters
data_science_posts_on_rstats_before_christmas <- get_reddit_content(
  content_type = "submission",
  result_limit = 100,
  q = "data science",
  subreddit = "rstats",
  before = date_to_api("2018-12-25 00:00:00", tz = "EST")
)
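
The time-based examples above rely on date_to_api to convert a date string into a value the api understands. As a hedged illustration, and assuming before and after ultimately take Unix epoch seconds, a base-R conversion could look like the sketch below. to_epoch_seconds is a hypothetical stand-in, not the package's function, and date_to_api may behave differently.

```r
# Hypothetical stand-in for illustration only; date_to_api may work differently.
# Converts a "YYYY-MM-DD HH:MM:SS" string to seconds since 1970-01-01 UTC.
to_epoch_seconds <- function(datetime, tz = "UTC") {
  as.numeric(as.POSIXct(datetime, tz = tz))
}

to_epoch_seconds("2018-12-25 00:00:00", tz = "UTC")  # 1545696000
```

The tz argument matters: the same clock time in a zone five hours behind UTC maps to an epoch value 18000 seconds later, so it is worth stating the time zone explicitly when filtering by date.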

geoffwlamb/redditr documentation built on May 15, 2019, 11:41 a.m.