redditr is an R package intended to help obtain content from Reddit by interfacing with the Pushshift.io Reddit API.
The immediate scope of this package is to provide functionality for importing Reddit comment and post data into R. Some additional functionality is provided for handling time-based information. Aside from that, redditr will not try to do any general text or data cleaning. Other, far more established packages are better options for handling text data once it’s been imported into R.
Let’s see how this goes…
Install via devtools:
devtools::install_github("geoffwlamb/redditr")
The redditr package’s flagship function, get_reddit_content, takes
Pushshift.io API Search Parameters as arguments and returns a data.frame
with information related to your query. Below are some ideas for how you
can use this function.
If you call get_reddit_content without specifying any of the parameters
the Pushshift API is looking for, you’ll get the 500 most recent comments
available from the API. If you want information relating to posts
instead, you can change the content_type argument to “submission”.
# load redditr
library(redditr)
# get 500 most recent reddit comments available from api
recent_comments <- get_reddit_content()
# get 500 most recent posts
recent_posts <- get_reddit_content(content_type = "submission")
The Pushshift API returns a maximum of 500 results in a single query.
You can use get_reddit_content to automate the process of sending
multiple queries and collecting the results into a single data frame.
Also, if queries are taking too long to return information, you can
adjust the amount of time the function will wait for data before giving
up. The function defaults to 10 seconds.
# get more than 500 comments
many_recent_comments <- get_reddit_content(
  content_type = "comment",
  result_limit = 1000
)
# wait 20 seconds per query
patient_query <- get_reddit_content(
  content_type = "comment",
  timeout = 20
)
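To illustrate the idea of collecting several 500-result queries into one data frame, here is a minimal sketch in base R. It is not redditr's actual implementation: fetch_page is a hypothetical stand-in that generates fake rows locally instead of calling the API, collect_pages is likewise hypothetical, and the paging strategy (moving a before cursor to the oldest timestamp seen so far) is an assumption about how such a loop could work.

```r
# Hypothetical stand-in for a single API query: returns up to `size` rows
# (capped at 500) with timestamps strictly older than `before`.
# It generates fake data locally; no network request is made.
fetch_page <- function(before, size = 500) {
  size <- min(size, 500)
  oldest_available <- 1               # pretend the archive starts at time 1
  newest <- before - 1
  if (newest < oldest_available) {
    return(data.frame(id = character(0), created_utc = numeric(0)))
  }
  times <- seq(from = newest, by = -1, length.out = min(size, newest))
  data.frame(id = paste0("c", times), created_utc = times,
             stringsAsFactors = FALSE)
}

# Collect pages until `result_limit` rows are gathered or a page is empty.
collect_pages <- function(result_limit, start_before) {
  pages <- list()
  n_collected <- 0
  cursor <- start_before
  while (n_collected < result_limit) {
    page <- fetch_page(before = cursor, size = result_limit - n_collected)
    if (nrow(page) == 0) break        # nothing older is left
    pages[[length(pages) + 1]] <- page
    n_collected <- n_collected + nrow(page)
    cursor <- min(page$created_utc)   # next query asks for older rows
  }
  if (length(pages) == 0) {
    return(data.frame(id = character(0), created_utc = numeric(0)))
  }
  do.call(rbind, pages)
}

# 1200 rows need three simulated queries: 500 + 500 + 200
combined <- collect_pages(result_limit = 1200, start_before = 5000)
nrow(combined)
```

The loop stops early when a page comes back empty, so asking for more rows than exist returns whatever was available rather than looping forever.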
This section is a very basic introduction to including Pushshift API
parameters in your function calls. To explore more complex querying with
the Pushshift API, I highly recommend checking out the Pushshift
Documentation.
In theory, get_reddit_content should be able to support any of the
parameters mentioned in the linked documentation. However, that claim
has not been fully verified in practice and only a handful of unit tests
have been written to determine how parameters affect query results. If
you do find any discrepancies between examples in the linked Pushshift
Documentation and function output, please feel free to open an issue.
The q parameter lets you search for specific text within a comment or a
submission. Here are a few ways to use it with a multi-word phrase:
# get comments containing the string "data science"
# note the double quotes inside the single quotes
data_science_comments <- get_reddit_content(
  content_type = "comment",
  q = '"data science"'
)
# get comments containing the string "data" AND the (separate) string "science"
data_and_science_comments <- get_reddit_content(
  content_type = "comment",
  q = "data+science"
)
# get comments containing the string "data" OR the (separate) string "science"
data_or_science_comments <- get_reddit_content(
  content_type = "comment",
  q = "data|science"
)
# get comments containing the string "data" but NOT the string "science"
# based on some light testing, the parentheses are needed on the non-negated part
# "(data)-science" and "(data)-(science)" do the same thing
# "data-(science)" does NOT
data_not_science_comments <- get_reddit_content(
  content_type = "comment",
  q = "(data)-science"
)
# a more complex query: "data science" or "machine learning" without "statistics"
ds_or_ml_without_stats <- get_reddit_content(
  content_type = "comment",
  q = '("data science"|"machine learning")-statistics'
)
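If you build these q strings often, a couple of small helpers can keep the quoting and parentheses straight. The sketch below is only an illustration: quote_phrase and build_query are hypothetical names and are not part of redditr. It reproduces the query syntax shown in the examples above and makes no claims about other operators.

```r
# Hypothetical helpers for composing `q` strings; not part of redditr.

# Wrap a multi-word phrase in double quotes so it is treated as one unit.
quote_phrase <- function(phrase) {
  paste0('"', phrase, '"')
}

# Join alternatives with "|" inside parentheses, then optionally append
# a single excluded term with "-" (parentheses go on the non-negated part).
build_query <- function(any_of, exclude = NULL) {
  included <- paste0("(", paste(any_of, collapse = "|"), ")")
  if (is.null(exclude)) {
    return(included)
  }
  paste0(included, "-", exclude)
}

phrases <- quote_phrase(c("data science", "machine learning"))
build_query(phrases, exclude = "statistics")
# "(\"data science\"|\"machine learning\")-statistics"

build_query("data", exclude = "science")
# "(data)-science"
```

The resulting string can then be passed as the q argument of get_reddit_content.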
There are a few parameters in the Pushshift API that can be used to
filter results based on time. The most common ones are before and after.
These parameters expect dates in a very particular format: Unix time. A
function for converting typical date formats to Unix time is available
as part of redditr: date_to_api.
# get comments before a specific date
comments_before_christmas <- get_reddit_content(
  content_type = "comment",
  before = date_to_api("2018-12-25 00:00:00", tz = "EST")
)
# get comments after a specific date
comments_after_christmas <- get_reddit_content(
  content_type = "comment",
  after = date_to_api("2018-12-25 23:59:59", tz = "EST")
)
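For readers unfamiliar with Unix time, it is the number of seconds since 1970-01-01 00:00:00 UTC. The base R sketch below shows the kind of conversion involved. to_unix_time is a hypothetical helper written for illustration; it is not date_to_api, whose internals may differ.

```r
# Hypothetical helper (not redditr's date_to_api): convert a date-time
# string to Unix time, i.e. seconds since 1970-01-01 00:00:00 UTC.
to_unix_time <- function(x, tz = "UTC") {
  as.numeric(as.POSIXct(x, tz = tz))
}

christmas_utc <- to_unix_time("2018-12-25 00:00:00")
christmas_utc
# 1545696000

# The same wall-clock time in another time zone maps to a different
# instant, which is why the tz argument matters.
christmas_est <- to_unix_time("2018-12-25 00:00:00", tz = "EST")

# Converting back to a readable date is a quick sanity check.
format(
  as.POSIXct(christmas_utc, origin = "1970-01-01", tz = "UTC"),
  "%Y-%m-%d %H:%M:%S"
)
# "2018-12-25 00:00:00"
```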
As in the sections above, the general format for declaring Pushshift API
parameters in get_reddit_content is param = “value”. Please refer to
the Pushshift documentation for the full list of known parameters. Here
are a few more examples with some available parameters that may be of
interest:
# get posts from a specific subreddit
rstats_posts <- get_reddit_content(
  content_type = "submission",
  subreddit = "rstats"
)
# get comments from a specific user
hadley_comments <- get_reddit_content(
  content_type = "comment",
  author = "hadley"
)
# get posts that have received a particular amount of karma
good_posts <- get_reddit_content(
  content_type = "submission",
  score = ">1000"
)
# combine parameters from all of the sections
data_science_posts_on_rstats_before_christmas <- get_reddit_content(
  content_type = "submission",
  result_limit = 100,
  q = "data science",
  subreddit = "rstats",
  before = date_to_api("2018-12-25 00:00:00", tz = "EST")
)
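To give a feel for how param = "value" pairs could end up in an API request, here is a rough sketch that turns named arguments into a URL query string. It is an assumption-laden illustration, not a description of what get_reddit_content does internally: build_query_url is a hypothetical function, and the endpoint layout used for the base URL is assumed rather than taken from this README.

```r
# Hypothetical illustration; not how redditr necessarily builds requests.
build_query_url <- function(content_type = "comment", ...) {
  params <- list(...)
  # Assumed endpoint layout for the Pushshift search API
  base_url <- paste0(
    "https://api.pushshift.io/reddit/search/", content_type, "/"
  )
  if (length(params) == 0) {
    return(base_url)
  }
  # Percent-encode each value so characters like ">" or spaces are URL-safe
  encoded <- vapply(
    params,
    function(value) utils::URLencode(as.character(value), reserved = TRUE),
    character(1)
  )
  paste0(base_url, "?", paste0(names(params), "=", encoded, collapse = "&"))
}

example_url <- build_query_url(
  "submission",
  subreddit = "rstats",
  score = ">1000"
)
example_url
```

Each named argument becomes one name=value pair, which is why any parameter the API understands can, in principle, be passed straight through.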
Hopefully that captures the essence of what this package aims to accomplish. Please feel free to let me know if anything isn’t working well and thanks for checking out redditr!