README.md

redditr

Travis build
status Coverage
status

Overview

redditr is an R package intended to help obtain content from Reddit by interfacing with the Pushshift.io Reddit API.

The immediate scope of this package is to provide functionality for importing Reddit comment and post data into R. Some additional functionality is provided for handling time-based information. Aside from that, redditr will not try to do any general text or data cleaning. Other, far more established packages are better options for handling text data once it’s been imported into R.

Let’s see how this goes…

Installation

Install via devtools:

devtools::install_github("geoffwlamb/redditr")

Examples

The redditr package’s flagship function, get_reddit_content, takes Pushshift.io API Search Parameters as arguments and returns a data.frame with information related your query. Below are some ideas for how you can use this function.

Basic Usage

If you call get_reddit_content without specifying any of the parameters the Pushshift API is looking for, you’ll end up getting the 500 most recent comments available from the api. If you want information relating to posts instead, you can change the content_type argument to “submission”.

# load redditr
library(redditr)

# get 500 most recent reddit comments avilable from api
recent_comments <- get_reddit_content()

# get 500 most recent posts
recent_posts <- get_reddit_content(content_type = "submission")

Adjusting Limits to Returned Results and Query Time

The Pushshift API limits returns a maximum of 500 results in a single query. You can use get_reddit_content to automate the process of sending multiple queries and collecting them into a single data frame. Also, if you are having issues with queries taking too long to return information, you can adjust the amount of time the function will wait for data before giving up. The function defaults to 10 seconds.

# get more than 500 comments
many_recent_comments <- get_reddit_content(
  content_type = "comment", 
  result_limit = 1000
)

# wait 20 seconds per query
patient_query <- get_reddit_content(
  content_type = "comment",
  timeout = 20
)

Using Pushshift API Parameters

This section is a very basic introduction to including Pushshift API parameters in your function calls. To explore more complex querying with the Pushshift API, I highly recommend checking out the Pushshift Documentation. In theory, get_reddit_content should be able to support any of the parameters mentioned in the linked documentation. However, that claim has not been fully verified in practice and only a handful of unit tests have been written to determine how parameters affect query results. If you do find any discrepancies between examples in the linked Pushshift Documentation and function output, please feel free to open an issue.

Using the Search (q) Parameter

The q parameter lets you search for specific text within a comment or a submission. Here are some use cases for how you might use it with a multi-word phrase in different ways:

# get comments containing the string "data science"
# note the double quotes inside the single quotes
data_science_comments <- get_reddit_content(
  content_type = "comment",
  q = '"data science"'
)

# get comments containing the string "data" AND the (separate) string "science"
data_and_science_comments <- get_reddit_content(
  content_type = "comment",
  q = "data+science"
)

# get comments containing the string "data" OR the (separate) string "science"
data_or_science_comments <- get_reddit_content(
  content_type = "comment",
  q = "data|science"
)

# get comments containing the string "data" but NOT the string "science"
# based on some light testing, the parentheses are needed on the non-negated part
# "(data)-science" and "(data)-(science)" do the same thing
# "data-(science)" does NOT
data_not_science_comments <- get_reddit_content(
  content_type = "comment",
  q = "(data)-science"
)

# a more complex query: "data science" or "macine learning" without "statistics"
ds_or_ml_without_stats <- get_reddit_content(
  content_type = "comment",
  q = '("data science"|"machine learning")-statistics'
)

Specifying Dates in API Queries

There are a few parameters in the Pushshift API that can be used to filter results based on time. The most common ones are before and after. These parameters are expecting dates to be provided in a very particular format: Unix Time. A function for converting typical date formats to Unix time is available as part of redditr: date_to_api.

# get comments before a specific date
comments_before_christmas <- get_reddit_content(
  content_type = "comment",
  before = date_to_api("2018-12-25 00:00:00", tz = "EST")
)

# get comments after a specific date 
comments_after_christmas <- get_reddit_content(
  content_type = "comment",
  after = date_to_api("2018-12-25 23:59:59", tz = "EST")
)

Using Other Pushshift API Parameters

Similar to above, the general format for declaring Pushshift API parameters in get_reddit_content is param = “value”. Please refer to Pushshift documentation for the full list of known parameters. Here are a few more examples with some available parameters that may be of interest:

# get posts from a specific subreddit
rstats_posts <- get_reddit_content(
  content_type = "submission",
  subreddit = "rstats"
)

# get comments from a specific user
hadley_comments <- get_reddit_content(
  content_type = "comment",
  author = "hadley"
)

# get posts that have received a particular amount of karma
good_posts <- get_reddit_content(
  content_type = "submission",
  score = ">1000"
)

#  combine parameters from all of the sections
data_science_posts_on_rstats_before_christmas <- get_reddit_content(
  content_type = "submission",
  result_limit = 100,
  q = "data science",
  subreddit = "rstats",
  before = date_to_api("2018-12-25 00:00:00", tz = "EST")
)

Hopefully that captures the essence of what this package aims to accomplish. Please feel free to let me know if anything isn’t working well and thanks for checking out redditr!



geoffwlamb/redditr documentation built on May 15, 2019, 11:41 a.m.