knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  cache = TRUE,
  cache.path = "README-cache/"
)

stackbigquery

stackbigquery is a package wrapping the Stack Overflow database on Google BigQuery.

This is a minimal example of using dbcooper to create a database package.

Installation

You can install the development version of stackbigquery from GitHub with:

devtools::install_github("dgrtwo/stackbigquery")

You'll also need to create a Google Cloud project with BigQuery enabled, and set two environment variables in your .Renviron file (see bigrquery).

BIGQUERY_BILLING_PROJECT=<your_project>
BIGQUERY_EMAIL=<your_email>

The first time you use the package, it may prompt you to authenticate (see the gargle package for more).
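Before loading the package, it can help to confirm that R actually sees both variables. The following is a minimal sketch using only base R; the variable names come from the snippet above, while `required_vars` and `missing_vars` are illustrative names, not part of stackbigquery.

```r
# Names of the environment variables this README asks you to set in .Renviron
required_vars <- c("BIGQUERY_BILLING_PROJECT", "BIGQUERY_EMAIL")

# Sys.getenv() returns "" for any variable that is not set
values <- Sys.getenv(required_vars)
missing_vars <- required_vars[!nzchar(values)]

if (length(missing_vars) > 0) {
  message("Not set: ", paste(missing_vars, collapse = ", "),
          ". Add them to .Renviron and restart R.")
} else {
  message("Both BigQuery environment variables are set.")
}
```

Changes to .Renviron only take effect in a new R session, so restart R after editing the file.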

Examples

Once you've loaded the stackbigquery package, you can use functions prefixed with stack_ to access the database. This includes stack_list() to list the available tables and stack_query() to run a SQL query:

library(dplyr)
library(stackbigquery)

stack_list()
stack_query("SELECT * FROM tags ORDER BY count DESC")

You can also use autocomplete-friendly table accessors:

stack_posts_questions()

These can be used with dbplyr to do joins or summaries.

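# sql("MONTH") passes the BigQuery date part through as literal SQL;
# newer dbplyr versions may reject a bare MONTH symbol as an unknown object
by_month <- stack_posts_questions() %>%
  group_by(month = DATE_TRUNC(DATE(creation_date), sql("MONTH"))) %>%
  summarize(n_questions = n(),
            avg_score = mean(score),
            avg_answers = mean(answer_count)) %>%
  collect()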

by_month

# Plot the average question score over time
library(ggplot2)
theme_set(theme_light())

by_month %>%
  filter(n_questions >= 100) %>%
  ggplot(aes(month, avg_score)) +
  geom_line() +
  labs(y = "Average score of Stack Overflow questions")
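The monthly summary works because dbplyr translates the dplyr verbs to SQL and passes functions it does not recognize, such as DATE_TRUNC and DATE, through to the database unchanged. The sketch below shows this without a BigQuery connection, using dbplyr's `lazy_frame()` as a stand-in table. The `questions` object and its three columns are assumptions made for illustration, and the simulated generic backend may quote identifiers differently from the real BigQuery backend.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for stack_posts_questions(): a lazy table with only
# the columns the pipeline uses. Nothing is sent to a database.
questions <- lazy_frame(
  creation_date = Sys.time(),
  score = 1L,
  answer_count = 1L
)

monthly_sql <- questions %>%
  # DATE_TRUNC() and DATE() are unknown to dbplyr, so they are emitted as-is;
  # sql("MONTH") inserts the date part as literal SQL
  group_by(month = DATE_TRUNC(DATE(creation_date), sql("MONTH"))) %>%
  summarize(n_questions = n(),
            avg_score = mean(score, na.rm = TRUE),
            avg_answers = mean(answer_count, na.rm = TRUE)) %>%
  sql_render()

# Print the generated SQL instead of running it
monthly_sql
```

On a real table, calling `show_query()` in place of `collect()` prints the SQL that would be sent to BigQuery, which is a cheap way to check a query before paying to run it.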

Summarize tags

As a database-specific package, stackbigquery also offers useful verbs for doing common operations on the data.

For instance, summarize_tags takes a (potentially grouped) version of stack_posts_questions, joins it to the tags table, and aggregates the frequency by tag.

# sql("MONTH") passes the BigQuery date part through as literal SQL
by_month_tag <- stack_posts_questions() %>%
  group_by(month = DATE_TRUNC(DATE(creation_date), sql("MONTH"))) %>%
  summarize_tags(c("javascript", "java", "python", "c#", "php", "c++"))

by_month_tag

# Plot each tag's share of questions over time
library(ggplot2)
library(forcats)

by_month_tag %>%
  filter(month != max(month),
         month != min(month)) %>%
  arrange(month) %>%
  mutate(tag = fct_reorder(tag, -percent, last)) %>%
  ggplot(aes(month, percent, color = tag)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  expand_limits(y = 0) +
  labs(x = "Time",
       y = "% of Stack Overflow questions")
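To make the aggregation concrete, here is a local sketch of the kind of join-and-count that summarize_tags is described as performing, written with plain dplyr on toy data. It is not the package's implementation: the `questions` and `question_tags` tables, their contents, and the `number` and `n_total` column names are all made up for illustration. Only the `tag` and `percent` columns are taken from the plotting code above.

```r
library(dplyr)

# Toy stand-in for a grouped questions table (one row per question)
questions <- tibble(
  id    = 1:6,
  month = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01",
                    "2020-02-01", "2020-02-01", "2020-02-01"))
)

# Toy stand-in for a tags table (one row per question/tag pair)
question_tags <- tibble(
  question_id = c(1L, 1L, 2L, 3L, 4L, 4L, 5L, 6L, 6L),
  tag = c("python", "pandas", "java", "python", "javascript",
          "html", "python", "java", "spring")
)

tags_of_interest <- c("python", "java")

# Denominator: total number of questions in each month
totals <- questions %>%
  count(month, name = "n_total")

by_month_tag_local <- questions %>%
  inner_join(question_tags, by = c("id" = "question_id")) %>%
  filter(tag %in% tags_of_interest) %>%
  # Numerator: questions carrying each tag in each month
  count(month, tag, name = "number") %>%
  inner_join(totals, by = "month") %>%
  mutate(percent = number / n_total)

by_month_tag_local
```

Dividing by the total number of questions in the group, not by the number of tagged rows, is what lets the percentages be read as "share of all questions asked that month".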

Code of Conduct

Please note that the 'stackbigquery' project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.



dgrtwo/stackbigquery documentation built on Dec. 19, 2021, 11:06 p.m.