duckplyr: A 'DuckDB'-Backed Version of 'dplyr'

gh-analysis

This directory contains code to analyze GitHub activity data available on BigQuery. The data is a few years old, but we do not expect more recent data to change the outcome of the analysis substantially.

The data has been extracted using several SQL queries, each of which generated an intermediate table stored in the `tinker` dataset. The queries cost about USD 10 to run in total. The dataset contains a much smaller set of sample queries for experimentation.

`tinker.r_repos`

Filter all relevant repositories.

SELECT 
  DISTINCT repo_name
FROM
  `bigquery-public-data.github_repos.languages` l, UNNEST(language)
WHERE
  name = 'R' AND repo_name NOT LIKE 'cran/%'
ORDER BY
  repo_name

`tinker.r_files`

List files from the relevant repositories. It became apparent only after downloading and analyzing the data that the output has duplicate `id` values, possibly from forks, which were not excluded in the previous query. This makes the output dataset larger than necessary but does not increase query costs.

SELECT
  repo_name, path, id
FROM
  `bigquery-public-data.github_repos.files`
WHERE
  (path LIKE '%.R' OR path LIKE '%.r')
  AND repo_name IN (SELECT * FROM tinker.r_repos)
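
A note on the filter above: in standard SQL, `AND` binds more tightly than `OR`, so the repository restriction applies to every extension pattern only when the `OR` alternatives are grouped in parentheses. The following is a minimal sketch of that difference, assuming SQLite (through Python's `sqlite3` module) as a local stand-in for BigQuery. The `files` and `r_repos` tables and their rows are invented for illustration and are not part of the real dataset.

```python
import sqlite3

# Toy stand-ins for the BigQuery tables (hypothetical rows, for illustration only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (repo_name TEXT, path TEXT, id TEXT)")
con.execute("CREATE TABLE r_repos (repo_name TEXT)")
con.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        ("a/keep", "R/x.R", "1"),   # listed repo, upper-case extension
        ("a/keep", "R/y.r", "2"),   # listed repo, lower-case extension
        ("b/other", "z.R", "3"),    # repo that is NOT in r_repos
    ],
)
con.execute("INSERT INTO r_repos VALUES ('a/keep')")

# Note: SQLite's LIKE is case-insensitive for ASCII by default, unlike BigQuery's.
# The toy rows are chosen so that this difference does not affect the result.

# Without parentheses this parses as: A OR (B AND C).
ungrouped = [row[0] for row in con.execute("""
    SELECT id FROM files
    WHERE path LIKE '%.R'
       OR path LIKE '%.r'
      AND repo_name IN (SELECT repo_name FROM r_repos)
    ORDER BY id
""")]

# With parentheses the repository restriction applies to both patterns.
grouped = [row[0] for row in con.execute("""
    SELECT id FROM files
    WHERE (path LIKE '%.R' OR path LIKE '%.r')
      AND repo_name IN (SELECT repo_name FROM r_repos)
    ORDER BY id
""")]

print(ungrouped)  # includes the file from the unlisted repository
print(grouped)    # only files from the listed repository
```

In the ungrouped form, the file from the unlisted repository passes the filter because the first `LIKE` alone is sufficient.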

`tinker.r_contents`

This is the final dataset, available for download from a GitHub release in this repository, uploaded with the piggyback R package.

SELECT
  r_files.repo_name AS repo_name,
  r_files.path AS path,
  r_files.id AS id,
  content,
  binary
FROM
  tinker.r_files
LEFT JOIN
  `bigquery-public-data.github_repos.contents`
USING
  (id)
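
The duplicate `id` values noted for `tinker.r_files` carry through this join, because each row of `tinker.r_files` yields one output row. A minimal sketch of how such duplicates could be inspected, assuming SQLite (through Python's `sqlite3` module) as a local stand-in for BigQuery. The rows below are invented for illustration; this is not a step of the original pipeline.

```python
import sqlite3

# Toy stand-in for tinker.r_files (hypothetical rows, for illustration only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE r_files (repo_name TEXT, path TEXT, id TEXT)")
con.executemany(
    "INSERT INTO r_files VALUES (?, ?, ?)",
    [
        ("a/pkg", "R/utils.R", "abc"),
        ("b/pkg-fork", "R/utils.R", "abc"),  # same content id, e.g. from a fork
        ("a/pkg", "R/main.R", "def"),
    ],
)

# Content ids that occur more than once, with their multiplicity.
duplicated = con.execute("""
    SELECT id, COUNT(*) AS n
    FROM r_files
    GROUP BY id
    HAVING COUNT(*) > 1
    ORDER BY id
""").fetchall()

# One entry per distinct content id, which is enough to look up each file body once.
distinct_ids = [row[0] for row in con.execute(
    "SELECT DISTINCT id FROM r_files ORDER BY id"
)]

print(duplicated)
print(distinct_ids)
```

Joining on the distinct ids instead of the full file list would fetch each file body only once, at the cost of losing the per-repository `repo_name` and `path` columns.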

The scripts are meant to be run from the project root, in succession, like this: